Don Xml's Grok This

The home of Don Demsak
DonXml's All Things Techie

Heads Down and Consumed By Full Text Indexing

Yes, it has been awfully quiet here since I came back from TechEd.  I’ve noticed my own silence, and I’d love to blog more, but I’ve been totally consumed by some research I’m doing in the Full Text Indexing realm.  I’m a latecomer to FTS, so I’m working hard at coming up to speed (as best I can) on the subject.  It started because of my current client, and the indexing of semi-structured documents that I’m doing for them.  Because they are trying to index concepts rather than keywords, and the work is based on text data mining, this isn’t your typical project.  Basically, they have a known set of concepts that they wish to index on.  In their current “system” they have people (aka indexers) read every document and add references (metadata) between the document and their concepts.  This way the documents can be “searched” by concept and not by literals found within the text.  What we are doing is trying to augment the process with as much automation as possible.
The first version of automation is the simplest one: come up with a list of literal strings that can be found in the document text, and map them to the appropriate concept.  Since they already have a pre-existing system (that includes a list of all the concepts), I tried my hand at generating the literal-to-concept index.  The problem is that English words have morphological and inflectional endings that you either have to remove from the source text, or add onto the concept text (if you want to build an index based on the literal).  I’ve found the Porter Stemming Algorithm, which is a standard algorithm used for this very purpose.  But a concept can (and usually does) contain more than one word, so not all variations of a word make sense when combined with the variations of the other words.  And, to make things even more difficult, the majority of the text in the documents they are indexing is science oriented, so the stemming algorithm isn’t very useful for the scientific words.  Even worse, they also need to index chemical compounds, and for some reason there is no standard way of naming a chemical compound.

Example: caffeine
Chemical Formula: C8H10N4O2 (it can also be expressed as C8 H10 N4 O2 or C8 N4 O2)
Chemical Names:  1,3,7-trimethylxanthine
   1,3,7-Trimethyl-2,6-dioxopurine
   3,7-Dihydro-1,3,7-trimethyl-1H-purine-2,6-dione
   7-Methyltheophylline
   1H-Purine-2,6-dione, 3,7-dihydro-1,3,7-trimethyl- (9CI)
   1-methyltheobromine
   1,3,7-trimethyl-Xanthine
Try mining a text document (or the internet) for the concept of “caffeine”, especially when people are purposely trying to make it hard for you to mine this information.
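
To make the caffeine problem a bit more concrete, here is a minimal Python sketch of a synonym table that maps the literal names listed above to a single concept.  The names and the formula come from the example; the function names, the dictionary layout, and the normalization rule (lowercase, then drop whitespace and hyphens) are my own assumptions for illustration, not what the client system does.

```python
import re
from typing import Dict, List, Optional

# Literals taken from the caffeine example; everything else is hypothetical.
CAFFEINE_LITERALS = [
    "caffeine",
    "C8H10N4O2",
    "1,3,7-trimethylxanthine",
    "1,3,7-Trimethyl-2,6-dioxopurine",
    "3,7-Dihydro-1,3,7-trimethyl-1H-purine-2,6-dione",
    "7-Methyltheophylline",
    "1-methyltheobromine",
    "1,3,7-trimethyl-Xanthine",
]


def normalize(literal: str) -> str:
    """Lowercase and drop whitespace and hyphens so spelling variants collide."""
    return re.sub(r"[\s\-]+", "", literal.lower())


def build_index(concepts: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized literal to the concept it belongs to."""
    index = {}
    for concept, literals in concepts.items():
        for literal in literals:
            index[normalize(literal)] = concept
    return index


INDEX = build_index({"caffeine": CAFFEINE_LITERALS})


def lookup(candidate: str) -> Optional[str]:
    """Return the concept for a candidate string, or None if it is unknown."""
    return INDEX.get(normalize(candidate))
```

With this, `lookup("C8 H10 N4 O2")` and `lookup("1,3,7-Trimethylxanthine")` both land on the same concept.  The catch is that the table only knows the names someone has already typed in, and stripping hyphens this aggressively could make two genuinely different compound names collide, so treat it as a starting point only.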

It is definitely not your typical Google or Lucene type system. 
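
For the plain-English side of the problem, the literal-to-concept mapping with stemming can be sketched roughly like this.  The suffix stripper below is a deliberately crude stand-in and is NOT the Porter Stemming Algorithm; the concept list and all function names are made up for the example.  The idea is simply to stem both the document text and the concept phrases, then look for the stemmed phrase as a run of consecutive tokens.

```python
import re
from typing import Dict, List

# Deliberately tiny suffix list; a real system would use a proper stemmer.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical concepts, each with the phrases that should map to it.
CONCEPTS = {
    "protein folding": ["protein folding"],
    "cell membrane": ["cell membrane"],
}


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_tokens(text: str) -> List[str]:
    """Split text into alphabetic words and stem each one."""
    return [crude_stem(token) for token in re.findall(r"[A-Za-z]+", text)]


def find_concepts(text: str, concepts: Dict[str, List[str]]) -> List[str]:
    """Return the concepts whose stemmed phrase appears in the stemmed text."""
    doc = stem_tokens(text)
    found = []
    for concept, phrases in concepts.items():
        for phrase in phrases:
            key = stem_tokens(phrase)
            n = len(key)
            if any(doc[i:i + n] == key for i in range(len(doc) - n + 1)):
                found.append(concept)
                break
    return found
```

Calling `find_concepts("The proteins folded slowly near cell membranes.", CONCEPTS)` finds both concepts, because "proteins folded" and "protein folding" stem to the same two tokens.  That is also the weakness: stemming every word of a multi-word concept accepts combinations that may not mean the same thing, and a stemmer built for everyday English does little for scientific vocabulary.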

But, there is hope.  There are organizations like the Open Bioinformatics Foundation, which encourages open source solutions for mining biological text (things like proteins and amino acids), and the National Center for Biotechnology Information.  I even found GeneWays, which is “A System for Mining Text and for Integrating Data on Molecular Pathways”.  This is all doctorate-level material, which can be a bit heady for a guy who never finished his bachelor’s degree.

The paper I’ve been using the most to help develop this indexing system is Analysis of Biomedical Text for Chemical Names: A Comparison of Three Methods, which has lots of great information.  The problem is that most of the technology used to accomplish the text mining is based on Natural Language Systems, which aren’t something the average developer can support.  I’ve been working on building my own tokenization algorithms (for the .Net port of Lucene, dotLucene), along with studying Bayesian Analysis and Support Vector Machines (libsvm is great and there is even a C# port).
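
As a rough idea of why a custom tokenizer matters here, the Python sketch below keeps a chemical name such as 1,3,7-trimethylxanthine together as one token instead of splitting it at every comma and hyphen.  This is not the dotLucene code I am working on; the regular expression and the function name are assumptions made up for the example.

```python
import re
from typing import List

# A token is a run of letters/digits that may contain single commas or hyphens
# between further runs, so "1,3,7-trimethylxanthine" survives as one token.
# A comma or hyphen followed by a space still ends the token.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[,\-][A-Za-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Split text into tokens, keeping comma/hyphen-joined names intact."""
    return TOKEN.findall(text)
```

For example, `tokenize("Coffee contains 1,3,7-trimethylxanthine, also called caffeine.")` keeps the compound name whole, while a plain word-splitting tokenizer would break it into "1", "3", "7" and "trimethylxanthine".  The trade-off is that ordinary prose with a missing space after a comma gets glued together too, and names that contain a comma followed by a space still get split.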

More than enough to keep me too busy to blog.  But don’t worry.  I’ve got a new open source project that I’m working on as my summer project.  Can’t say what it is yet, but it involves this stuff and SQL Server 2005.  I’m still trying to get it all running, but once it sort of works, I’ll let everyone know.

Published Friday, July 08, 2005 10:13 PM by donxml

Comments

Scott said:

Welcome to my world Don ;)

I spent all of my college years learning different ways to name chemical compounds. There are a billion and that doesn't include all of the really old methods that some chemists still use ("Muriatic acid" instead of HCl).

I sympathize, we're starting to index and codify our pathologists' dictations. Those folks are slippery, they don't want to get pinned down as having said anything. The only thing we've found that really, really works is human review of the parsed data.
July 9, 2005 1:35 AM

Louis Davidson said:

Wow that sounds like a really cool system! Makes the data warehouse I am working on seem kinda boring :)
July 9, 2005 2:14 AM

About donxml

I’m an independent consultant, specializing in .Net solutions architecture, based out of New Jersey, who also doubles as an evangelist for XML, Domain Driven Design, enterprise architecture and .Net. I do not work for Microsoft, the W3C or any other big company that you may know of (at least not yet). I’ve been an indie for over ten years, and although I’ve been tempted a couple of times to take a job with companies like Microsoft, I haven’t found something better than my current situation. I work mostly with the large pharmaceuticals that are based here in New Jersey, and usually find myself on long term contracts. Definitely not the prototypical indie consultant, but it lets me dedicate time to my non-income generating activities like the developer community stuff, plus financing open source projects like XPathmania and MVP-XML. If you would like to talk to me about doing some contract work, just contact me via the contact page. My rates vary widely, depending on lots of different variables, but mostly distance from Jersey, and type of work. Plus, I’ve been known to donate some of my code for various projects.