How To Tell Stuff To A Computer

The Guys In The Garage and the UMLS

Now that we've seen some of the obstacles encountered by The Scientist in solving the knowledge representation conundrum, let's look at what The Guys in the Garage have done to try to tackle this problem. Remember that the Guy in the Garage is interested mainly in solving very specific, current problems: building functioning software for existing customers, now. One obvious strategy that helps make practical KR systems easier to develop is to limit the domain of the system to a specific industry. And one industry is currently, by far, the most active in fostering the development of practical knowledge systems: the medical industry.

Due to its empirical nature, medicine is an incredible tangle of unstructured and highly complex information. Much of the advanced knowledge contained in medical information systems is still very proprietary, with many software companies specializing in software for specific medical specialties. After all, the Guy in the Garage knows that "standards" for such information are just not very practical at this time.

Of course, many standards have been established in medicine. After all, there is a substantial amount of medical informatics research taking place at academic centers. In fact, some of these standards (SNOMED-CT, for instance) are very similar in philosophy to the semantic web. However, some of the solutions that have been most heavily adopted are still very pragmatic at their core (HL7, ICD-9, CPT, etc.). Of these more pragmatic systems, the most interesting in terms of its impact on the future of KR systems is arguably the UMLS.

One of the big long-term goals of medical informatics is to develop a Universal Medical Record for all patients: basically, a computer file a person can take with them wherever they go. In a perfect world, any software a particular hospital uses would be able to understand this record because it is written in some kind of universally understood medical language, which is a difficult challenge in knowledge representation. The UMLS is an early step in making such a system possible. The UMLS is basically an epic attempt to unify all medical vocabularies into a single hierarchy, like a kind of super-thesaurus. Surprisingly, most current computerized vocabularies use completely incompatible ways of classifying diseases. One vocabulary might organize certain diseases as acute or chronic. Another might organize a disease by its location in the body, or file it underneath another disease with which it is associated. There is no uniform way (yet) of deciding how to classify such things.

Take, for example, Addison's disease (caused when the adrenal glands do not produce enough cortisol). Below is a list of some common medical vocabularies and how they classify Addison's disease:

SNOMED: 
   "Addison's Disease" is a member of 
   "Diseases of the Adrenal Glands" is a member of
   "Diseases of the endocrine system" is a member of
   "Diseases/Diagnoses"

ICD-10:
   "Primary adrenocortical insufficiency" is a member of
   "Other disorders of adrenal gland" is a member of
   "Disorders of other endocrine gland"

MeSH:
   "Addison's Disease" is a member of
   "Adrenal Gland Hypofunction" is a member of
   "Adrenal Gland Diseases" is a member of
   "Endocrine Diseases" is a member of
   "Diseases"

This example on Addison's disease is borrowed from the presentation "The Unified Medical Language System (UMLS): What Is It and How to Use It?" by Olivier Bodenreider, Jan Willis, and William Hole.
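To make the mismatch concrete, here is a minimal Python sketch that stores the three classification paths listed above as plain lists and compares them. The function names (depths, shared_labels) are invented for this illustration; nothing here is part of any real vocabulary's software.

```python
# Each path runs from the most specific category to the most general one,
# copied from the SNOMED / ICD-10 / MeSH listing in the text.
HIERARCHIES = {
    "SNOMED": [
        "Addison's Disease",
        "Diseases of the Adrenal Glands",
        "Diseases of the endocrine system",
        "Diseases/Diagnoses",
    ],
    "ICD-10": [
        "Primary adrenocortical insufficiency",
        "Other disorders of adrenal gland",
        "Disorders of other endocrine gland",
    ],
    "MeSH": [
        "Addison's Disease",
        "Adrenal Gland Hypofunction",
        "Adrenal Gland Diseases",
        "Endocrine Diseases",
        "Diseases",
    ],
}


def depths(hierarchies):
    """Return how many levels each vocabulary uses for the same disease."""
    return {name: len(path) for name, path in hierarchies.items()}


def shared_labels(hierarchies):
    """Return the category names that appear verbatim in every vocabulary."""
    label_sets = [set(path) for path in hierarchies.values()]
    return set.intersection(*label_sets)


if __name__ == "__main__":
    print(depths(HIERARCHIES))
    print(shared_labels(HIERARCHIES))
```

Comparing by exact label finds nothing in common across all three paths, even though a human reader can see they describe the same disease. That gap is what a cross-vocabulary thesaurus has to bridge.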

In order to cope with the incredible chaos these competing classification schemes bring into play, the UMLS forgoes any attempt to organize information in a logically rigorous fashion, focusing instead on being comprehensive and practical, à la Guys in the Garage. It is basically just a hodgepodge of source vocabularies without any kind of true description logic or precise ontology at its core, by design. However, it is still an incredible accomplishment, since it ties together over a hundred distinct source vocabularies and contains many millions of terms, which it is able to map between the disparate sources. Even more remarkably, it is a true superset of many of its sources: a complete UMLS database (many gigs in size) can be algorithmically manipulated to generate a byte-for-byte exact duplicate of many of its source databases.

The UMLS consists of two main parts: the Metathesaurus and the Semantic Network.

The UMLS Metathesaurus

This is basically the granddaddy of medical thesauruses, and it is the core of the UMLS. It has been generated by processing all of the source vocabularies of the UMLS, using a combination of automated processing software and extensive hand-editing by human editors. All the vocabulary is organized into Concepts, Terms, Strings, and Atoms.

If the processing software determines that a vocabulary name has not yet been seen in another vocabulary, a human editor will try to determine whether it is just a different name for something already in the database or something completely new. If it is new, then a new Concept is created in the UMLS and given a unique concept id (called a CUI). So, for instance, somewhere in the UMLS there would be a concept of "Headache".

If it is just a completely different name for a known concept, it is linked to that concept's CUI and is declared a new Linguistic Term for that concept. So, for instance, "Cephalgia" is a different Linguistic Term linked to headache. Each is given a unique linguistic identifier (called an LUI).

If it is just a minor textual variant of a known term, then it is just declared a new String and given a unique string id (called an SUI). So, for instance, capitalized "Headache" would be a different string from lowercase "headache".

Finally, since the thesaurus carefully tracks where each vocabulary item originates, each appearance of a vocabulary item is assigned a unique Atom (referenced by its AUI), which is a location in the database that stores detailed information about where this appearance of the word was found.
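The four levels described above (Concept, Term, String, Atom) can be sketched as a small Python class. This is only a toy model under stated assumptions: the class name ToyMetathesaurus, the identifier format, and the source names VOCAB_A, VOCAB_B, and VOCAB_C are all made up; the human editor's judgment is replaced by the caller passing the concept explicitly; and "minor textual variant" is approximated by lowercasing, which is far cruder than whatever the real UMLS tooling does.

```python
class ToyMetathesaurus:
    """Toy index handing out CUI > LUI > SUI > AUI ids (hypothetical format)."""

    def __init__(self):
        self.cuis = {}   # concept key             -> CUI
        self.luis = {}   # (CUI, normalized name)  -> LUI
        self.suis = {}   # (LUI, exact string)     -> SUI
        self.atoms = []  # one (AUI, SUI, LUI, CUI, source) row per appearance

    @staticmethod
    def _id_for(prefix, table, key):
        # Reuse the id if this key was seen before, otherwise mint a new one.
        if key not in table:
            table[key] = f"{prefix}{len(table) + 1:04d}"
        return table[key]

    def add(self, string, concept, source):
        """Record one appearance of `string` in `source`.

        `concept` stands in for the human editor's decision about which
        concept the string names (an assumption of this sketch).
        """
        cui = self._id_for("C", self.cuis, concept)
        # Assumption: strings that differ only in case belong to the same term.
        lui = self._id_for("L", self.luis, (cui, string.lower()))
        sui = self._id_for("S", self.suis, (lui, string))
        # Every appearance gets its own atom, even for a known string.
        aui = f"A{len(self.atoms) + 1:04d}"
        self.atoms.append((aui, sui, lui, cui, source))
        return aui, sui, lui, cui


if __name__ == "__main__":
    m = ToyMetathesaurus()
    print(m.add("Headache", "headache", "VOCAB_A"))
    print(m.add("headache", "headache", "VOCAB_B"))   # new string, same term
    print(m.add("Cephalgia", "headache", "VOCAB_B"))  # new term, same concept
    print(m.add("Headache", "headache", "VOCAB_C"))   # known string, new atom
```

Because every atom row records its source, filtering the atoms by source recovers each vocabulary's own word list, which hints at how a database organized this way can remain a superset of its sources.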

The Semantic Network

Although no rigorous logical framework exists inside the UMLS, there is a simplistic "semantic network" that allows one to determine the basic relationship between two related concepts. Again, this is just a pragmatic add-on to the main database that is not guaranteed to give any kind of scientifically rigorous answers. It consists of a web of fewer than 100 common abstract ideas. For instance, it contains the idea that a disease may be found in an organ:

"Body Part, Organ, or Organ Component" --location-of--> "Disease or Syndrome"

... Using data in the UMLS, one can find linkages to this idea from Addison's disease, for instance. This then makes it possible to learn that the Adrenal Cortex is a part of an organ and that Addison's disease is a disease that is located in that organ. But, again, only in a somewhat haphazard fashion.
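The kind of lookup described here can be sketched in a few lines of Python. The two dictionaries are hand-filled with only the facts mentioned in the text, and the names SEMANTIC_TYPE, NETWORK, and possible_relation are invented for this illustration; they do not reflect how the UMLS stores or queries this data.

```python
# Which abstract idea each concept is linked to (only the facts from the text).
SEMANTIC_TYPE = {
    "Adrenal Cortex": "Body Part, Organ, or Organ Component",
    "Addison's Disease": "Disease or Syndrome",
}

# Edges between abstract ideas, keyed by (subject idea, object idea).
NETWORK = {
    ("Body Part, Organ, or Organ Component", "Disease or Syndrome"): "location-of",
}


def possible_relation(subject, obj):
    """Return the relation the network allows between two concepts, or None.

    A hit only means the relation is plausible for concepts of these kinds;
    it does not prove the relation holds for this particular pair.
    """
    try:
        key = (SEMANTIC_TYPE[subject], SEMANTIC_TYPE[obj])
    except KeyError:
        return None  # at least one concept is unknown to this toy table
    return NETWORK.get(key)


if __name__ == "__main__":
    print(possible_relation("Adrenal Cortex", "Addison's Disease"))
    print(possible_relation("Addison's Disease", "Adrenal Cortex"))
```

Note that the lookup works at the level of abstract ideas, so it would give the same answer for any organ paired with any disease. That is one way to see why the answers are pragmatic rather than rigorous.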

The UMLS and the Future of Medical Software

Although it is hard to know what knowledge representation in the medical software of the future will look like, it will most likely involve some kind of system in which doctors and other domain experts (the Writers) can enter scientifically rigorous information directly into a software application. Such a software system would need an extensive understanding of differing medical vocabularies in order to cope with all entered data, even if it is entered in unexpected formats. Having a super medical thesaurus, like the UMLS, will be a key step in making such a system possible.
