

Edging Towards The Center

If we look back at the triangle I suggested in the introduction of this primer, we see that the most challenging, but also most interesting, type of knowledge system would require collaboration between all three major types of knowledge architects: it would absorb the pragmatic utility of the guys in the garage, have the strong theoretical underpinnings of the scientists, and yet still remain accessible to writers who can contribute information from a wide variety of domains.

In order to build such a system, one would need to take information entered by the many writers and somehow "mark up" this information in a scientifically sophisticated fashion, allowing a computer system to manipulate the data in an abstract manner. Then, the various bits of information would need to be stitched together so that a person can browse and query the information in flexible ways that may differ from how the authors originally intended the information to be used. In essence, we would want to infuse our information with "science" in a way that allows us to leverage the entered data and maximize its value.



There are three main approaches that I can see have been attempted to address this problem. These are shown near the center of the knowledge triangle:

The most ambitious approach involves taking a strongly scientific mindset, as exemplified by RDF and the Semantic Web. The second involves a more pragmatic, guy-in-the-garage approach, exemplified by the UMLS (the Unified Medical Language System, a system that tries to organize medical information). The third method for achieving this type of system involves sacrificing both scientific precision and simplicity in order to minimize the effort required by the writer. This third approach is exemplified by the advanced search technologies currently being developed by companies such as Google, Microsoft and Yahoo. Of course, a certain amount of creative tweaking is necessary to fit these ideas so neatly into such a simple graphic, but it allows us to capture some simple truths about these different methodologies. Let's look at each of these in more detail...

RDF and The Semantic Web


The Semantic Web is a vision outlined by Tim Berners-Lee, the principal designer of most of the technology that underlies the World Wide Web and hence a major enabler of what I have called the second computer revolution. He developed this idea in response to his dissatisfaction with the unstructured nature of the way information is currently generated for the internet. His vision of the future appears similar to the one described in this primer.

At its core, the Semantic Web can be thought of as a methodology for linking up pieces of structured and unstructured information into commonly shared description logics ontologies. For instance, suppose that I owned a sandwich shop and had my menu online, and that the menu happened to include the same grilled chicken sandwich we discussed previously in our chapter about description logics. If we wanted to, we could take the ontology for sandwiches we had set up earlier and publish it online (in a special format), then link our menu to that published ontology, also in a special format. This special format is called RDF, which is used both to describe the items in our description logic and, within the menu, to create the actual "link".

However, creating our own ontology is not actually necessary: if an ontology is already available that describes a grilled chicken sandwich, then we can link directly to that sandwich instead of creating our own. For instance, after searching online for RDF ontologies that cover sandwiches, I was able to find wine.rdf on the www.w3.org website. It has a food item called "LightMeatFowlCourse", which roughly describes my grilled chicken sandwich. So when I write the HTML for my website, I could put the following text into my HTML document:



<html>
<head>
<title>Sandwich Shop Menu</title>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ccf="http://www.wwaac.org/2003/10/ConceptCodingFramework-ns#"
    xmlns:food="http://www.w3.org/TR/2004/REC-owl-guide-20040210/wine.rdf">
    <rdf:Description rdf:about="#grilled_chicken_sandwich">
        <ccf:Concept rdf:resource="food:#LightMeatFowlCourse"/>
    </rdf:Description>
</rdf:RDF>
</head>
<body>
	<a name="grilled_chicken_sandwich">House Special Grilled Chicken Sandwich</a> -You'll Love this Sandwich! Juiciest in Town- Just $4.95!
</body>
</html>

There are several things going on in this example that I need to explain.

First of all, the way I added the RDF information is somewhat arbitrary (adapted from a W3C Draft and this doc on markup strategies). At this time, marking up web sites with RDF is still a relatively new practice, and there is no hard-and-fast standard for how to mark up web pages. This is partly due to the fact that HTML itself is considered an antiquated format that is meant to be replaced with XHTML.

If you look at the layout of the information in this sample, note that all the RDF information is in the header, whereas the actual visible web page data is down at the bottom and is free of any RDF information. This is because RDF assumes that any "nameable" entity on the internet should have its own URI (Uniform Resource Identifier), and therefore the chicken sandwich needs to be tagged indirectly: in this example, the sandwich name is surrounded by a bookmark tag (<a name="grilled_chicken_sandwich">), which allows it to be referred to by the name grilled_chicken_sandwich. Most examples of RDF you will currently find on the web are used to link an entire page to a concept, so the RDF in the header can just link to the document as a whole without any bookmarks.
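
For comparison, here is a minimal sketch of what the header RDF might look like when it describes the document as a whole rather than a bookmarked fragment. This is my own illustrative variation on the example above, reusing the same namespace declarations; an empty rdf:about="" is conventionally resolved to the current document's own URI:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ccf="http://www.wwaac.org/2003/10/ConceptCodingFramework-ns#"
    xmlns:food="http://www.w3.org/TR/2004/REC-owl-guide-20040210/wine.rdf">
    <!-- An empty rdf:about="" refers to the enclosing document itself,
         so no bookmark tag is needed in the body of the page -->
    <rdf:Description rdf:about="">
        <ccf:Concept rdf:resource="food:#LightMeatFowlCourse"/>
    </rdf:Description>
</rdf:RDF>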

Now let's look at the actual body of the RDF in our menu example... All the data is stored between the <rdf:RDF> and </rdf:RDF> tags.

At the top of the RDF section, we define some namespaces. Remember that RDF is designed for vocabulary from different sources to be used together seamlessly; by coming up with prefixes (rdf, ccf, and food) for each source, we are able to refer to each source unambiguously. Note that each source prefix definition links to a description-logic-style ontology published somewhere on the internet. The rdf namespace points to a vocabulary of basic RDF tag semantics. The ccf namespace gives us a vocabulary for linking items to concepts. The food namespace enables us to refer to food items.
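
To make the shorthand concrete, a prefixed name is just an abbreviation for the full URI built from the corresponding namespace declaration. Assuming the food prefix resolves the way the xmlns:food declaration above suggests, the expansion looks roughly like this:

<!-- food:#LightMeatFowlCourse is shorthand for (roughly):
     http://www.w3.org/TR/2004/REC-owl-guide-20040210/wine.rdf#LightMeatFowlCourse -->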

After our namespaces are declared, we can now define our grilled_chicken_sandwich using the rdf:Description tag. We do this by stating that the description is about "#grilled_chicken_sandwich" and that this sandwich can be described by the concept "food:#LightMeatFowlCourse".
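
In other words, this small block of markup boils down to a single subject-predicate-object statement (a "triple"). Read informally (the phrasing below is mine, not part of any standard), it says:

<!-- Subject:   #grilled_chicken_sandwich   (the menu item bookmarked on this page)
     Predicate: ccf:Concept                 (is described by the concept)
     Object:    food:#LightMeatFowlCourse   (a light-meat fowl course from wine.rdf) -->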

So we tagged a web page with RDF... so what the heck is all this good for?


We have now tagged the grilled chicken sandwich from our menu with a description logics ontology that can be used to discuss different food items in a consistent manner. Since "LightMeatFowlCourse" is a very vague description of our sandwich, we could also publish our own RDF vocabulary. Better yet, RDF has special features for mapping the common vocabulary to our own, as sketched below; this makes it possible to have the specificity of our own vocabulary while maintaining compliance with a standard one.
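
As a rough illustration of such a mapping, here is a minimal sketch that uses the standard rdfs:subClassOf property to declare that a hypothetical GrilledChickenSandwich class of our own (published under an example.com namespace invented for this illustration) is a more specific kind of LightMeatFowlCourse:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <!-- Our own, more specific class (hypothetical namespace for illustration only) -->
    <rdfs:Class rdf:about="http://www.example.com/menu-vocab#GrilledChickenSandwich">
        <!-- Every GrilledChickenSandwich also counts as a LightMeatFowlCourse -->
        <rdfs:subClassOf rdf:resource="http://www.w3.org/TR/2004/REC-owl-guide-20040210/wine.rdf#LightMeatFowlCourse"/>
    </rdfs:Class>
</rdf:RDF>

With a mapping like this in place, a tool that only knows the standard wine.rdf vocabulary can still treat our sandwich as a LightMeatFowlCourse, while tools that understand our own vocabulary get the extra specificity.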

But what good does it do to have web information linked with RDF anyway? Well, at this time, not much. Notice how much effort it required for us to create a single link, and notice that none of this effort in any way directly enhances the web site for a user who is perusing our sandwich menu. In order for our users to benefit from this extra effort, they would require some new type of browser, a semantic web browser. This mythical browser would be able to "understand" the different types of sandwiches on our page and would allow us to use this information in new ways: maybe it could find other places that sell chicken sandwiches on the web, maybe it could ferret out nutritional information, and so on. While a search engine like Google might let us do some of this already, a semantic web browser would allow us to do it in a way that is far more scientifically robust. Of course, the fact that the closest match currently available is "LightMeatFowlCourse" makes it pretty unlikely that such inferences could be made at this time.

Notice, however, that the reason we couldn't find a close match for our sandwich is that the database we used was created by a writer who is knowledgeable in oenology (the study of wines) and not chicken sandwiches. By linking our sandwich to this vocabulary from another domain, we have created a bridge that could potentially provide many side benefits. A semantic web browser, for instance, might be able to tell us which wines would be appropriate for our sandwich, leveraging our information in new and unexpected directions. Of course, we had to put in a lot of effort for only such a tiny benefit, and therein lies the rub: any single RDF link has only limited utility, and RDF only becomes valuable as the overall availability of RDF data on the internet reaches a critical mass.

Cognitive Load Theory and the Semantic Web


In my opinion, however, the most difficult obstacle to the success of the semantic web is an issue of human psychology: The Writer pays a huge cost when building RDF linkages into any knowledge that he or she publishes. This can be appreciated by looking at the results of studies related to Cognitive Load Theory. This theory states that humans have a limited amount of working memory available when solving problems, and that any extra cognitive task that must be performed constantly in concert with other tasks will drastically decrease a person's performance. This theory might explain, for instance, why it is much more difficult to understand text when it is written in ALL CAPS: reading such text requires a constant, if seemingly trivial, extra effort to parse words, because our brains are less accustomed to this format than to regular text. This extra parsing leaves less working memory for us to process the ideas contained within the text, leading to decreased comprehension.

In a similar way, I believe, forcing The Writer to constantly tag the ideas in his or her unstructured documents with RDF information causes a similarly drastic decrease in productivity. Creating simple unstructured text, on the other hand, is much more cognitively compatible with how humans function. This means that creating content with strong semantic linkages is far more costly than creating regular unstructured content, and it will only become common practice once the benefits of such documents outweigh their cost, something the current internet environment does not yet offer.

In summary, RDF and the semantic web are promising ideas that still have some major shortcomings to overcome before they can achieve widespread adoption. As indicated in our triangle of knowledge, RDF and the semantic web represent a mostly scientific approach to the problem of knowledge representation, meaning they tend to be limited both in their pragmatism (embodied by the guy in the garage) and in their friendliness for content authors (embodied by The Writer).

UMLS and the Guys in The Garage >>