Customer#  Name       Phone
12         Bob Smith  772-4500
13         MegaCorp   971-0504

Order#  Widget Type  Customer#  Supplier#
100     Yellow       12         1033
101     Red          12         2022
102     Blue         13         1033

Supplier#  Name                 Phone
1033       Acme Widget Company  232-3450
2022       Widgets'R'Us         122-3992
Here you see three tables and the "links" between them: each row in the order table refers to a customer through the Customer# column and to a supplier through the Supplier# column.
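To make the idea of a "link" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customer, orders, customer_no, and so on) are my own stand-ins for the Customer#, Order#, and Supplier# columns above, not names taken from any particular system.

```python
import sqlite3

# In-memory database holding the three example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE supplier (supplier_no INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE orders (
        order_no    INTEGER PRIMARY KEY,
        widget_type TEXT,
        customer_no INTEGER REFERENCES customer(customer_no),
        supplier_no INTEGER REFERENCES supplier(supplier_no)
    );
    INSERT INTO customer VALUES (12, 'Bob Smith', '772-4500');
    INSERT INTO customer VALUES (13, 'MegaCorp', '971-0504');
    INSERT INTO supplier VALUES (1033, 'Acme Widget Company', '232-3450');
    INSERT INTO supplier VALUES (2022, 'Widgets''R''Us', '122-3992');
    INSERT INTO orders VALUES (100, 'Yellow', 12, 1033);
    INSERT INTO orders VALUES (101, 'Red', 12, 2022);
    INSERT INTO orders VALUES (102, 'Blue', 13, 1033);
""")


def orders_with_names(connection):
    """Follow the customer and supplier links to build readable order rows."""
    return connection.execute("""
        SELECT o.order_no, o.widget_type, c.name, s.name
        FROM orders AS o
        JOIN customer AS c ON c.customer_no = o.customer_no
        JOIN supplier AS s ON s.supplier_no = o.supplier_no
        ORDER BY o.order_no
    """).fetchall()


for row in orders_with_names(conn):
    print(row)
```

The order rows themselves store only numbers. The names appear only once the query follows those numbers into the other two tables.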
There are several common idioms that are adhered to when designing a relational database system. First of all, every table will typically be given a column holding a unique number for each row (in this case Customer#, Order#, and Supplier#). This number is typically a kind of "throw away" number that is never shown to the user. If the user of the program needs to be able to look at an order number, this number is usually stored separately under a special name (like ExternalOrder#, maybe), to distinguish it from the unique internal Order#. Why is this? Simply because relational database systems depend so heavily on these numbers to keep track of linkages between tables that they may conflict with things the user wants to do with an Order#. For instance, a user may want to change the Order# or synchronize Order#s between databases, something that cannot easily be allowed if the Order# functions as a link to other tables.
Another common idiom is to never have the same piece of data duplicated in several places in the database. Data duplicated in several places is called "de-normalized" data. By using a simple method called "database normalization", such duplicates are usually removed by creating new tables specifically to hold the duplicated data.
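As a rough illustration of what normalization does, the sketch below starts from a hypothetical de-normalized order list (customer name and phone repeated on every order) and splits it into a customer table and an order table. The function name and the starting customer number are my own choices for the example.

```python
# De-normalized data: Bob Smith's name and phone appear on every one of his orders.
DENORMALIZED_ORDERS = [
    (100, "Yellow", "Bob Smith", "772-4500"),
    (101, "Red", "Bob Smith", "772-4500"),
    (102, "Blue", "MegaCorp", "971-0504"),
]


def normalize(rows, first_customer_no=12):
    """Move the repeated customer data into its own table, linked by number."""
    customer_numbers = {}  # (name, phone) -> customer number
    orders = []
    for order_no, widget_type, name, phone in rows:
        key = (name, phone)
        if key not in customer_numbers:
            customer_numbers[key] = first_customer_no + len(customer_numbers)
        orders.append((order_no, widget_type, customer_numbers[key]))
    customers = [(no, name, phone) for (name, phone), no in customer_numbers.items()]
    return customers, orders


customers, orders = normalize(DENORMALIZED_ORDERS)
print(customers)
print(orders)
```

After the split, each customer's name and phone are stored exactly once, and a change of phone number touches a single row.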
Relational database systems turned out to be perfect for business applications, for several reasons:
- Companies operate under the principles of a business model, which typically dictates that the company sell only a fixed number of different services or products: Whether it's a McDonald's or a Pfizer, a business limits itself in the products it offers in order to maintain consistency and benefit from economies of scale. This "fixedness" of business data allows it to be comfortably mapped onto the fixed table rows that a relational database requires.
- The relational database model allows data to be stored relatively efficiently, because the stored data is free of "metadata", a term for data that describes other data. The fixed length of the fields allows the program to "know" what each piece of data means without extra descriptive information saying "what" each piece of data is. Since large companies tend to have very large amounts of data in their databases, and since computer storage in the 1960s, 1970s, and 1980s was very expensive, this made relational models ideally suited for business.
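The point about fixed-length fields can be sketched with Python's struct module. The record layout here (a 4-byte number, a 20-byte name, an 8-byte phone) is purely an assumption for illustration and is not how any specific database lays out its rows.

```python
import struct

# Hypothetical layout: 4-byte customer number, 20-byte name, 8-byte phone.
# "<" means little-endian with no padding, so every record is exactly 32 bytes.
CUSTOMER_RECORD = struct.Struct("<I20s8s")


def pack_customer(number, name, phone):
    """Store a customer as raw bytes; short strings are padded with zero bytes."""
    return CUSTOMER_RECORD.pack(number, name.encode("ascii"), phone.encode("ascii"))


def unpack_customer(record):
    """Recover the fields purely from their position in the record."""
    number, name, phone = CUSTOMER_RECORD.unpack(record)
    return (
        number,
        name.rstrip(b"\x00").decode("ascii"),
        phone.rstrip(b"\x00").decode("ascii"),
    )


record = pack_customer(12, "Bob Smith", "772-4500")
print(len(record))
print(unpack_customer(record))
```

Nothing in the stored bytes says "this is a name" or "this is a phone number". The program knows only because of where each field sits, which is the sense in which the data is free of metadata.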
XML and Other Markup Languages
Before we delve into XML, let's take a look at the concept of "syntax free" data structures. The question behind "syntax free" data structures (which, arguably, would better be described as "minimal syntax") is as follows: What is the least amount of "structure" data has to have in order to describe an arbitrarily complicated piece of data in an "understandable" fashion? One natural answer to this question is to break the data up in a tree-like manner, where each branch of the tree has two things in it: One thing explains what type of data the branch contains, one contains the actual data.
In the end, the data still needs to have some kind of syntax to tell you what the branches are and what parts describe types and what parts are just raw data. In A.I. (especially among researchers involved with the programming language LISP), a common method for accomplishing this was by using "symbolic expressions" (S-expressions). In this format, each branch of the tree is represented as a pair of parentheses. Within the parentheses, the first item describes the "type" of the data, and the remaining items hold the actual data. Our previous example, for instance, could be described as follows:
(customer "Bob Smith" "772-4500"
  (widget "yellow" (supplier# 1033))
  (widget "red" (supplier# 2022)))
(customer "MegaCorp" "971-0504"
  (widget "blue" (supplier# 1033)))
(supplier "Acme Widget Company" "232-3450" (supplier# 1033))
(supplier "Widgets'R'Us" "122-3992" (supplier# 2022))
In this case, the "types" for the branches are "supplier", "customer", and "widget". As you can see, the data is represented as a tree instead of a set of tables. Since the related pieces of data are stored together in the same place, we need fewer weird "unique ids" to link data. Additionally, the layout of the data is natural for a human to understand, since items that are related to each other are not in completely different places, as they would be in a relational database system. This type of format is structurally very similar to HTML, which was a central part of the internet revolution. The purpose of HTML (which is the language of Web pages) was to allow the representation of arbitrarily complicated book/magazine/newspaper-like documents, containing different text styles, pictures, tables and other layouts that human authors prefer for organizing information for other humans to read. In HTML, however, the type information is denoted by putting it in angle brackets, so a piece of text that is meant to be in bold has <B> in front of it and </B> at the end of it.
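To show how little machinery the parenthesized format needs, here is a minimal sketch of a reader that turns such text into nested Python lists. It is not a full LISP reader: the parse_sexprs name is my own, and it assumes strings contain no escaped quotes.

```python
import re

# A token is a parenthesis, a double-quoted string, or a bare word/number.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse_sexprs(text):
    """Parse text holding zero or more expressions into nested lists."""
    stack = [[]]  # stack[0] collects the top-level branches
    for token in TOKEN.findall(text):
        if token == "(":
            stack.append([])  # open a new branch
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            branch = stack.pop()
            stack[-1].append(branch)  # attach finished branch to its parent
        elif token.startswith('"'):
            stack[-1].append(token[1:-1])  # strip the quotes
        else:
            stack[-1].append(int(token) if token.isdigit() else token)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]


EXAMPLE = '''
(customer "MegaCorp" "971-0504"
  (widget "blue" (supplier# 1033)))
(supplier "Acme Widget Company" "232-3450" (supplier# 1033))
'''

for branch in parse_sexprs(EXAMPLE):
    print(branch[0], branch[1:])
```

The first item of every list is the branch's "type" and the rest is its data, so the whole reader fits in a couple of dozen lines. A side effect is that unbalanced parentheses are caught immediately rather than silently changing the shape of the tree.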
A more general sibling of HTML is XML, which takes the ideas of HTML and applies them to any type of data, not just web-like documents. For instance, part of the same data described above can be represented in XML as follows (XML tag names cannot contain "#", so supplier# becomes supplier-id here):
<customer>Bob Smith <phone>772-4500</phone>
  <widget>yellow <supplier-id>1033</supplier-id></widget>
</customer>
<supplier>Acme Widget Company <phone>232-3450</phone> <supplier-id>1033</supplier-id></supplier>
As you can see, this format is analogous to the symbolic-expression (S-expression) format, invented back at the dawn of the A.I. era. (XML, however, does add innumerable extra flourishes such as header info linking to extra descriptive files called DTDs, foreign character support through Unicode, support for namespaces, internal linkages, etc.) These newer data formats, in general called "markup languages", comprise the second revolution in knowledge representation described in this primer.
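The analogy can be checked with a small sketch using Python's standard xml.etree.ElementTree module, which reads an XML document into the same kind of nested [type, data, ...] lists. The wrapping records element and the supplier-id tag are my own assumptions: XML requires a single root element, and its tag names cannot contain "#".

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <records> and <supplier-id> are illustrative names.
DOCUMENT = """
<records>
  <customer>Bob Smith <phone>772-4500</phone>
    <widget>yellow <supplier-id>1033</supplier-id></widget>
  </customer>
  <supplier>Acme Widget Company <phone>232-3450</phone>
    <supplier-id>1033</supplier-id>
  </supplier>
</records>
"""


def to_tree(element):
    """Convert an element into a nested list: [tag, leading text, *children]."""
    text = (element.text or "").strip()
    return [element.tag, text] + [to_tree(child) for child in element]


root = ET.fromstring(DOCUMENT.strip())
for record in root:
    print(to_tree(record))
```

Each resulting list starts with the tag name, followed by the raw data and any nested branches, which is the same shape the parenthesized notation produces.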
Now that we've looked briefly at XML, let's look at some of the things that these knowledge representation systems can't handle very well.
What RDBMSes, XML, and Other Commonly Used KR Systems Can't Do Well
Since I work in the medical software field, I am exposed every day to the limitations imposed on medical informatics by the constraints of these common representation methods. It may be surprising to an outsider that these limitations are so hard to overcome. After all, isn't medicine just "another business"? Wouldn't the same representational systems that proved so effective for business software work equally well in medicine? I believe the answer is no.
The fact is that medicine is qualitatively different from any regular business, and I don't just mean this in a mushy "because it involves human lives" kind of way (although that's a good reason too)... It really is something different. Remember what we said about typical businesses: Because they have a well-defined, finite business model, it is relatively straightforward to translate the rules of the business into a computer information system.
However, in medicine there can really be no business models: A patient may walk into a clinic with diabetes, osteoporosis, hypertension, or any other medical condition and the clinic needs to be able to address it. Every human disease has quirks that require many unique data representations that are hard to fit neatly into tables. Information involving patients is therefore very difficult to store in fixed-length database rows. Although this limitation can always be overcome by adding additional tables for new types of data, this strategy eventually becomes unwieldy: After all, medicine is filled with innumerable exceptions and idiosyncrasies.
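A rough sketch of the "add another table" strategy, again using Python's sqlite3 module, shows why it scales poorly. Every table and column name below is hypothetical and chosen only to illustrate the pattern. None of it comes from a real clinical system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_no INTEGER PRIMARY KEY, name TEXT)")

# Hypothetical per-condition tables: each condition needs its own set of columns.
CONDITION_TABLES = {
    "diabetes_visit": "patient_no INTEGER, hba1c_percent REAL, insulin_units REAL",
    "osteoporosis_visit": "patient_no INTEGER, bone_density_t_score REAL",
    "hypertension_visit": "patient_no INTEGER, systolic_mmhg INTEGER, diastolic_mmhg INTEGER",
}


def add_condition_table(connection, table, columns):
    """A schema change a developer must write for every new kind of data."""
    connection.execute(f"CREATE TABLE {table} ({columns})")


def table_names(connection):
    """List all tables currently in the database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


for table, columns in CONDITION_TABLES.items():
    add_condition_table(conn, table, columns)

print(table_names(conn))
```

Three conditions already mean three schema changes, each written by a developer rather than by the clinician who knows what needs recording. The table count grows with every condition, exception, and special case.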
Because of the empirical nature of medicine, it is very difficult for software developers to encode all the structured parts of the medical business inside a software application ahead of time. Yet this is how business software is typically designed: the developers fix the structure in advance, and the operator of the program creates structured data only at the points the system provides (such as pressing a button to add a new widget to the database). An end user would never, for instance, create a new RDBMS column or a new XML tag, because these are too unwieldy for a domain expert without computer expertise to work with directly.
But since medicine is so unpredictable and not driven by a pre-determined structure, this is exactly what would be needed for a truly powerful medical system: The doctor would need to be able to enter fully structured information directly into the system in a manner that cannot be predicted ahead of time by any software developers.
In a way, an ideal medical software entry system would need to allow a clinician to create new RDBMS tables or new XML tags on the fly, something that is not practical with the current tools available to the Guys in the Garage. However, some of the concepts developed by The Scientist, which we will be discussing in the next few sections, are able to offer some possible solutions to this dilemma. This is why I think science needs to play a more critical role in solving the many remaining IT software problems than it has in the past.