Tuesday, December 01, 2009

Dueling Data Formats


I've been in heated debates over the years about the relative benefits of XML (extensible markup language) and CSV (comma separated value) data structures for managing numeric values (numbers) and small chunks (blocks) of text. These structures provide a framework for giving data meaning (semantics) and for presenting (rendering) the data in reports. This post describes and evaluates the two structures.

Data Meaning

One thing the XML and CSV structures can do is convey (express and transmit) the data's meaning.

XML structures data using "markup tags." These tags are names within angle brackets that surround a piece of data (called a data element). Because XML element names cannot contain spaces, the words in a name are joined with underscores. For example, consider an XML file with these two lines of text:
  1. <patient_first_name>John</patient_first_name>
  2. <patient_last_name>Jones</patient_last_name>
Line 1 identifies the data element "John" as the patient's first name and line 2 identifies the element "Jones" as his last name. Thus, the meaning of a data element is conveyed by the tags that surround it. Note that additional tags are used to identify hierarchies (i.e., parent-child relationships between the data).
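
As a minimal sketch of how a program reads meaning from tags, the Python snippet below parses a small XML fragment with the standard library. The element names (patient, first_name, last_name) are illustrative assumptions and do not come from any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the element names are illustrative only.
xml_text = """
<patient>
  <first_name>John</first_name>
  <last_name>Jones</last_name>
</patient>
"""

root = ET.fromstring(xml_text)

# Each value is looked up by the name of the tag that surrounds it,
# so the order of the child elements does not matter.
first_name = root.findtext("first_name")
last_name = root.findtext("last_name")

print(first_name, last_name)
```

The nesting of first_name and last_name inside patient is an example of the parent-child hierarchy that additional tags express.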

Using CSV in a novel way accomplishes the same result, but instead of using markup tags, only commas are used. In a CSV, the data are organized in lines of text (called text "strings"), and each piece of data (i.e., each datum) is separated (i.e., "delimited") with a comma. So, in the example above, the text string "John,Jones" is all that's needed. But commas, unlike markup tags, do not assign any meaning to the data. Instead, the meanings of "John" and "Jones" are conveyed by their location (position) within the CSV file, which is determined by their line and comma numbers.

The position of the data can be easily visualized in a data grid (such as a spreadsheet). When you open a CSV file in a data grid, it automatically places each datum in a cell by assigning each line of text to a data grid row, and by separating each datum into its corresponding data grid column. This is depicted in the Excel spreadsheet image below with "John" in cell "A1" and "Jones" in cell "B1."


The meaning of the data in a CSV file is conveyed by instructions in the spreadsheet that identify any data in cell A1 as the patient's first name and any data in cell B1 as the patient's last name. This is a novel (patented) method for conveying data meaning between a CSV (or other delimited file) and a data grid based on the position of the data; I invented this process in the 1990s and have been using it since.
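
The general idea of position-based meaning can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented spreadsheet method itself: the COLUMN_MEANINGS list and the label_row helper are hypothetical names standing in for the instructions a spreadsheet template would hold.

```python
import csv
import io

# Hypothetical mapping from column position to meaning. In a spreadsheet,
# index 0 corresponds to column A and index 1 to column B.
COLUMN_MEANINGS = ["patient_first_name", "patient_last_name"]

csv_text = "John,Jones\n"

def label_row(row, meanings):
    """Pair each value with the meaning assigned to its column position."""
    return dict(zip(meanings, row))

rows = list(csv.reader(io.StringIO(csv_text)))
record = label_row(rows[0], COLUMN_MEANINGS)

print(record)
```

The CSV file itself carries no labels; all of the meaning lives in the separate mapping, which is why the sender and receiver must agree on the column order.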

Data Presentation

In addition to conveying the meaning of data, XML and CSV structures also enable the data to be presented (displayed/rendered) in reports. Both XML and CSV methods use templates containing the formatting instructions that render the data. There are multiple ways to present XML data, including the use of JavaScript, CSS, XSL, and XHTML. CSV data, on the other hand, can be presented via spreadsheets and database report writers. Note that it is easy to convert XML to CSV and I have an open source program that demonstrates how to do it at this link.
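
To give a rough sense of what an XML-to-CSV conversion involves, here is a minimal Python sketch. It is not the open source program mentioned above. It assumes a small CCR-like fragment whose layout is abbreviated for illustration, and the xml_to_csv function name and the chosen column order are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Abbreviated CCR-like fragment, used only for illustration.
xml_text = """
<Results>
  <Test>
    <Description><Text>HGB</Text></Description>
    <TestResult>
      <Value>13.2</Value>
      <Units><Unit>g/dl</Unit></Units>
    </TestResult>
  </Test>
  <Test>
    <Description><Text>WBC</Text></Description>
    <TestResult>
      <Value>6.7</Value>
      <Units><Unit>10+3/ul</Unit></Units>
    </TestResult>
  </Test>
</Results>
"""

def xml_to_csv(text):
    """Flatten each <Test> element into one CSV line: name, units, value."""
    root = ET.fromstring(text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for test in root.iter("Test"):
        writer.writerow([
            test.findtext("Description/Text"),
            test.findtext("TestResult/Units/Unit"),
            test.findtext("TestResult/Value"),
        ])
    return out.getvalue()

csv_text = xml_to_csv(xml_text)
print(csv_text)
```

The conversion drops the tag names, so the column order chosen in xml_to_csv becomes the only record of what each value means.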

Simplicity & Efficiency

When dealing with numeric values and short text strings, it's easy to see that CSV data structures are much simpler and more efficient than XML. This distinction is clearly depicted in continuity of care documents (CCDs) and continuity of care records (CCRs). The examples below compare the lab test results section of a CCD, a CCR, and a CSV file. They all contain the same essential data, which are highlighted in yellow.

XML Structured Files: Actual Examples of XML code in a CCD and CCR

CCD Lab Results Section Example

The following is an example of the lab section of a CCD. It contains 2,456 characters.
<component>
<section>
    <templateId root="2.16.840.1.113883.10.20.1.14"/>
    <code code="30954-2" codeSystem="2.16.840.1.113883.6.1"/>
    <title>Results</title>
    <text>
        <table border="1" width="100%">
            <thead>
                <tr><th>&#160;</th><th>2000-03-23</th></tr>
            </thead>
            <tbody>
                <tr><td>HGB (13-18 g/dl)</td><td>13.2</td></tr>
                <tr><td>WBC (4.3-10.8 10+3/ul)</td><td>6.7</td></tr>
                <tr><td>PLT (135-145 meq/l)</td><td>123*</td></tr>         
            </tbody>
        </table>
    </text>
    <entry typeCode="DRIV">
        <organizer classCode="BATTERY" moodCode="EVN">
            <templateId root="2.16.840.1.113883.10.20.1.32"/>
            <id root="7d5a02b0-67a4-11db-bd13-0800200c9a66"/>
            <code code="43789009" codeSystem="2.16.840.1.113883.6.96" displayName="CBC WO DIFFERENTIAL"/>
            <statusCode code="completed"/>
            <effectiveTime value="200003231430"/>
            <component>
                <observation classCode="OBS" moodCode="EVN">
                    <templateId root="2.16.840.1.113883.10.20.1.31"/>
                    <id root="107c2dc0-67a5-11db-bd13-0800200c9a66"/>
                    <code code="30313-1" codeSystem="2.16.840.1.113883.6.1" displayName="HGB"/>
                    <statusCode code="completed"/>
                    <effectiveTime value="200003231430"/>
                    <value xsi:type="PQ" value="13.2" unit="g/dl"/>
                    <interpretationCode code="N" codeSystem="2.16.840.1.113883.5.83"/>
                    <referenceRange>
                        <observationRange>
                            <text>M 13-18 g/dl </text>
                        </observationRange>
                    </referenceRange>
                </observation>
            </component>
            <component>
                <observation classCode="OBS" moodCode="EVN">
                    <templateId root="2.16.840.1.113883.10.20.1.31"/>
                    <id root="8b3fa370-67a5-11db-bd13-0800200c9a66"/>
                    <code code="33765-9" codeSystem="2.16.840.1.113883.6.1" displayName="WBC"/>
                    <statusCode code="completed"/>
                    <effectiveTime value="200003231430"/>
                    <value xsi:type="PQ" value="6.7" unit="10+3/ul"/>
                    <interpretationCode code="N" codeSystem="2.16.840.1.113883.5.83"/>
                    <referenceRange>
                        <observationRange>
                            <value xsi:type="IVL_PQ">
                                <low value="4.3" unit="10+3/ul"/>
                                <high value="10.8" unit="10+3/ul"/>
                            </value>
                        </observationRange>
                    </referenceRange>
                </observation>
            </component>
            <component>
                <observation classCode="OBS" moodCode="EVN">
                    <templateId root="2.16.840.1.113883.10.20.1.31"/>
                    <id root="80a6c740-67a5-11db-bd13-0800200c9a66"/>
                    <code code="26515-7" codeSystem="2.16.840.1.113883.6.1" displayName="PLT"/>
                    <statusCode code="completed"/>
                    <effectiveTime value="200003231430"/>
                    <value xsi:type="PQ" value="123" unit="10+3/ul"/>
                    <interpretationCode code="L" codeSystem="2.16.840.1.113883.5.83"/>
                    <referenceRange>
                        <observationRange>
                            <value xsi:type="IVL_PQ">
                                <low value="150" unit="10+3/ul"/>
                                <high value="350" unit="10+3/ul"/>
                            </value>
                        </observationRange>
                    </referenceRange>
                </observation>
            </component>
        </organizer>
    </entry> 
</section>         
</component>

CCR Lab Results Section Example

The following is an example of the lab section of a CCR. It contains 1,241 characters, which is about half the size of the CCD.
<Results>
  <Result>
      <Test>
        <DateTime>
            <Type>
                <Text>Collection start date</Text>
            </Type>
            <ExactDateTime>2000-03-23</ExactDateTime>
        </DateTime>
        <Description>
            <Code>
                <Value>30954-2</Value>
                <CodingSystem>LOINC</CodingSystem>
            </Code>
            <Text>HGB</Text>
        </Description>
          <TestResult>
              <Value>13.2</Value>
              <Units>
                <Unit>g/dl</Unit>
            </Units>
          </TestResult>
        <NormalResult>
            <Normal>
                <Description>
                    <Text>13 - 18</Text>
                </Description>
            </Normal>
        </NormalResult>
    </Test>
      <Test>
          <DateTime>
            <Type>
                <Text>Collection start date</Text>
            </Type>
            <ExactDateTime>2000-03-23</ExactDateTime>
        </DateTime>
        <Description>
            <Code>
                <Value>33765-9</Value>
                <CodingSystem>LOINC</CodingSystem>
            </Code>
            <Text>WBC</Text>
        </Description>
        <TestResult>
            <Value>6.7</Value>
            <Units>
                <Unit>10+3/ul</Unit>
            </Units>
        </TestResult>
        <NormalResult>
            <Normal>
              <Description>
                <Text>4.3 - 10.8</Text>
              </Description>
            </Normal>
        </NormalResult>
    </Test>
    <Test>
        <DateTime>
            <Type>
                <Text>Collection start date</Text>
            </Type>
            <ExactDateTime>2000-03-23</ExactDateTime>
        </DateTime>
        <Description>
            <Code>
                <Value>26515-7</Value>
               <CodingSystem>LOINC</CodingSystem>
            </Code>
            <Text>PLT</Text>
        </Description>
            <TestResult>
              <Value>123*</Value>
            <Units>
                <Unit>meq/l</Unit>
            </Units>
        </TestResult>
        <NormalResult>
            <Normal>
                <Description>
                    <Text>135-145</Text>
                </Description>
            </Normal>
        </NormalResult>
    </Test>
  </Result>
</Results>

CSV Lab Results Data

The following shows the actual contents of a CSV file containing the same essential lab data as the CCD and CCR. It can produce the exact same report as the CCD and CCR, yet contains only 128 characters, which is about 1/20th (5%) the size of the CCD:
Results
HGB, g/dl,13-18,13.2,30954-2,2000-03-23
WBC, 10+3/ul,4.3-10.8,6.7,33765-9,2000-03-23
PLT, meq/l,135-145,123,26515-7,2000-03-23
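
As a minimal sketch of how a program could turn these lines back into labeled results, the Python snippet below reads the CSV by column position and flags values outside the reference range. The FIELDS names are assumptions inferred from the values in each column, and parse_results and flag are hypothetical helpers, not part of any spreadsheet template.

```python
import csv
import io

csv_text = """Results
HGB, g/dl,13-18,13.2,30954-2,2000-03-23
WBC, 10+3/ul,4.3-10.8,6.7,33765-9,2000-03-23
PLT, meq/l,135-145,123,26515-7,2000-03-23
"""

# Assumed column meanings, inferred from the values in each position.
FIELDS = ["test", "units", "range", "value", "code", "date"]

def parse_results(text):
    """Return one dict per data line, labeling values by column position."""
    lines = io.StringIO(text)
    next(lines)  # skip the "Results" section title on the first line
    # skipinitialspace drops the blank that follows the first comma
    reader = csv.reader(lines, skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]

def flag(result):
    """Return "L", "H", or "N" by comparing the value to its range."""
    low, high = (float(x) for x in result["range"].split("-"))
    value = float(result["value"])
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "N"

results = parse_results(csv_text)
flags = [flag(r) for r in results]

for r, f in zip(results, flags):
    print(r["test"], r["value"], r["units"], f)
```

The flags play the same role as the interpretationCode elements in the CCD, but they are computed from the range column rather than stored in the file.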

Summary of the CCD, CCR, and CSV Differences

The differences between the CCD, CCR, and CSV are as follows:
  1. The CCD uses XML markup tags along with embedded HTML formatting tags, as well as extensive metadata.
  2. The CCR also uses XML, but contains no HTML and has much less metadata. It requires half as many characters as the CCD, but 10 times as many as the CSV.
  3. The CSV file has no markup tags of any kind and only minimal metadata. It has the fewest characters by far. Like the other two formats, the CSV can generate CCR and CCD reports containing the same data elements.
  4. The CSV data structure is also much simpler than the XML structures.
  5. The CCD and CCR use complex Extensible Stylesheet Language (XSL) templates to transform and render the XML files, while the CSV is used by a spreadsheet template.
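
The size ratios quoted in this post can be checked with a few lines of Python. The character counts are the ones stated above for each example; the variable names are only illustrative.

```python
# Character counts as stated in the post for each lab results example.
sizes = {"CCD": 2456, "CCR": 1241, "CSV": 128}

ccr_vs_ccd = sizes["CCR"] / sizes["CCD"]  # CCR as a fraction of the CCD
ccd_vs_csv = sizes["CCD"] / sizes["CSV"]  # how many times larger the CCD is
ccr_vs_csv = sizes["CCR"] / sizes["CSV"]  # how many times larger the CCR is

print(f"CCR is {ccr_vs_ccd:.0%} of the CCD")
print(f"CCD is {ccd_vs_csv:.1f} times the CSV")
print(f"CCR is {ccr_vs_csv:.1f} times the CSV")
```

These work out to roughly one half, 19 times, and just under 10 times, which matches the rounded figures used in the comparison.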

Conclusion

It's indisputable that the CSV data format has much greater simplicity and efficiency than XML, especially when storing and transmitting numeric data and small chunks of text. Nevertheless, XML data formats are more popular because few people have recognized that using data grids (spreadsheets) in a novel way: (a) conveys the meaning of the data stored in a CSV file and (b) enables the CSV's contents to be presented in rich reports.

But what's so important about simplicity and efficiency? Well, for one thing, it's a benefit when conservation of resources is important, such as minimizing bandwidth use, computer processing time, and storage space. And conservation of resources is crucial when time and money are important considerations, as well as when accommodating people who lack the luxury of broadband Internet, large hard drives, and high-speed computers. Also, in a world where technology is becoming increasingly complex, simplicity is a "breath of fresh air"; in fact, we ought to be making things less costly and complex, not more!

Continued at this link.