Dueling Data Formats
I've been in heated debates over the years about the relative benefits of XML (extensible markup language) and CSV (comma separated value) data structures for managing numeric values (numbers) and small chunks (blocks) of text. These structures provide a framework for giving data meaning (semantics) and for presenting (rendering) the data in reports. This post describes and evaluates the two structures.
Data Meaning
One thing the XML and CSV structures can do is convey (express and transmit) the data's meaning.
XML structures data using "markup tags." These tags are words within angle brackets, which surround a piece of data (called a data element or attribute). For example, consider an XML file with these two lines of text:
- <patient first name>John</patient first name>
- <patient last name>Jones</patient last name>
Using CSV in a novel way accomplishes the same result, but instead of using markup tags, only commas are used. In a CSV, the data are organized in lines of text (called text "strings"), and each piece of data (i.e., each datum) is separated (i.e., "delimited") with a comma. So, in the example above, the text string "John,Jones" is all that's needed. But commas, unlike markup tags, do not assign any meaning to the data. Instead, the meanings of "John" and "Jones" are conveyed by their location (position) within the CSV file, which is determined by their line and comma numbers.
The position of the data can be easily visualized in a data grid (such as a spreadsheet). When you open a CSV file in a data grid, it automatically places each datum in a cell by assigning each line of text to a data grid row, and by separating each datum into its corresponding data grid column. This is depicted in the Excel spreadsheet image below with "John" in cell "A1" and "Jones" in cell "B1."
The meaning of the data in a CSV file are conveyed by instruction in the spreadsheet to identify any data in cell position A1 as the patient's first name and any data in cell B1 as the patient's last name. This is a novel (patented) method for conveying data meaning between a CSV (or other delimited files) and a data grid based on the location/position of the data; I invented this process in the 1990's and have been using it since.
Data Presentation
In addition to conveying the meaning of data, XML and CSV structures also enable the data to be presented (displayed/rendered) in reports. Both XML and CSV methods use templates containing the formatting instructions that render the data. There are multiple ways to present XML data, including the use of JavaScript, CSS, XSL, and XHTML. CSV data, on the other hand, can be presented via spreadsheets and database report writers. Note that it is easy to convert XML to CSV and I have an open source program that demonstrates how to do it at this link.
Simplicity & Efficiency
When dealing with numeric values and short text strings, it's easy to see that CSV data structures are much simpler and more efficient than XML. This is distinction is clearly depicted in continuity of care documents (CCDs) and continuity of care records (CCRs), as shown below, which compare the lab test results section of a CCD, CCR and CSV file. They all contain the same essential data, which are highlighted in yellow.
XML Structured Files: Actual Examples of XML code in a CCD and CCR
CCD Lab Results Section Example
The following is an example of the lab section of a CCD. It contains 2,456 characters.
<component>
<section>
<templateId root="2.16.840.1.113883.10.20.1.14"/>
<code code="30954-2" codeSystem="2.16.840.1.113883.6.1"/>
<title>Results</title>
<text>
<table border="1" width="100%">
<thead>
<tr><th> </th><th>2000-03-23</th></tr>
</thead>
<tbody>
<tr><td>HGB (13-18 g/dl)</td><td>13.2</td></tr>
<tr><td>WBC (4.3-10.8 10+3/ul)</td><td>6.7</td></tr>
<tr><td>PLT (135-145 meq/l)</td><td>123*</td></tr>
</tbody>
</table>
</text>
<entry typeCode="DRIV">
<organizer classCode="BATTERY" moodCode="EVN">
<templateId root="2.16.840.1.113883.10.20.1.32"/>
<id root="7d5a02b0-67a4-11db-bd13-0800200c9a66"/>
<code code="43789009" codeSystem="2.16.840.1.113883.6.96" displayName="CBC WO DIFFERENTIAL"/>
<statusCode code="completed"/>
<effectiveTime value="200003231430"/>
<component>
<observation classCode="OBS" moodCode="EVN">
<templateId root="2.16.840.1.113883.10.20.1.31"/>
<id root="107c2dc0-67a5-11db-bd13-0800200c9a66"/>
<code code="30313-1" codeSystem="2.16.840.1.113883.6.1" displayName="HGB"/>
<statusCode code="completed"/>
<effectiveTime value="200003231430"/>
<value xsi:type="PQ" value="13.2" unit="g/dl"/>
<interpretationCode code="N" codeSystem="2.16.840.1.113883.5.83"/> <referenceRange>
<observationRange>
<text>M 13-18 g/dl </text>
</observationRange>
</referenceRange>
</observation>
</component>
<component>
<observation classCode="OBS" moodCode="EVN">
<templateId root="2.16.840.1.113883.10.20.1.31"/>
<id root="8b3fa370-67a5-11db-bd13-0800200c9a66"/>
<code code="33765-9" codeSystem="2.16.840.1.113883.6.1" displayName="WBC"/>
<statusCode code="completed"/>
<effectiveTime value="200003231430"/>
<value xsi:type="PQ" value="6.7" unit="10+3/ul"/>
<interpretationCode code="N" codeSystem="2.16.840.1.113883.5.83"/>
<referenceRange>
<observationRange>
<value xsi:type="IVL_PQ">
<low value="4.3" unit="10+3/ul"/>
<high value="10.8" unit="10+3/ul"/>
</value>
</observationRange>
</referenceRange>
</observation>
</component>
<component>
<observation classCode="OBS" moodCode="EVN">
<templateId root="2.16.840.1.113883.10.20.1.31"/>
<id root="80a6c740-67a5-11db-bd13-0800200c9a66"/>
<code code="26515-7" codeSystem="2.16.840.1.113883.6.1" displayName="PLT"/>
<statusCode code="completed"/>
<effectiveTime value="200003231430"/>
<value xsi:type="PQ" value="123" unit="10+3/ul"/>
<interpretationCode code="L" codeSystem="2.16.840.1.113883.5.83"/>
<referenceRange>
<observationRange>
<value xsi:type="IVL_PQ">
<low value="150" unit="10+3/ul"/>
<high value="350" unit="10+3/ul"/>
</value>
</observationRange>
</referenceRange>
</observation>
</component>
</organizer>
</entry>
</section>
</component>
CCR Lab Results Section Example
The following is an example of the lab section of a CCR. It contains 1,241 characters, which is about half the size of the CCD.
<Results>
<Result>
<Test>
<DateTime>
<Type>
<Text>Collection start date</Text>
</Type>
<ExactDateTime>2000-03-23</ExactDateTime>
</DateTime>
<Description>
<Code>
<Value>30954-2</Value>
<CodingSystem>LOINC</CodingSystem>
</Code>
<Text>HGB</Text>
</Description>
<TestResult>
<Value>13.2</Value>
<Units>
<Unit>g/dl</Unit>
</Units>
</TestResults>
<NormalResult>
<Normal>
<Description>
<Text>13 - 18</Text>
</Description>
</Normal>
</NormalResult>
</Test>
<Test>
<DateTime>
<Type>
<Text>Collection start date</Text>
</Type>
<ExactDateTime>2000-03-23</ExactDateTime>
</DateTime>
<Description>
<Code>
<Value>33765-9</Value>
<CodingSystem>LOINC</CodingSystem>
</Code>
<Text>HGB</Text>
</Description>
<TestResult>
<Value>6.7</Value>
<Units>
<Unit>10+3>ul</Unit>
</Units>
</TestResult>
<NormalResult>
<Normal>
<Description>
<Text>4.3 - 10.8</Text>
</Description>
</Normal>
</NormalResult>
</Test>
<Test>
<DateTime>
<Type>
<Text>Collection start date</Text>
</Type>
<ExactDateTime>2000-03-23</ExactDateTime>
</DateTime>
<Description>
<Code>
<Value>26515-7</Value>
<CodingSystem>LOINC</CodingSystem>
</Code>
<Text>PLT</Text>
</Description>
<TestResult>
<Value>123*</Value>
<Units>
<Unit>meg/l</Unit>
</Units>
</TestResult>
<NormalResult>
<Normal>
<Description>
<Text>135-145</Text>
</Description>
</Normal>
</NormalResult>
</Test>
</Result>
</Results>
CSV Lab Results Data
The following shows the actual contents of a CSV file containing the same essential lab data as the CCD and CCR. It can produce the exact same report as the CCD and CCR, yet contains only 128 characters, which is about 1/20th (5%) the size of the CCD:
Results
HGB, g/dl,13-18,13.2,30954-2,2000-03-23
WBC, 10+3/ul,4.3-10.8,6.7,33765-9,2000-03-23
PLT, meq/l,135-145,123,26515-7,2000-03-23
Summary of the CCD, CCR, and CSV Differences
The differences between the CCD, CCR, and CSV are as follows:
- The CCD uses XML markup tags along with embedded HTML formatting tags, as well as extensive
- The CCR also uses XML, but contains no HTML and has much less metadata. It requires half as many characters as the CCR, but 10 times the CSV.
- The CSV file has no markup tags of any kind and only minimal metadata. It has the fewest characters by far. Like the other two formats, the CSV can generate CCR and CCD reports containing the same data elements.
- The CSV data structure is also much simpler than the XML structures.
- The CCD and CCR use complex Extensible Stylesheet Language (XSL) templates to transform and render the XML files, while the CSV is used by a spreadsheet template.
It's indisputable that the CSV data format has much greater simplicity and efficiency than XML, especially when storing and transmitting numeric data and small chunks of text. Nevertheless, XML data formats are more popular because few people have recognized that using data grids (spreadsheets) in a novel way: (a) conveys the meaning of the data stored in a CSV file and (b) enables the CSV's contents to be presented in rich reports.
But what's so important about simplicity and efficiency? Well, for one thing, it's a benefit when conservation of resources is important, such as minimizing bandwidth use, computer processing time, and storage space. And conservation of resources is crucial when time and money are important considerations, as well when accommodating people who lack the luxury of broadband Internet, large hard drives, and high speed computers. Also, in a world where technology is becoming increasing complex, simplicity is a "breath of fresh air;" in fact, we ought to be making things less costly and complex, not more!
Continued at this link. Print this post




4 comments:
Stephan,
Thanks for raising this issue. I’d like to provide the opposing viewpoint.
CSV stands for “Comma-Separated Values”. It is a simple columnar text file where the columns are separated by commas. One can think of a CSV file as a stripped-down Excel spreadsheet file that holds only “values” – no formatting or formulas. In fact, when you double click on a “.CSV” file in Windows, it opens in Excel by default. Most of us exchange raw data with our peers by e-mailing around CSV files. The strength of a CSV file – or an HL7 v2.x file, for that matter – is that it is a highly compact way of sending data across a network. It holds “nothing but the facts” and, consequently, it is easy to send around as an e-mail attachment, (or as a HL7 v 2.x bulk file transfer) even if it holds a large number of records. This is the primary appeal of simple text file data exchange standards.
The first drawback of a CSV file is that it includes no metadata, i.e., there is nothing in the CSV file that explains its contents. When you double-click on a CSV file and it opens in Excel there is nothing that tells you what kind of data is in each column (unless the sender was kind enough to add a row at the top with the column names). Therefore, in order to make any sense of a CSV file that someone sends to you, you have to be able to (in effect) open it in the same spreadsheet that was used to generate the file; complete with the column headers, column data formats and column definitions that the sender used to create the file. This is effectively what HL7 v2.x does for you: When someone sends you a simple HL7 v2.x “text file”, the HL7 v2.x standard tells you how to read that file.
The insurmountable problem with CSV or HL7 v2.x (well, one of them) is that there can be no assurances that the contents of each record are valid and in conformance with the standard. The person or system that created the CSV file or HL7 v2.x record can put anything they want “between the commas” before they send it to you – and they generally do! The “Garbage in, garbage out” problem is pervasive in the health care industry.
The key advantages of an XML-based messaging standard are that:
1) All of the information needed to interpret the contents of the message (i.e., the “metadata”) is included in the message, itself. This is why, as Stephan points out, an XML record is many times the size of a CSV record (and why a HL7 v3 record is many times the size of a HL7 v2.x record).
2) A XML message can be validated against standards before it is [generated, sent, transported, received, processed]. Thus, bad data can be stopped at the source. The generator/sender can be required to correct errors in the data before it is ever sent/processed. In the Health Care industry, the ability to prevent the generation of bogus data at the source would be HUGE! The industry spends [at least] hundreds of millions of dollars every year dealing with the consequences of erroneously entered data.
3) The flexibility of the XML format means that the sender can send as much or a little data as required by the context, within a single messaging framework. If all that’s required to submit a claim is the provider’s NPI, the ICD9 diagnosis code and the CPT code for the procedure, that’s fine. However, if the provider also needs/wants to submit additional information to a “registry” with more details about the procedure and its outcome, the same messaging framework could allow him/her to provide very detailed information from the “surgical notes” for the procedure within the same general framework (using HL7 v3.x and SNOMED-CT).
4) It’s easy to add additional types of data to an XML messaging framework without having to go back and change existing sections of the framework. In a CSV/HL7 v2.x type framework, everything has to fit on that one page of the “spreadsheet” (or you have to add a whole new “worksheet” to the spreadsheet).
All of this is worth the added bandwidth requirements!
Steve Elkins
Thanks for your thoughtful comment, Steve. I reply to it at this link.
Hmm, the beginnings of a "religious" debate.
Each has benefits which can be exploited in different scenarios.
The advantage of CVS is the compact size of the data. If high speed links fail, CVS will much more easily function on very look speed links. Think of falling back to a 54K baud phone line... do able with CVS files.
The advantage of XML is ease of recoverability. With XML, a human can easily recover data in the event of a disk crash or very dirty network connection.
All said and done, XML will more than likely become the data format of choice for exchanging data between the various vendors EMR/EHR systems, simply for the ease of debugging .
Thanks much for your comment! My reply follows.
You said: “The advantage of CVS is the compact size of the data. If high speed links fail, CVS will much more easily function on very look speed links. Think of falling back to a 54K baud phone line... do able with CVS files.”
My reply: This is one of many advantages of using CSV data files with data grid templates.
You said: “The advantage of XML is ease of recoverability. With XML, a human can easily recover data in the event of a disk crash or very dirty network connection.”
My reply: With the method I’ve been describing, you just open the CSV data file in its data grid template and the labels (metadata) are instantaneously applied to those data. A human can then clearly see and easily recover the data. You can even color-coded the cell in the template to facilitate recognition of data patterns as shown at this link. I assert that this makes it even easier to examine data set than by visually inspecting an XML document.
You said: “All said and done, XML will more than likely become the data format of choice for exchanging data between the various vendors EMR/EHR systems, simply for the ease of debugging.”
My reply: I’d even say that XML is currently the default format for EMR/EHR systems--not because it’s better in any way (or even as good as other methods), but simply because people are more aware of it. Anyway, as far as debugging ease goes, what’s so difficult about debugging data grid templates consisting of color-coded, commented, labeled, and logically organized spreadsheets?
Post a Comment