Monday, December 28, 2009

Dueling Data Formats - Continued

This is in response to Steve Elkins' critical comments about my post regarding XML and CSV data formats at this link.

Let me begin by thanking Steve for his comments about the use of the CSV data format compared to XML. It's only through such debate that useful understanding can emerge and progress be made.

As Steve Elkins said, a CSV (comma-separated values) file is just rows of raw data (no formatting or formulas) in which each data element (piece of data) is separated by a comma (the delimiter). This makes CSV arguably the most efficient (least resource-consuming) way to store and send data.

Enabling a CSV file to provide all the benefits of XML (and more) requires using CSV in a novel way. As I described elsewhere on my blog, this is done using a novel software method I invented, called CP Split™, which splits content (data and information) from presentation (the formatting instructions that render it for viewing).

What makes the CP Split method unique is the way it uses pairs of data grid template software (automated spreadsheet programs) coupled with CSV data files. One template creates the data file and the other consumes (parses) and renders it. This method provides all the advantages of XML (and more), as well as the ability to consume and use XML, while avoiding XML's disadvantages (i.e., its complexity, verbosity, and bandwidth requirements). In what follows, I explain this novel software process in response to Steve Elkins' four points.

Point 1 – Steve Elkins wrote:
The first drawback of a CSV file is that it includes no metadata [data about the data], i.e., there is nothing in the CSV file that explains its contents…[With XML, on the other hand, all] of the information needed to interpret the contents of the message (i.e., the "metadata") is included in the message, itself. This is why, as Stephen [Beller] points out, an XML record is many times the size of a CSV record (and why a HL7 v3 record is many times the size of a HL7 v2.x record).
My (Steve Beller's) reply:

While it's true that a CSV record (i.e., a CSV data file) is much smaller, our novel technology interprets the meaning (semantics) of a data file's contents using the data grid templates. By way of example, take the CCR shown at this link. The metadata contained within the CCR's markup tags (i.e., between the angle brackets <…>) for the first blood test result is written in XML code as:
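A hypothetical CCR-style fragment carrying these same fields might look like the following (the element names here are illustrative, not the exact CCR schema):

```xml
<!-- Hypothetical, simplified CCR-style fragment; element names are illustrative -->
<Result>
  <DateTime>
    <ExactDateTime>2000-03-23</ExactDateTime>
  </DateTime>
  <Test>
    <Description>
      <Text>HGB</Text>
      <Code>
        <Value>30954-2</Value>
        <CodingSystem>LOINC</CodingSystem>
      </Code>
    </Description>
    <TestResult>
      <Value>13.2</Value>
      <Units><Unit>g/dl</Unit></Units>
    </TestResult>
    <NormalResult>
      <Normal><Value>13 - 18</Value></Normal>
    </NormalResult>
  </Test>
</Result>
```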

Using the metadata in the XML code provides the following interpretation: Blood collected on 2000-03-23 was tested for HGB (hemoglobin). The test had a LOINC description code of 30954-2. The data value of the test was 13.2 g/dl and the normal (reference) range for the test was between 13 and 18.

In contrast, using the CP Split method, a data grid template automatically arranges the raw data (without the metadata) into particular cells in a grid (spreadsheet). For example, it might arrange the blood test result data above in a row of adjacent cells in which cell A1 (the cell in the first column, A, and first row, 1) contains "2000-03-23", cell B1 contains "LOINC 30954-2", cell C1 contains "HGB", and so on. When it converts the cell contents to a CSV data file, the first row of cells in the grid becomes a line of data containing those data separated by commas, as such:

2000-03-23,LOINC 30954-2,HGB,13.2,g/dl,13,18
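A minimal sketch of this grid-row-to-CSV conversion in Python, using the standard library's csv module (the field layout is the one described above):

```python
import csv
import io

# One grid row: each cell position carries a fixed, predetermined meaning
# (date, code, test name, value, unit, lower range, upper range).
grid_row = ["2000-03-23", "LOINC 30954-2", "HGB", "13.2", "g/dl", "13", "18"]

# Writing the row produces one comma-separated line of the CSV data file.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(grid_row)
csv_line = buf.getvalue().strip()
print(csv_line)  # 2000-03-23,LOINC 30954-2,HGB,13.2,g/dl,13,18
```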

The result is that the XML requires 432 characters to store the data set, while the CSV data file requires only 44 (nearly a ten-fold size reduction).

The CSV data file, however, doesn't have any metadata, as Steve Elkins pointed out. So, lacking such metadata, what interprets the 1st group of numbers (to the left of the first comma) as the date of the blood collection, the 2nd as the terminology standard code, the 3rd as the name of the test, the 4th as the value, the 5th as the unit measure, the 6th as the lower normal range value, and the 7th as the upper normal range value?!? The answer follows.

As I mentioned earlier, the CP Split method uses two corresponding data grid template (spreadsheet) software programs. The first template, called the Publisher Template (PT), creates the CSV data file. The second, called the Subscriber Template (ST), consumes, interprets and renders the data file's contents. Here are the basic steps of this process:
  1. After obtaining the necessary data from the data source(s), the PT arranges the data into meaningfully/logically organized preplanned structures in which each data element is stored in a particular predetermined cell in a particular data grid (spreadsheet).
  2. The PT then takes the data from the grid and stores them in a CSV data file.
  3. The PT then ships the CSV data file to the corresponding Subscriber Template (ST), which "knows" what the data in the CSV file mean based on their location in the file. That is, the ST interprets each data element's meaning based on the datum's row number and the number of commas preceding it on that row. The ST gets this "knowledge" during the construction of the PT and ST, whereby the metadata is stored in the templates, not in the CSV data file.
  4. The ST then consumes (parses) the CSV data file and sends its contents into the cells of a data grid in a way that reflects the location of the data in the CSV file (e.g., the first data element on the first row of the CSV is put in cell "A1" of the data grid). This maintains the ST's knowledge of what the data mean.
  5. The ST may then:
    • Export those data from the data grid to a database, and/or
    • Retrieve the data from the data grid and present them in dynamic reports by applying any required labels, formatting instructions, and analytic algorithms.

Bottom line: As the data move from the PT to the ST via shipment of the CSV data file, there is no loss of data meaning even though the CSV file contains no metadata!
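The PT-to-ST roundtrip described in the steps above can be sketched in Python. The schema list below stands in for the positional metadata that, in the CP Split method, lives in the two templates rather than in the data file; the names are illustrative, not part of the patented implementation:

```python
import csv
import io

# Positional metadata shared by Publisher and Subscriber templates,
# established at construction time and never placed in the CSV file.
SCHEMA = ["collection_date", "code", "test_name", "value", "unit", "low", "high"]

def publish(records):
    """Publisher Template sketch: arrange data by position, emit CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rec in records:
        writer.writerow(rec[field] for field in SCHEMA)
    return buf.getvalue()

def subscribe(csv_text):
    """Subscriber Template sketch: recover each datum's meaning purely
    from its row and column position in the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return [dict(zip(SCHEMA, row)) for row in reader]

record = {"collection_date": "2000-03-23", "code": "LOINC 30954-2",
          "test_name": "HGB", "value": "13.2", "unit": "g/dl",
          "low": "13", "high": "18"}

# The roundtrip loses no meaning even though the CSV carries no metadata.
assert subscribe(publish([record])) == [record]
```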

Point 2 – Steve Elkins wrote: "A XML message can be validated against standards before it is generated and/or accepted. Thus, bad data can be stopped at the source."

My reply: The same can be done via the PT and ST. Any validation rules can be written in the PT by which data are validated before generating the CSV data file. And any rules can be written in the ST by which data are validated before the CSV data file is accepted.
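As a sketch of what such rules might look like (assuming the seven-field row layout used earlier; the specific checks are illustrative), the same function could run in a PT before the CSV file is generated or in an ST before it is accepted:

```python
def validate_row(row):
    """Illustrative rules a PT could apply before writing, or an ST before
    accepting, a row laid out as: date, code, name, value, unit, low, high."""
    if len(row) != 7:
        return False                      # wrong number of fields
    date, code, name, value, unit, low, high = row
    try:
        float(value), float(low), float(high)
    except ValueError:
        return False                      # non-numeric result or range
    return float(low) < float(high) and bool(name) and bool(unit)

assert validate_row(["2000-03-23", "LOINC 30954-2", "HGB",
                     "13.2", "g/dl", "13", "18"])
assert not validate_row(["2000-03-23", "LOINC 30954-2", "HGB",
                         "abc", "g/dl", "13", "18"])
```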

Point 3 - Steve Elkins wrote: "The flexibility of the XML format means that the sender can send as much or a little data as required by the context, within a single messaging framework."

My reply: The same can be done by the PT, i.e., it can include any portions of any data set in a single CSV data file, as necessary. In addition, a CSV data file can be decomposed so that only certain parts are sent to, or accessed by, an ST based on any context rules. And multiple CSV data files sent from one or more PTs to a single ST can be combined by the ST into a single data file and/or composite report.
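Both operations can be sketched simply. The example below assumes, purely for illustration, that each CSV row's first field identifies its record type, which the context rules then filter on:

```python
def decompose(csv_lines, allowed_types):
    """Keep only rows whose type field the context rules permit."""
    return [line for line in csv_lines
            if line.split(",", 1)[0] in allowed_types]

def combine(*csv_files):
    """Merge CSV data files from multiple PTs into one file for a single ST."""
    return [line for f in csv_files for line in f]

labs = ["LAB,2000-03-23,HGB,13.2", "LAB,2000-03-24,WBC,6.1"]
meds = ["MED,2000-03-23,aspirin,81mg"]

merged = combine(labs, meds)          # one ST receives both files
assert decompose(merged, {"LAB"}) == labs  # but a context rule limits access
```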

Point 4 - Steve Elkins wrote: "It's easy to add additional types of data to an XML messaging framework without having to go back and change existing sections of the framework."

My reply: Any types of data can be added easily to a CSV data file by the PT. Just as XML would require a change to the XSD/DTD schema file if a new data type is added, the PT and ST would have to be adjusted accordingly. In addition, although not previously discussed, the CSV data files and the templates can manage hierarchical data as does XML.
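One common way to carry hierarchy in flat CSV rows is a parent-reference column, from which the nesting can be rebuilt much as XML's element nesting expresses it. This is a generic sketch, not the patented method's actual representation; the id/parent_id convention is an assumption:

```python
# Each row: id, parent_id ("" for a root row), label.
rows = [
    ["1", "",  "Patient"],
    ["2", "1", "Results"],
    ["3", "2", "HGB 13.2 g/dl"],
]

def children(rows, parent_id):
    """Return the rows nested directly under the given parent row."""
    return [r for r in rows if r[1] == parent_id]

assert [r[2] for r in children(rows, "1")] == ["Results"]
assert [r[2] for r in children(rows, "2")] == ["HGB 13.2 g/dl"]
```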


Anything XML can do, CSV can also do using the novel method I describe above. And using CSV data files and data grid templates has distinct advantages in terms of speed, simplicity, and resource conservation. Another advantage of CSV and data grids, not discussed above, is the ease with which graphs can be presented: data can be stored in preconfigured arrays in the CSV data file and read by graph-generating templates in the ST, all without the extra overhead and complexity of XSLT data transformation.

I do believe, however, that XML has an important role to play, especially when dealing with large blocks of text (as opposed to numeric values and short data strings). I'm very interested in discussing the pros and cons of using CSV data files via the patented CP Split method compared to using XML for particular use cases, and in examining how the two methods can complement each other.


In response to a comment received, I added a screenshot of a Publisher grid template and the CSV file it creates at this link.