Why self-describing data formats:

Anne & Lynn Wheeler lynn at garlic.com
Sat Jun 9 15:17:17 EDT 2007


James A. Donald wrote:
> Many protocols use some form of self describing data format, for example 
> ASN.1, XML, S expressions, and bencoding.
> 
> Why?

GML (precursor to SGML, HTML, XML, etc.)
http://www.garlic.com/~lynn/subtopic.html#sgml

was invented at the science center in 1969 
http://www.garlic.com/~lynn/subtopic.html#545tech

... some recent (science center) topic drift/references in this post
http://www.garlic.com/~lynn/2007l.html#65 mainframe = superserver

"G", "M", & "L" were individuals at the science center ... so the
requirement was to come up with an acronym from the inventors initials

so some of the historical justification for the original "markup language" paradigm
can be found in the references below.

originally CMS had the script command for document formatting ... using
"dot" format commands ... i.e. the science center on the 4th floor of 545 tech sq
was doing virtual machines, cp67, cms, the internal network, etc ... and multics
was on the 5th floor of 545 tech sq ... both drew on a common heritage from CTSS
(and some of the unix heritage also traces back thru multics to CTSS).

the original GML was sort of a combination of "self-describing" data (somewhat for
legal documents) 
http://www.sgmlsource.com/history/roots.htm
http://xml.coverpages.org//sgmlhist0.html

and document formatting ... when GML tag formatting was added to the CMS script
processing command. Later you find a big CMS installation at CERN ... and HTML
drawing its heritage from the "waterloo" clone of the CMS script command.
http://infomesh.net/html/history/early

the first webserver in the states was at slac (a CERN "sister" location) ... another
big vm/cms installation:
http://www.slac.stanford.edu/history/earlyweb/history.shtml

recent historical post/reference
http://www.garlic.com/~lynn/2007d.html#29 old tapes

last time I checked, the w3c headquarters was around the corner from the old
science center location at 545 tech sq.

before GML, the science center had an activity involving "performance" data
from the time-sharing service (originally using the virtual machine cp67 service
and then transitioning to vm370) ... lots of system activity data was captured
every 5-10 minutes and then archived to tape ... starting in the mid-60s ...
by the mid-70s there was a decade of data spanning lots of different configurations,
workloads, etc. The original intention when the system activity data was being
archived was to include enough self-describing information that the data could
be interpreted many years later. lots of past posts about using cp67 & vm370
for time-sharing services (both for internal corporate use and for customers offering
commercial, online time-sharing services on the platform)
http://www.garlic.com/~lynn/subtopic.html#timeshare
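
as an aside, a minimal sketch (python, with made-up field names and a modern
JSON serialization standing in for whatever the actual tape format was) of what
"self-describing" buys you ... the record carries enough metadata to be
interpreted without the program that wrote it:

  import json

  # a self-describing record: field names, units, and a schema tag travel
  # with the data itself, so a reader years later needs no out-of-band
  # knowledge to interpret it (field names here are hypothetical)
  record = {
      "schema": "perf-sample/1",
      "captured": "1975-03-01T10:05:00Z",
      "fields": {
          "cpu_busy_pct":    {"value": 87.5, "unit": "percent"},
          "paging_rate":     {"value": 42,   "unit": "pages/sec"},
          "users_logged_on": {"value": 61,   "unit": "count"},
      },
  }

  archived = json.dumps(record)      # what would have gone to tape
  restored = json.loads(archived)    # decades later: still interpretable
  for name, f in restored["fields"].items():
      print(name, "=", f["value"], f["unit"])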

lots of past posts about long-term performance monitoring, workload profiling,
benchmarking, and stuff leading up to things like capacity planning
http://www.garlic.com/~lynn/subtopic.html#benchmark

much later, you find things like ASN.1 encoding being used to handle
interoperability of network-transmitted data between platforms that might have
different information representation conventions (like the whole
little-endian/big-endian issue).
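
a small sketch (python) of the kind of representation difference involved ...
the same 32-bit integer laid out little-endian vs big-endian:

  import struct

  n = 0x0A0B0C0D

  little = struct.pack("<I", n)   # x86-style, least significant byte first
  big    = struct.pack(">I", n)   # network / big-endian byte order

  print(little.hex())   # 0d0c0b0a
  print(big.hex())      # 0a0b0c0d

  # a raw memory dump of n differs across platforms; an interoperable wire
  # format has to pin one representation down (ASN.1 DER integers, for
  # example, are big-endian, minimal-length, two's complement)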

one of the things swirling around digital signature activity in the mid-90s
was an almost religious belief that digital certificate encoding mandated
ASN.1.

other digital signature operations that were less religious about PKI,
x.509 identity digital certificates, etc ... were much less strict
about the encoding technique for digitally signed operations ... including
certificateless digital signature infrastructures
http://www.garlic.com/~lynn/subpubkey.html#certless

One of the battles during the period between XML and ASN.1 proponents
was that XML didn't provide a deterministic encoding.
It was somewhat of a red herring on the digital certificate/ASN.1
side ... since they were looking at always keeping things ASN.1-encoded
(not just for transmission) ... and only decoding when some specific
piece of information needed to be extracted.
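
the complaint is easy to demonstrate (python sketch) ... two serializations
that any XML parser treats as the same document produce different octets, so a
signature computed over one doesn't verify against the other:

  import hashlib

  # attribute order (and whitespace) are not significant to an XML parser,
  # but they change the bytes a digital signature is computed over
  a = b'<check amount="100.00" currency="USD"/>'
  b = b'<check currency="USD" amount="100.00"/>'

  print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False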

On the other side were places like FSTC, which was defining a digitally
signed electronic check convention (with transmission over ACH or ISO 8583).
There was already a transmission standard ... which ASN.1 encoding would
severely bloat ... not to mention the horrible payload bloat that resulted
from any certificate-based infrastructure needing to append
redundant and superfluous digital certificates.

FSTC just defined appending a digital signature to the existing payload.
The issue then became a deterministic encoding of the information
at the times the digital signature was generated and verified. If you
temporarily encoded the payload as XML, generated the digital signature
... and then appended the digital signature to the standard (ACH or
ISO 8583) payload ... the problem was that at the other end,
XML didn't provide a deterministic encoding methodology, so the
recipient couldn't reliably re-encode the payload and verify the digital
signature. So FSTC eventually defined some additional rules for
XML, called FSML ... which were then turned over to W3C as part of
the XML digital signature activity.
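
a rough sketch (python) of that flow ... with HMAC standing in for a real
public-key signature, and a made-up canonicalization rule (FSML's actual rules
were considerably more involved); the payload is only encoded deterministically
long enough to sign, the signature travels with the native-format payload, and
the recipient re-canonicalizes to verify:

  import hashlib, hmac

  KEY = b"demo-signing-key"   # stand-in for a real signing key

  def canonical(fields):
      # hypothetical deterministic encoding: sorted key=value pairs,
      # newline-separated ... the point is only that both ends derive
      # identical bytes from the same native payload
      return "\n".join(f"{k}={fields[k]}" for k in sorted(fields)).encode()

  def sign(fields):
      return hmac.new(KEY, canonical(fields), hashlib.sha256).hexdigest()

  def verify(fields, signature):
      return hmac.compare_digest(sign(fields), signature)

  payload = {"payee": "Acme", "amount": "100.00", "currency": "USD"}
  sig = sign(payload)                    # appended to the ACH/ISO 8583 record
  print(verify(payload, sig))            # True
  print(verify(dict(payload, amount="999.00"), sig))  # False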

There was something of a cultural clash between the FSTC orientation and
much of the x.509 standards environment. In the FSTC world ... the information
is only temporarily encoded for digital signature generation and verification;
the rest of the time, the data is in some native usable form. In the x.509
standards environment, the data tends to always remain encoded in ASN.1
format ... and is only (temporarily) decoded when it actually needs
to be used/accessed.

