Uncommon ASCII control code delimiters

So much code worries about using “safe” delimiters, quoting, and the like. I’m probably getting myself in trouble, but I just used a very old-school ASCII delimiter in my code.

In particular I’m using ASCII 28 (File Separator), aka U+001C (Information Separator Four). It’s one of the ASCII control codes reserved for separating data, and they’re mostly forgotten. I’ve often wondered why we have CSV and TSV files instead of files delimited by ASCII 30 (Record Separator), too. That character is far less likely to occur in source input than a comma or a tab.
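To make the CSV comparison concrete, here is a minimal Python 3 sketch of a table written with ASCII 30 (Record Separator) between rows and ASCII 31 (Unit Separator) between fields. The names `dump_table` and `load_table` are made up for illustration, and the sketch assumes no field ever contains either control character (it refuses to write one that does).

```python
RS = '\x1e'  # ASCII 30, Record Separator: goes between rows
US = '\x1f'  # ASCII 31, Unit Separator: goes between fields

def dump_table(rows):
    """Join fields with US and rows with RS. No quoting or escaping at all."""
    for row in rows:
        for field in row:
            if RS in field or US in field:
                raise ValueError('field contains a separator character')
    return RS.join(US.join(row) for row in rows)

def load_table(text):
    """Split the text back into rows of fields. An empty string means no rows."""
    if text == '':
        return []
    return [record.split(US) for record in text.split(RS)]

# Commas, tabs, quotes and newlines in the data need no special handling.
rows = [['name', 'note'], ['Smith, Jane', 'tabs\tand "quotes"\nare fine']]
assert load_table(dump_table(rows)) == rows
```

The catch is the same one CSV has, just much rarer: if a separator does show up in the data there is no escaping convention to fall back on.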

I hope I’m not getting myself into trouble somehow. My use case is the LoL netlog files: I want to concatenate a JSON blob and the raw text logfile together to POST to a web server. The source data is very, very unlikely to contain any non-printable characters, including U+001C. If it does I’m screwed.

JavaScript code to emit this:

    var uploadText = [JSON.stringify(parsedLog),
                      '\n\u001c',
                      rawLog.substring(0, 1000000)].join('');

And Python to parse it (in a CGI):

    try:
        # Expect exactly one U+001C between the JSON and the raw log
        jsonText, rawText = postdata.split(u'\u001c')
    except ValueError:
        # No separator, or more than one: flag as invalid and keep everything as raw text
        jsonText = '{"valid":false}'
        rawText = postdata
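A possible way to soften the failure case, sketched in Python 3. `JSON.stringify` escapes control characters inside strings (a U+001C in the parsed data comes out as the six characters `\u001c`), so the JSON half never contains a raw U+001C and the first one in the payload has to be the delimiter. Splitting on only the first occurrence therefore survives a stray U+001C in the raw log. The helper name `split_upload` is hypothetical, and `json.dumps` stands in for the browser side here.

```python
import json

FS = '\u001c'  # ASCII 28, File Separator

def split_upload(postdata):
    """Split on the first FS only; return (json_text, raw_text)."""
    json_text, sep, raw_text = postdata.partition(FS)
    if not sep:
        # No separator at all: same fallback as the CGI code
        return '{"valid":false}', postdata
    return json_text, raw_text

# Both halves contain U+001C, yet the round trip still works, because
# json.dumps (like JSON.stringify) escapes it inside the JSON text.
parsed = {'note': 'has \u001c inside'}
raw = 'line one\nodd \u001c byte\n'
payload = json.dumps(parsed) + '\n' + FS + raw

json_text, raw_text = split_upload(payload)
assert json.loads(json_text) == parsed
assert raw_text == raw
```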

Update: there’s now a draft IETF spec to use RS to separate JSON blobs to make it easier to stream-parse big wads of JSON data.
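For the curious, that RS framing was later published as RFC 7464 (JSON Text Sequences): each JSON text is preceded by an RS character and followed by a newline. Below is a simplified Python 3 sketch of the idea, not a conforming parser. It works on a whole string rather than a stream, it skips the spec's rules for truncated texts, and the names `encode_seq` and `decode_seq` are made up.

```python
import json

RS = '\x1e'  # ASCII 30, Record Separator

def encode_seq(values):
    """Frame each value as RS + JSON text + newline."""
    return ''.join(RS + json.dumps(v) + '\n' for v in values)

def decode_seq(text):
    """Yield one parsed value per RS-delimited chunk, skipping empty chunks."""
    for chunk in text.split(RS):
        chunk = chunk.strip()
        if chunk:
            yield json.loads(chunk)

values = [{'a': 1}, [1, 2, 3], 'plain string']
assert list(decode_seq(encode_seq(values))) == values
```

Because JSON escapes control characters inside strings, a raw RS can only ever mark the start of a new text, which is what makes resynchronizing after a bad chunk possible.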

Nelson's log

A personal work journal
