Python - Fixed Format Files, A COBOL Legacy: The codecs Module

	Chapter 34. File Formats: CSV, Tab, XML, Logs and Others

Fixed Format Files, A COBOL Legacy: The `codecs` Module

Files that come from COBOL programs have three characteristic features:

The file layout is defined positionally. There are no delimiters or separators on which to base file parsing. The file may not even have \n characters at the end of each record.
The data is usually encoded in EBCDIC, not ASCII or Unicode.
The file may include packed decimal fields; these are numeric values represented with two decimal digits (or a decimal digit and a sign) in each byte of the field.

The first problem requires figuring out the starting position and size of each field. In some cases, there are no gaps (or filler) between fields; then the field sizes alone are enough, because each starting position is the sum of the preceding sizes. Once we have the position and size, we can use a string slice operation to pick those characters out of a record. The code is simply aLine[start:start+size].
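As a minimal sketch of the slicing idea, the following builds the positions from the sizes and then slices a record. The record contents and the field names (name, code, year) are made up for illustration; they are not part of any real layout.

```python
# Hypothetical layout with no filler: only the sizes are given.
sizes = [("name", 12), ("code", 4), ("year", 4)]

# Each starting position is the running total of the preceding sizes.
fields = []
start = 0
for name, size in sizes:
    fields.append((name, start, size))
    start += size

# A made-up 20-character record, padded to the field widths.
aLine = "WIDGET".ljust(12) + "A01".ljust(4) + "2004"

# Pick each field out of the record with a slice.
record = {}
for name, start, size in fields:
    record[name] = aLine[start:start+size]
```

Note that the padding blanks stay in the sliced values; whether to strip them is a decision for the application.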

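We can tackle the second problem using the codecs module to decode the EBCDIC characters. The result of codecs.getdecoder('cp037') is a function that you can use as an EBCDIC decoder. That function returns a tuple of the decoded text and the number of bytes consumed, so the text we want is the first item of the result.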
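A short sketch of the decoder in use follows. The sample bytes are an assumption chosen for illustration: they are the cp037 encoding of the word "Hello". The names decode_ebcdic, raw, text and consumed are ours, not part of the codecs module.

```python
import codecs

# Look up the decoding function for the cp037 (EBCDIC) code page.
decode_ebcdic = codecs.getdecoder('cp037')

# Sample data: the cp037 encoding of "Hello".
raw = b'\xc8\x85\x93\x93\x96'

# The decoder returns a pair: the decoded text and the bytes consumed.
text, consumed = decode_ebcdic(raw)
print(text, consumed)
```

Other EBCDIC code pages exist; cp037 is simply the one this section uses, and a given file may need a different one.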

The third problem requires that our program know the data type as well as the position and size of each field. If we know the data type, then we can do EBCDIC conversion or packed decimal conversion as appropriate. This is a much more subtle algorithm, since we have two strategies for converting the data fields. See the section called “Strategy” for some reasons why we'd do it this way.

In order to mirror COBOL's largely decimal world-view, we will need to use the decimal module for all numbers and arithmetic.
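The following small sketch suggests why decimal is the better fit. The amounts and the variable names are invented for illustration.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 or 0.20 exactly,
# so the sum is not exactly 0.30.
float_total = 0.10 + 0.20

# Decimal keeps exact decimal digits, much as a COBOL numeric field does.
price = Decimal('0.10')
tax = Decimal('0.20')
decimal_total = price + tax

print(float_total == 0.30)               # False
print(decimal_total == Decimal('0.30'))  # True
```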

We note that the presence of packed decimal data changes the file from text to binary. We'll begin with techniques for handling a text file with a fixed layout. However, since this often slides over to binary file processing, we'll move on to that topic, also.

Reading an All-Text File. If we ignore the EBCDIC and packed decimal problems, we can easily process a fixed-layout file. The way to do this is to create a handy structure that defines our record layout. We can use this structure to parse each record, transforming the record from a string into a dictionary that we can use for further processing.

In this example, we also use a generator function, yieldRecords, to break the file into individual records. We separate this functionality out so that our processing loop is a simple for statement, as it is with other kinds of files. In principle, this generator function can also check the length of recBytes before it yields it. If the block of data isn't the expected size, the file was damaged and an exception should be raised.

layout = [ 
    ( 'field1', 0, 12 ),
    ( 'field2', 12, 4 ),
    ( 'anotherField', 16, 20 ),
    ( 'lastField', 36, 8 ),
]
reclen= 44

def yieldRecords( aFile, recSize ):
    recBytes= aFile.read(recSize)
    while recBytes:
        yield recBytes
        recBytes= aFile.read(recSize)

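# open() replaces the older file() built-in; 'rb' avoids newline translation.
with open( 'my.cobol.file', 'rb' ) as cobolFile:
    for recBytes in yieldRecords(cobolFile, reclen):
        record = dict()
        for name, start, size in layout:
            # Slice by the field's size, not the built-in len function.
            record[name]= recBytes[start:start+size]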
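The length check mentioned above could be sketched as follows. The function name yieldCheckedRecords and the choice of ValueError are assumptions made for this illustration; an in-memory io.BytesIO object stands in for a real file.

```python
import io

def yieldCheckedRecords(aFile, recSize):
    """Yield fixed-size records; raise ValueError on a short block."""
    recBytes = aFile.read(recSize)
    while recBytes:
        # A block shorter than recSize means the file is truncated or damaged.
        if len(recBytes) != recSize:
            raise ValueError(
                "expected %d bytes, got %d" % (recSize, len(recBytes)))
        yield recBytes
        recBytes = aFile.read(recSize)

# In-memory stand-in for a file holding two complete 4-byte records.
good = io.BytesIO(b'AAAABBBB')
records = list(yieldCheckedRecords(good, 4))
```

With a truncated source such as io.BytesIO(b'AAAABB'), the first record is yielded and the second read raises ValueError.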

Reading Mixed Data Types. If we have to tackle the complete EBCDIC and packed decimal problem, we have to use a slightly more sophisticated structure for our file layout definition. First, we need some data conversion functions, then we can use those functions as part of picking apart a record.

We may need several conversion functions, depending on the kind of data that's present in our file. Minimally, we'll need the following two functions.

display: This function is used to get character data. In COBOL, this is called display data. It will be in EBCDIC if our files originated on a mainframe.
packed: This function is used to get packed decimal data. In COBOL, this is called "comp-3" data. In our example, we have not dealt with inserting the decimal point prior to creating a decimal.Decimal object.

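import codecs
ebcdic_decode = codecs.getdecoder('cp037')

def display( data ):
    # The decoder returns (text, bytes consumed); keep only the text.
    text, consumed = ebcdic_decode( data )
    return text

def packed( data ):
    # A bytearray yields integers in both Python 2 and Python 3.
    data = bytearray( data )
    n= [ '' ]  # n[0] is reserved for the sign
    for b in data[:-1]:
        # Each leading byte holds two decimal digits, one per half-byte.
        hi, lo = divmod( b, 16 )
        n.append( str(hi) )
        n.append( str(lo) )
    # The last byte holds the final digit and the sign.
    digit, sign = divmod( data[-1], 16 )
    n.append( str(digit) )
    if sign in (0x0b, 0x0d ):
        n[0]= '-'
    else:
        n[0]= '+'
    # A signed digit string; the decimal point has not been inserted.
    return ''.join( n )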

Given these two functions, we can expand our handy record layout structure.

layout = [
    ( 'field1', 0, 12, display ),
    ( 'field2', 12, 4, packed ),
    ( 'anotherField', 16, 20, display ),
    ( 'lastField', 36, 8, packed ),
]
reclen= 44

This changes our record decoding to the following.

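# open() replaces the older file() built-in; 'rb' avoids newline translation.
with open( 'my.cobol.file', 'rb' ) as cobolFile:
    for recBytes in yieldRecords(cobolFile, reclen):
        record = dict()
        for name, start, size, convert in layout:
            # Slice by the field's size, then apply that field's conversion.
            record[name]= convert( recBytes[start:start+size] )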

This example underscores some of the key values of Python. Simple things can be kept simple. The layout structure, which describes the data, is both easy to read and written in Python itself. The evolution of this example shows how adding a sophisticated feature can be done simply and cleanly.

At some point, our record layout will have to evolve from a simple tuple to a proper class definition. We'll need to take this evolutionary step when we want to convert packed decimal numbers into values that we can use for further processing.
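One possible shape for that evolutionary step is sketched below. It is an assumption for illustration, not necessarily the design used elsewhere in this book: the class names Field, Display and Packed, the scale parameter (the number of implied decimal places) and the sample record are all hypothetical.

```python
import codecs
from decimal import Decimal

class Field(object):
    """Hypothetical base class: one named slice of a fixed-layout record."""
    def __init__(self, name, start, size):
        self.name, self.start, self.size = name, start, size
    def extract(self, recBytes):
        return self.convert(recBytes[self.start:self.start+self.size])

class Display(Field):
    """Character data, assumed to be EBCDIC in code page cp037."""
    def convert(self, data):
        text, consumed = codecs.getdecoder('cp037')(data)
        return text

class Packed(Field):
    """Packed decimal data with an assumed number of decimal places."""
    def __init__(self, name, start, size, scale=0):
        Field.__init__(self, name, start, size)
        self.scale = scale
    def convert(self, data):
        data = bytearray(data)  # yields integers when iterated
        digits = []
        for b in data[:-1]:
            digits.extend(divmod(b, 16))  # two digits per byte
        digit, sign = divmod(data[-1], 16)  # last digit and sign
        digits.append(digit)
        # Shift the implied decimal point left by self.scale places.
        value = Decimal(''.join(str(d) for d in digits)).scaleb(-self.scale)
        return -value if sign in (0x0b, 0x0d) else value

layout = [Display('name', 0, 5), Packed('amount', 5, 3, scale=2)]

# Sample record: "Hello" in cp037, then packed digits 12345 with a + sign.
recBytes = b'\xc8\x85\x93\x93\x96' + b'\x12\x34\x5c'
record = dict((f.name, f.extract(recBytes)) for f in layout)
```

Here each field object carries its own conversion, so the record-decoding loop no longer needs to unpack a tuple, and the scale travels with the field definition rather than being applied somewhere else.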

