Chapter 34. File Formats: CSV, Tab, XML, Logs and Others

We looked at general features of the file system in Chapter 19, Files. In this chapter we'll look at Python techniques for handling files in a few of the innumerable formats that are in common use. Most file formats are relatively easy to handle with Python techniques we've already seen. Comma-Separated Values (CSV) files, XML files and packed binary files, however, are a little more sophisticated.

This is only the tip of the iceberg in the far larger problem called “persistence”. In addition to simple file system persistence, we also have the possibility of object persistence using an object database. In this case, the database processing lies between our program and the file system on which the database resides. This area also includes object-relational mapping, where our program relies on a mapper; the mapper uses the database, and the database manages the file system. We can't explore the whole persistence problem in this chapter.

In this chapter we'll present a conceptual overview of the various approaches to reading and writing files in the section called “Overview”. We'll look at reading and writing CSV files in the section called “Comma-Separated Values: The csv Module”, and tab-delimited files in the section called “Tab Files: Nothing Special”. We'll look at reading property files in the section called “Property Files and Configuration (or .INI) Files: The ConfigParser Module”. We'll look at the subtleties of processing legacy COBOL files in the section called “Fixed Format Files, A COBOL Legacy: The codecs Module”. We'll cover the basics of reading XML files in the section called “XML Files: The xml.minidom and xml.sax Modules”.

Most programs need a way to write sophisticated, easy-to-control log files that contain status and debugging information. For simple one-page programs, the print statement is fine. As soon as we have multiple modules, where we need more sophisticated debugging, we find a need for the logging module. Of course, any program that requires careful auditing will benefit from the logging module. We'll look at creating standard logs in the section called “Log Files: The logging Module”.

Overview

When we introduced the concept of a file we mentioned that we could look at a file on two levels.

  • A file is a sequence of bytes. This is the OS's view of files, as it is the lowest common denominator.

  • A file is a sequence of data objects, represented as sequences of bytes.

A file format is the processing rules required to translate between usable Python objects and sequences of bytes. People have invented innumerable distinct file formats. We'll look at some techniques which should cover most of the bases.
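To make the idea of "processing rules" concrete, here is a minimal sketch, assuming Python 3. The format itself (one name=value pair per line, encoded as UTF-8) and the function names objects_to_bytes and bytes_to_objects are invented for illustration; they are not part of any standard library.

```python
# Hypothetical format rule: one "name=value" pair per line, UTF-8 encoded.

def objects_to_bytes(pairs):
    """Translate a list of (name, int) tuples into a sequence of bytes."""
    text = "".join("%s=%d\n" % (name, value) for name, value in pairs)
    return text.encode("utf-8")

def bytes_to_objects(data):
    """Translate a sequence of bytes back into a list of (name, int) tuples."""
    pairs = []
    for line in data.decode("utf-8").splitlines():
        name, _, value = line.partition("=")
        pairs.append((name, int(value)))
    return pairs

readings = [("width", 640), ("height", 480)]   # sample data, invented
raw = objects_to_bytes(readings)               # the "sequence of bytes" view
restored = bytes_to_objects(raw)               # the "data objects" view
```

The two functions together are the file format: one direction produces bytes that could be written to a file, the other recovers usable Python objects from them.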

We'll look at three broad families of files: text, binary and pickled objects. Each has some advantages and processing complexities.

  • Text files are designed so that a person can easily read and write them. We'll look at several common text file formats, including CSV, XML, Tab-delimited, property-format, and fixed position. Since text files are intended for human consumption, they are difficult to update in place.

  • Binary files are designed to optimize processing speed or the overall size of the file. Most databases use very complex binary file formats for speed. A JPEG file, on the other hand, uses a binary format to minimize the size of the file. A binary-format file will typically place data at known offsets, making it possible to do direct access to any particular byte using the seek method of a Python file object.

  • Pickled Objects are produced by Python's pickle or shelve modules. There are several pickle protocols available, including text and binary alternatives. More importantly, a pickled file is not designed to be seen by people, nor have we spent any design effort optimizing performance or size. In a sense, a pickled object requires the least design effort.
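The three families can be compared side by side. The following is a rough sketch, assuming Python 3; the sample records and the 12-byte record layout are invented for illustration, and an in-memory io.BytesIO object stands in for a real binary file opened on disk.

```python
import io
import pickle
import struct

# Sample (id, amount) records, invented for illustration.
records = [(1, 2.5), (2, 7.25), (3, 0.125)]

# Text: tab-delimited lines that a person can read and edit.
text_form = "".join("%d\t%s\n" % (rid, amount) for rid, amount in records)

# Binary: fixed-size records, so every record sits at a known offset.
# "<id" means little-endian, a 4-byte int followed by an 8-byte float.
RECORD = struct.Struct("<id")
binary_file = io.BytesIO()          # stands in for open(name, "rb+")
for rec in records:
    binary_file.write(RECORD.pack(*rec))

# Direct access: jump straight to the third record with seek.
binary_file.seek(2 * RECORD.size)
third = RECORD.unpack(binary_file.read(RECORD.size))

# Pickle: no format design at all; Python chooses the byte layout.
pickled = pickle.dumps(records)
unpickled = pickle.loads(pickled)
```

Only the binary form allows the third record to be read without scanning the earlier ones. The pickled form needed the least code, at the cost of being readable only by Python.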

Published under the terms of the Open Publication License