The Art of Unix Programming


Unix Programming - Chapter 5. Textuality

It's a well-known fact that computing devices such as the abacus were invented thousands of years ago. But it's not well known that the first use of a common computer protocol occurred in the Old Testament. This, of course, was when Moses aborted the Egyptians' process with a control-sea.

-- Tom Galloway rec.arts.comics, February 1992

In this chapter, we'll look at what the Unix tradition has to tell us about two different kinds of design that are closely related: the design of file formats for retaining application data in permanent storage, and the design of application protocols for passing data and commands between cooperating programs, possibly over a network.

What unifies these two kinds of design is that they both involve the serialization of in-memory data structures. For the internal operation of computer programs, the most convenient representation of a complex data structure is one in which all fields have the machine's native data format (e.g. two's-complement binary for integers) and all pointers are actual memory addresses (as opposed, say, to being named references). But these representations are not well suited to storage and transmission; memory addresses in the data structure lose their meaning outside memory, and emitting raw native data formats causes interoperability problems passing data between machines with different conventions (big- vs. little-endian, say, or 32-bit vs. 64-bit).
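The byte-order problem can be seen in a few lines. The following is a minimal sketch using Python's standard `struct` module; the value 1025 is an arbitrary choice for illustration.

```python
import struct

value = 1025  # 0x00000401, an arbitrary example value

# The same 32-bit integer serialized under two byte-order conventions.
big = struct.pack(">I", value)     # big-endian
little = struct.pack("<I", value)  # little-endian

# A reader that assumes the wrong convention recovers a different number.
misread = struct.unpack("<I", big)[0]

# A decimal text rendering does not depend on byte order or word size.
text = str(value).encode("ascii")
recovered = int(text.decode("ascii"))

print(big, little, misread, recovered)
```

The two packed forms contain the same four bytes in opposite order, which is why raw dumps of native data only round-trip safely between machines that share conventions.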

For transmission and storage, the traversable, quasi-spatial layout of data structures like linked lists needs to be flattened or serialized into a byte-stream representation from which the structure can later be recovered. The serialization (save) operation is sometimes called marshaling and its inverse (load) operation unmarshaling. These terms are usually applied with respect to objects in an OO language like C++ or Python or Java, but could be used with equal justice of operations like loading a graphics file into the internal storage of a graphics editor and saving it out after modifications.
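As an illustration of flattening and recovery, here is a small sketch that serializes a singly linked list of integers to one decimal value per line and rebuilds it. The `Node` class and the one-value-per-line format are assumptions made for this example, not a format described in the text.

```python
class Node:
    """One cell of a singly linked list (hypothetical example type)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def serialize(head):
    """Flatten the list into text: one integer per line, in list order."""
    parts = []
    node = head
    while node is not None:
        parts.append(f"{node.value}\n")
        node = node.next
    return "".join(parts)


def deserialize(text):
    """Rebuild the list; the pointers are recreated, not stored."""
    head = None
    tail = None
    for line in text.splitlines():
        node = Node(int(line))
        if head is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


original = Node(1, Node(2, Node(3)))
flat = serialize(original)
restored = deserialize(flat)
```

Note that the memory addresses never appear in the serialized form; the ordering of the lines carries the structure that the pointers carried in memory.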

A significant percentage of what C and C++ programmers maintain is ad-hoc code for marshaling and unmarshaling operations — even when the serialized representation chosen is as simple as a binary structure dump (a common technique under non-Unix environments). Modern languages like Python and Java tend to have built-in unmarshal and marshal functions that can be applied to any object or byte-stream representing an object, and that reduce this labor substantially.
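In Python, one such built-in facility is the standard `pickle` module. The sketch below round-trips an arbitrary object through it; the `record` contents are made up for illustration.

```python
import pickle

# A hypothetical in-memory structure.
record = {"name": "example", "ports": [80, 443], "enabled": True}

# Marshal: object -> byte stream.
blob = pickle.dumps(record)

# Unmarshal: byte stream -> an equal (but distinct) object.
restored = pickle.loads(blob)

print(type(blob).__name__, restored == record, restored is record)
```

The labor saved is real, but the resulting byte stream is a binary format meant to be read back by Python's own loader rather than by a person or by other tools.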

But these naïve methods are often unsatisfactory for various reasons, including both the machine-interoperability problems we mentioned above and opacity to other tools. When the application is a network protocol, economy may demand that an internal data structure (such as, say, a message with source and destination addresses) be serialized not into a single blob of data but into a series of attempted transactions or messages which the receiving machine may reject (so that, for example, a large message can be rejected if the destination address is invalid).
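The transaction idea can be sketched without any networking. In the toy exchange below, the sender offers the destination first and transmits the body only if that step is accepted. The `TO` and `BODY` verbs, the `Receiver` class, and the addresses are all hypothetical, invented for this illustration rather than taken from any real protocol.

```python
KNOWN_RECIPIENTS = {"alice@example.com"}  # hypothetical valid destinations


class Receiver:
    """Toy receiving side that can reject each step of the exchange."""

    def __init__(self):
        self.accepted = False
        self.bytes_received = 0

    def handle(self, line):
        verb, _, arg = line.partition(" ")
        if verb == "TO":
            self.accepted = arg in KNOWN_RECIPIENTS
            return "OK" if self.accepted else "REJECT unknown recipient"
        if verb == "BODY":
            if not self.accepted:
                return "REJECT no valid recipient"
            self.bytes_received += len(arg)
            return "OK"
        return "REJECT unknown command"


def send(receiver, to, body):
    """Offer the address first; send the body only if it was accepted."""
    reply = receiver.handle("TO " + to)
    if not reply.startswith("OK"):
        return reply  # stop before transmitting the (possibly large) body
    return receiver.handle("BODY " + body)


r = Receiver()
result = send(r, "bob@example.com", "x" * 1_000_000)
print(result, r.bytes_received)
```

Because the address is checked in its own step, the rejected million-character body never crosses the wire, which is the economy that a single serialized blob cannot offer.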

Interoperability, transparency, extensibility, and storage or transaction economy: these are the important themes in designing file formats and application protocols. Interoperability and transparency demand that we focus such designs on clean data representations, rather than putting convenience of implementation or highest possible performance first. Extensibility also favors textual protocols, since binary ones are often harder to extend or subset cleanly. Transaction economy sometimes pushes in the opposite direction — but we shall see that putting that criterion first is a form of premature optimization that it is often wise to resist.
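One way to see why textual formats tend to extend cleanly is a reader that simply skips fields it does not recognize. The sketch below assumes a made-up `key: value` line format and field names; it is an illustration, not a format defined in this chapter.

```python
def parse_record(text, known=("name", "size")):
    """Parse 'key: value' lines, keeping only the fields this reader knows."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line; skip it
        key = key.strip()
        if key in known:
            record[key] = value.strip()
    return record


old_file = "name: notes.txt\nsize: 10\n"
new_file = "name: notes.txt\nowner: root\nsize: 10\n"  # later version adds a field

old_result = parse_record(old_file)
new_result = parse_record(new_file)
```

An older reader handles the newer file unchanged, whereas inserting a field into a fixed-layout binary record would shift every offset after it.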

Finally, we must note a difference between data file formats and the run-control files that are often used to set the startup options of Unix programs. The most basic difference is that (with sporadic exceptions like GNU Emacs's configuration interface) programs don't normally modify their own run-control files — the information flow is one-way, from file read at startup time to application settings. Data-file formats, on the other hand, associate properties with named resources and are both read and written by their applications. Configuration files are generally hand-edited and small, whereas data files are program-generated and can become arbitrarily large.

Historically, Unix has had related but different sets of conventions for these two kinds of representation. The conventions for run-control files are surveyed in Chapter 10; only conventions for data files are examined in this chapter.


Published under free license.