The File Archive Modules: tarfile and zipfile

An archive file contains a complex, hierarchical file directory in a single sequential file. The archive file includes the original directory information as well as the contents of all of the files in those directories. There are a number of archive file formats; Python directly supports two: tar and zip archives.

The tar (Tape Archive) format is widely used in the GNU/Linux world to distribute files. It is a POSIX standard, making it usable on a wide variety of operating systems. A tar file can also be compressed, often with the GZip utility, leading to .tgz or .tar.gz files which are compressed archives.

The Zip file format was invented by Phil Katz at PKWare as a way to archive a complex, hierarchical file directory into a compact sequential file. The Zip format is widely used but is not a POSIX standard. Zip file processing includes a choice of compression algorithms; the exact algorithm used is encoded in the header of the file, not in the name of the file.

Creating a TarFile or a ZipFile. Since an archive file is still, essentially, a file, it is opened with a variation on the open function. Since an archive file contains both directory information and file contents, it has a number of methods above and beyond what a simple file has.

tarfile.open ( 〈 name 〉 , 〈 mode 〉 , 〈 fileobj 〉 , 〈 bufsize 〉 ) → TarFile

This module-level function opens the given tar file for processing. The name is a file name string; it is optional because the fileobj can be used instead. The mode is similar to the built-in open (or file) function; it has additional characters to specify the compression algorithm, if any. The fileobj is a conventional file object, which can be used instead of the name; it can be a standard file like sys.stdin. The bufsize is a buffer size, similar to the buffering argument of the built-in open function.

zipfile.ZipFile ( name , 〈 mode 〉 , 〈 compression 〉 ) → ZipFile

This class constructor opens the given zip file for processing. The name is a file name string. The mode is similar to the built-in open (or file) function. The compression is the compression code. It can be zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED. A compression of ZIP_STORED uses no compression; a value of ZIP_DEFLATED uses the Zlib compression algorithms.

The open function can be used to read or write the archive file. It can be used to process a simple disk file, using the filename. Or, more importantly, it can be used to process a non-disk file: this includes tape devices and network sockets. In the non-disk case, a file object is given to tarfile.open.
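As a minimal sketch of the non-disk case, the following builds a tar archive in an in-memory buffer and reads it back, passing the buffer through fileobj both times. The io.BytesIO buffer stands in for a socket or tape device, and the member name greeting.txt is a hypothetical choice made for illustration.

```python
import io
import tarfile

# Build a small tar archive entirely in memory (no disk file involved).
buffer = io.BytesIO()
payload = b"hello, archive\n"
with tarfile.open(fileobj=buffer, mode="w") as archive:
    info = tarfile.TarInfo(name="greeting.txt")  # hypothetical member name
    info.size = len(payload)                     # size must match the data
    archive.addfile(info, io.BytesIO(payload))

# Rewind and reopen the same bytes for reading, again through fileobj.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    names = archive.getnames()
    content = archive.extractfile("greeting.txt").read()

print(names)
print(content)
```

Closing the TarFile does not close a file object that was handed in through fileobj, which is why the buffer can be rewound and reused here.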

For tar files, the mode information is rather complex because we can do more than simply read, write and append. The mode string addresses three issues: the kind of opening (reading, writing, appending), the kind of access (block or stream) and the kind of compression.

For zip files, however, the mode is simply the kind of opening that is done.

Opening - Both zip and tar files. A zip or tar file can be opened in any of three modes.

r

Open the file for reading.

w

Open the file for writing.

a

Open the file for appending.

Access - tar files only. A tar file can have either of two fundamentally different kinds of access. If a tar file is a disk file, which supports seek and tell operations, then we access the tar file in block mode. If the tar file is a stream, network connection or a pipeline, which does not support seek or tell operations, then we must access the archive in stream mode.

:

Block mode. The tar file is a disk file, and seek and tell operations are supported. This is the assumed default if neither : nor | is specified.

|

Stream mode. The tar file is a stream, socket or pipeline, and cannot respond to seek or tell operations. Note that you cannot append to a stream, so the 'a|' combination is illegal.

This access distinction isn't meaningful for zip files.

Compression - tar files only. A tar file may be compressed with GZip or BZip2 algorithms, or it may be uncompressed. Generally, you only need to select compression when writing. It doesn't make sense to attempt to select compression when appending to an existing file, or when reading a file.

(nothing)

The tar file will not be compressed.

gz

The tar file will be compressed with GZip.

bz2

The tar file will be compressed with BZip2.

This compression distinction isn't meaningful for zip files. Zip file compression is specified in the zipfile.ZipFile constructor.

Tar File Examples. The most common block modes for tar files are r, a, w:, w:gz, w:bz2. Note that read and append modes cannot meaningfully provide compression information, since it's obvious from the file if it was compressed, and which algorithm was used.

For stream modes, however, the compression information must be provided. The modes include all six combinations: r|, r|gz, r|bz2, w|, w|gz, w|bz2.
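The mode combinations can be tried out with a short sketch like the one below. It writes a GZip-compressed archive in block mode, reads it back with a plain r, and then reads it a third time in stream mode, where the compression has to be named. The temporary directory and the names sample.tar.gz and notes.txt are assumptions made for the example.

```python
import io
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample.tar.gz")  # hypothetical file name
payload = b"some text\n"

# Block mode, writing, GZip compression.
with tarfile.open(path, "w:gz") as archive:
    info = tarfile.TarInfo("notes.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Block mode, reading: a plain "r" works out the compression itself.
with tarfile.open(path, "r") as archive:
    block_names = archive.getnames()

# Stream mode, reading: the compression is spelled out in the mode.
with open(path, "rb") as raw:
    with tarfile.open(fileobj=raw, mode="r|gz") as archive:
        stream_names = [member.name for member in archive]

print(block_names, stream_names)
```

In stream mode the members are visited once, in order, so the sketch iterates over the archive instead of asking for a member by name.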

Directory Information. Each individual file in a tar archive is described with a TarInfo object. This has name, size, access mode, ownership and other OS information on the file. A number of methods will retrieve member information from an archive. In the following summaries, tf is a tar file, created with tarfile.open.

tf. getmember ( name ) → TarInfo

Reads through the archive index looking for the given member name . Returns a TarInfo object for the named member, or raises a KeyError exception.

tf. getmembers ( ) → list of TarInfo

Returns a list of TarInfo objects for all of the members in the archive.

tf. next ( ) → TarInfo

Returns a TarInfo object for the next member of the archive, or None when there are no more members.

tf. getnames ( ) → list of strings

Returns a list of member names.
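A small sketch of these query methods follows. It builds a two-member archive in memory so that it can run anywhere; the member names a.txt and docs/b.txt are hypothetical.

```python
import io
import tarfile

# Create a throwaway archive with two members to query.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tf:
    for name, data in [("a.txt", b"alpha"), ("docs/b.txt", b"bravo!")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tf:
    names = tf.getnames()                               # list of strings
    sizes = {m.name: m.size for m in tf.getmembers()}   # TarInfo attributes
    one = tf.getmember("docs/b.txt")                    # a single TarInfo
    try:
        tf.getmember("missing.txt")
        found_missing = True
    except KeyError:
        # An unknown member name raises KeyError.
        found_missing = False

print(names, sizes, one.name, found_missing)
```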

Each individual file in a zip archive is described with a ZipInfo object. This has name, size, access mode, ownership and other OS information on the file. A number of methods will retrieve member information from an archive. In the following summaries, zf is a zip file, created with zipfile.ZipFile.

zf. getinfo ( name ) → ZipInfo

Locates information about the given member name . Returns a ZipInfo object for the named member, or raises a KeyError exception.

zf. infolist ( ) → list of ZipInfo

Returns a list of ZipInfo objects for all of the members in the archive.

zf. namelist ( ) → list of strings

Returns a list of member names.
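The zip versions of these queries can be sketched the same way. The archive is again built in memory, and the member names are hypothetical. The attributes filename, file_size and compress_type are read from each ZipInfo.

```python
import io
import zipfile

# Create a throwaway zip archive with two members to query.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("docs/b.txt", b"bravo!")

with zipfile.ZipFile(buffer, "r") as zf:
    names = zf.namelist()                                       # list of strings
    sizes = {zi.filename: zi.file_size for zi in zf.infolist()}  # ZipInfo attributes
    one = zf.getinfo("docs/b.txt")                              # a single ZipInfo
    try:
        zf.getinfo("missing.txt")
        found_missing = True
    except KeyError:
        # An unknown member name raises KeyError.
        found_missing = False

print(names, sizes, one.filename, found_missing)
```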

Extracting Files From an Archive. If a tar archive is opened with r, then you can read the archive and extract files from it. The following methods will extract member files from an archive. In these summaries, tf is a tar file, created with tarfile.open.

tf. extract ( member , 〈 path 〉)

The member can be either a string member name or a TarInfo for a member. This will extract the file's contents and reconstruct the original file. If path is given, this is the new location for the file.

tf. extractfile ( member ) → file

The member can be either a string member name or a TarInfo for a member. This will open a simple file for access to this member's contents. The member access file has only read-oriented methods, limited to read, readline, readlines, seek, tell.
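The difference between the two methods can be seen in a short sketch: extractfile hands back a readable object without writing anything to disk, while extract rebuilds the member as a real file. The in-memory archive, the temporary target directory and the member name README.txt are assumptions made for the example.

```python
import io
import os
import tarfile
import tempfile

# Build a one-member archive in memory.
payload = b"line one\nline two\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tf:
    info = tarfile.TarInfo("README.txt")  # hypothetical member name
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

buffer.seek(0)
target = tempfile.mkdtemp()
with tarfile.open(fileobj=buffer, mode="r") as tf:
    # extractfile: read the member without touching the filesystem.
    first_line = tf.extractfile("README.txt").readline()
    # extract: reconstruct the member as a real file under target.
    tf.extract("README.txt", path=target)

with open(os.path.join(target, "README.txt"), "rb") as restored:
    restored_bytes = restored.read()

print(first_line)
print(restored_bytes)
```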

If a zip archive is opened with r, then you can read the archive and extract the contents of a file from it. In these summaries, zf is a zip file, created with zipfile.ZipFile.

zf. read ( member ) → string

The member is a string member name. This will extract the member's contents, decompress them if necessary, and return the bytes that constitute the member.
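A minimal sketch of reading a zip member follows. The archive is created in memory first, and the member name config/settings.ini is hypothetical. Since read returns bytes, the sketch decodes them before treating them as text.

```python
import io
import zipfile

# Build a one-member zip archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("config/settings.ini", b"[main]\nverbose = yes\n")

# Reopen for reading and pull out the member's decompressed bytes.
with zipfile.ZipFile(buffer, "r") as zf:
    raw = zf.read("config/settings.ini")

text = raw.decode("ascii")
print(text)
```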

Creating or Extending an Archive. If a tar archive is opened with w or a, then you can add files to it. The following methods will add member files to an archive. In the following summaries, tf is a tar file, created with tarfile.open.

tf. add ( name , 〈 arcname 〉 , 〈 recursive 〉 )

Adds the file with the given name to the current archive file. If arcname is provided, this is the name the file will have in the archive; this allows you to build an archive which doesn't reflect the source structure. Generally, directories are expanded; using recursive=False prevents expanding directories.

tf. addfile ( tarinfo , fileobj )

Creates an entry in the archive. The description comes from the tarinfo, an instance of TarInfo, created with the gettarinfo method. The fileobj is an open file, from which the content is read; it should be opened in binary mode. Note that the TarInfo.size field can override the actual size of the file. For a given filename, fn, this might look like the following: tf.addfile( tf.gettarinfo(fn), open(fn,"rb") ).

tf. close ( )

Closes the archive. For archives being written or appended, this adds the block of zeroes that defines the end of the file.

tf. gettarinfo ( name , arcname , fileobj ) → TarInfo

Creates a TarInfo object for a file based either on name , or the fileobj . If a name is given, this is a local filename. The arcname is the name that will be used in the archive, allowing you to modify local filesystem names. If the fileobj is given, this file is interrogated to gather required information.
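The two ways of adding a member can be compared in a short sketch. It writes a small source file into a temporary directory, adds it once with add and once with the gettarinfo and addfile pair, then lists the result. The file names report.txt and backup.tar and the archive paths under reports/ are hypothetical.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "report.txt")  # hypothetical source file
with open(source, "wb") as out:
    out.write(b"quarterly numbers\n")

archive_path = os.path.join(workdir, "backup.tar")
with tarfile.open(archive_path, "w") as tf:
    # add: the simplest form; arcname hides the temporary directory path.
    tf.add(source, arcname="reports/report.txt")
    # gettarinfo + addfile: the same file, described and written by hand.
    info = tf.gettarinfo(source, arcname="reports/copy.txt")
    with open(source, "rb") as fileobj:
        tf.addfile(info, fileobj)

with tarfile.open(archive_path, "r") as tf:
    names = tf.getnames()
    copy_bytes = tf.extractfile("reports/copy.txt").read()

print(names)
```

The second form is more work, but it leaves room to adjust the TarInfo (for example its name or ownership fields) before the member is written.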

If a zip archive is opened with w or a, then you can add files to it. The following methods will add member files to an archive. In the following summaries, zf is a zip file, created with zipfile.ZipFile.

zf. write ( filename , 〈 arcname 〉 , 〈 compress_type 〉 )

The filename is a string file name. This will read the file, compress it, and write it to the archive. If the arcname is given, this will be the name in the archive; otherwise it will use the original filename. The compress_type parameter overrides the default compression specified when the ZipFile was created.

zf. writestr ( arcname , bytes )

The arcname is a string file name or a ZipInfo object that will be used to create a new member in the archive. This will write the given bytes to the archive. The compression used is specified when the ZipFile is created.
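As a sketch of both methods, the following creates an archive in a temporary directory, adds one disk file twice (once with the archive's default compression and once stored uncompressed), adds a member from bytes in memory, and then reopens the archive in append mode. All file and member names are hypothetical.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "chapter1.xml")  # hypothetical source file
with open(source, "wb") as out:
    out.write(b"<chapter>" + b"text " * 200 + b"</chapter>")

archive_path = os.path.join(workdir, "book.zip")
with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
    # write: copy a disk file in; arcname drops the temporary path.
    zf.write(source, arcname="chapter1.xml")
    # An explicit compress_type overrides the archive-wide default.
    zf.write(source, arcname="stored/chapter1.xml",
             compress_type=zipfile.ZIP_STORED)
    # writestr: create a member directly from bytes in memory.
    zf.writestr("MANIFEST.txt", b"chapter1.xml\n")

# Mode "a" reopens the archive and adds to the existing members.
with zipfile.ZipFile(archive_path, "a", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("NOTES.txt", b"added later\n")

with zipfile.ZipFile(archive_path, "r") as zf:
    names = zf.namelist()
    deflated = zf.getinfo("chapter1.xml")
    stored = zf.getinfo("stored/chapter1.xml")

print(names)
print(deflated.compress_size, stored.compress_size)
```

Comparing compress_size with file_size on the two ZipInfo objects shows the effect of the per-member override: the stored copy keeps its full size, while the deflated copy shrinks.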

A tarfile Example. Here's an example of a program to examine a tarfile, looking for documentation like .html files or README files. It will provide a list of .html files, and actually show the contents of the README files.

Example 33.2. readtar.py

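#!/usr/bin/env python
"""Scan a tarfile looking for *.html and a README."""
import tarfile
import fnmatch

archive = tarfile.open("SQLAlchemy-0.3.5.tar.gz", "r")
for mem in archive.getmembers():
    if fnmatch.fnmatch(mem.name, "*.html"):
        # Documentation page: just report its name.
        print(mem.name)
    elif fnmatch.fnmatch(mem.name.upper(), "*README*"):
        # README: report the name, then show the contents.
        print(mem.name)
        docFile = archive.extractfile(mem)
        if docFile is not None:  # directories have no file content
            print(docFile.read().decode("utf-8", "replace"))
archive.close()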

A zipfile Example. Here's an example of a program to create a zipfile based on the .xml files in a particular directory.

Example 33.3. writezip.py

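import zipfile, os, fnmatch

# Collect every .xml file from the parent directory into book.zip.
bookDistro = zipfile.ZipFile('book.zip', 'w', zipfile.ZIP_DEFLATED)
for nm in os.listdir('..'):
    if fnmatch.fnmatch(nm, '*.xml'):
        full = os.path.join('..', nm)
        bookDistro.write(full)
# Closing writes the central directory that completes the archive.
bookDistro.close()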


 
 
Published under the terms of the Open Publication License