8.2.2 Archiving Sparse Files

-S
--sparse
    Handle sparse files efficiently.

This option causes every file that is about to be put in the archive to be tested for sparseness, and handled specially if it is sparse. The --sparse (-S) option is useful when many dbm files, for example, are being backed up. Using this option dramatically decreases the amount of space needed to store such a file.

In later versions, this option may be removed, and the testing and treatment of sparse files may be done automatically with any special GNU options. For now, it must be specified on the command line when creating or updating an archive.

Files in the file system occasionally have “holes.” A hole in a file is a section of the file's contents which was never written. The contents of a hole read as all zeros. On many operating systems, actual disk storage is not allocated for holes, but they are counted in the length of the file. If you archive such a file, tar could create an archive longer than the original, that is, one that occupies more space than the file does on disk.

To have tar attempt to recognize the holes in a file, use --sparse (-S). With this option, for any file that uses less disk space than its length would suggest, tar searches the file for consecutive stretches of zeros. It then records in the archive where those stretches are, and archives only the “real contents” of the file. On extraction (--sparse is not needed on extraction), any such file has holes created wherever the consecutive stretches of zeros were found. Thus, if you use --sparse, tar archives won't take more space than the original.
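The search-and-record step described above can be illustrated with a minimal Python sketch. This is not GNU tar's actual algorithm or archive format: the names pack_sparse and unpack_sparse, the 512-byte scan granularity, and the in-memory list of (offset, data) pairs are all assumptions made for illustration. The sketch reads the whole file, keeps only the blocks that are not all zeros, and on restore seeks over the gaps instead of writing zeros.

```python
BLOCK = 512  # scan granularity; an assumption chosen for this sketch


def pack_sparse(path, block=BLOCK):
    """Read the whole file, keeping only the blocks that are not all zeros.

    Returns (segments, size): segments is a list of (offset, data) pairs,
    and size is the full length of the file, holes included.
    """
    segments = []
    zeros = bytes(block)
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            if chunk != zeros[:len(chunk)]:
                if segments and segments[-1][0] + len(segments[-1][1]) == offset:
                    # Contiguous with the previous data run: extend it.
                    prev_offset, prev_data = segments[-1]
                    segments[-1] = (prev_offset, prev_data + chunk)
                else:
                    segments.append((offset, chunk))
            offset += len(chunk)
    return segments, offset


def unpack_sparse(path, segments, size):
    """Write only the recorded data, seeking over the gaps between runs.

    Whether the skipped ranges become real holes on disk depends on the
    file system; their contents read back as zeros either way.
    """
    with open(path, "wb") as f:
        for offset, data in segments:
            f.seek(offset)
            f.write(data)
        # Extend to the full length so a trailing run of zeros is restored.
        f.truncate(size)
```

Calling pack_sparse on a file and passing its result to unpack_sparse with a new path should yield a file with identical contents. Note that pack_sparse still reads every byte of the input, holes included.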

A file is sparse if it contains blocks of zeros whose existence is recorded, but that have no space allocated on disk. When you specify the --sparse option in conjunction with the --create (-c) operation, tar tests all files for sparseness while archiving. If tar finds a file to be sparse, it uses a sparse representation of the file in the archive. See create, for more information about creating archives.
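As a rough illustration of such a sparseness test, the following Python sketch compares a file's length with the space allocated for it. It is a heuristic under stated assumptions, not the check GNU tar performs: the function names are hypothetical, st_blocks is assumed to count 512-byte units (as it does on Linux), and the field is not available on every platform. File systems that store small files inline or that compress data can make the heuristic misfire.

```python
import os


def looks_sparse(st_size, st_blocks):
    """Return True if fewer bytes are allocated than the file length."""
    # Assumption: st_blocks counts 512-byte units, as on Linux.
    return st_blocks * 512 < st_size


def file_looks_sparse(path):
    """Apply the heuristic to a file on disk."""
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)
```

For example, a 400-megabyte file with only about 3 megabytes allocated would be flagged, while a fully allocated file would not.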

--sparse is useful when archiving files, such as dbm files, likely to contain many nulls. This option dramatically decreases the amount of space needed to store such an archive.

Please Note: Always use --sparse when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.

Even if your system has no sparse files currently, some may be created in the future. If you use --sparse while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes).
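A backup script that follows this advice might build its tar command as in the sketch below. The helper names and the example paths are hypothetical; --create and --sparse are the options discussed in this section, and --file names the archive to write. The sketch assumes a GNU tar executable named tar is on the PATH.

```python
import subprocess


def sparse_backup_command(archive, paths):
    """Build a tar command line that always passes --sparse."""
    return ["tar", "--create", "--sparse", "--file", archive, *paths]


def make_sparse_backup(archive, paths):
    """Run the backup, raising CalledProcessError if tar fails."""
    subprocess.run(sparse_backup_command(archive, paths), check=True)
```

For instance, make_sparse_backup("/backups/home.tar", ["/home"]) would archive /home with sparse files stored sparsely, assuming both paths exist and are accessible.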

tar ignores the --sparse option when reading an archive.

Files stored sparsely in the file system are represented sparsely in the archive. Use this option in conjunction with write operations.

However, users should be well aware that at archive creation time, GNU tar still has to read the whole disk file to locate the holes. So, even if sparse files use little space on disk and in the archive, they may sometimes require an inordinate amount of time for reading and examining the all-zero blocks of a file. Although it works, it's painfully slow for a large (sparse) file, even though the resulting tar archive may be small. (One user reports that dumping a core file of over 400 megabytes, but with only about 3 megabytes of actual data, took about 9 minutes on a Sun Sparcstation ELC, with full CPU utilization.)

This reading is required in all cases, whether or not the --sparse option is used, so merely omitting the option does not save time [1].

Programs like dump do not have to read the entire file; by examining the file system directly, they can determine in advance exactly where the holes are and thus avoid reading through them. The only data they need to read are the actual allocated data blocks. GNU tar uses a more portable and straightforward archiving approach, and it would be fairly difficult for it to do otherwise. Elizabeth Zwicky writes to comp.unix.internals, on 1990-12-10:

What I did say is that you cannot tell the difference between a hole and an equivalent number of nulls without reading raw blocks. st_blocks at best tells you how many holes there are; it doesn't tell you where. Just as programs may, conceivably, care what st_blocks is (care to name one that does?), they may also care where the holes are (I have no examples of this one either, but it's equally imaginable).

I conclude from this that good archivers are not portable. One can arguably conclude that if you want a portable program, you can in good conscience restore files with as many holes as possible, since you can't get it right.


[1] Well! We should say the whole truth, here. When --sparse is selected while creating an archive, the current tar algorithm requires sparse files to be read twice, not once. We hope to develop a new archive format for saving sparse files in which one pass will be sufficient.

Published under the terms of the GNU General Public License.