Advanced Bash-Scripting Guide:
Prev	Chapter 12. External Filters, Programs and Commands	Next

12.4. Text Processing Commands

Commands affecting text and text files

sort

File sorter, often used as a filter in a pipe. This command sorts a text stream or file forwards or backwards, or according to various keys or character positions. Using the -m option, it merges presorted input files. The info page lists its many capabilities and options. See Example 10-9, Example 10-10, and Example A-8.

tsort

Topological sort, reading in pairs of whitespace-separated strings and sorting according to input patterns.

uniq

This filter removes duplicate lines from a sorted file. It is often seen in a pipe coupled with sort.
cat list-1 list-2 list-3 | sort | uniq > final.list # Concatenates the list files, # sorts them, # removes duplicate lines, # and finally writes the result to an output file.

The useful -c option prefixes each line of the input file with its number of occurrences.

bash$ cat testfile This line occurs only once. This line occurs twice. This line occurs twice. This line occurs three times. This line occurs three times. This line occurs three times. bash$ uniq -c testfile 1 This line occurs only once. 2 This line occurs twice. 3 This line occurs three times. bash$ sort testfile | uniq -c | sort -nr 3 This line occurs three times. 2 This line occurs twice. 1 This line occurs only once.

The sort INPUTFILE | uniq -c | sort -nr command string produces a frequency of occurrence listing on the INPUTFILE file (the -nr options to sort cause a reverse numerical sort). This template finds use in analysis of log files and dictionary lists, and wherever the lexical structure of a document needs to be examined.

Example 12-11. Word Frequency Analysis

#!/bin/bash
# wf.sh: Crude word frequency analysis on a text file.
# This is a more efficient version of the "wf2.sh" script.


# Check for input file on command line.
ARGS=1
E_BADARGS=65
E_NOFILE=66

if [ $# -ne "$ARGS" ]  # Correct number of arguments passed to script?
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi

if [ ! -f "$1" ]       # Check if file exists.
then
  echo "File \"$1\" does not exist."
  exit $E_NOFILE
fi



########################################################
# main ()
sed -e 's/\.//g'  -e 's/\,//g' -e 's/ /\
/g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
#                           =========================
#                            Frequency of occurrence

#  Filter out periods and commas, and
#+ change space between words to linefeed,
#+ then shift characters to lowercase, and
#+ finally prefix occurrence count and sort numerically.

#  Arun Giridhar suggests modifying the above to:
#  . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
#  This adds a secondary sort key, so instances of
#+ equal occurrence are sorted alphabetically.
#  As he explains it:
#  "This is effectively a radix sort, first on the
#+ least significant column
#+ (word or string, optionally case-insensitive)
#+ and last on the most significant column (frequency)."
#
#  As Frank Wang explains, the above is equivalent to
#+       . . . | sort | uniq -c | sort +0 -nr
#+ and the following also works:
#+       . . . | sort | uniq -c | sort -k1nr -k
########################################################

exit 0

# Exercises:
# ---------
# 1) Add 'sed' commands to filter out other punctuation,
#+   such as semicolons.
# 2) Modify the script to also filter out multiple spaces and
#    other whitespace.

expand, unexpand

The expand filter converts tabs to spaces. It is often used in a pipe.

The unexpand filter converts spaces to tabs. This reverses the effect of expand.

cut

A tool for extracting fields from files. It is similar to the print $N command set in awk, but more limited. It may be simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.

Using cut to obtain a listing of the mounted filesystems:
cut -d ' ' -f1,2 /etc/mtab

Using cut to list the OS and kernel version:
uname -a | cut -d" " -f1,3,11,12

Using cut to extract message headers from an e-mail folder:
bash$ grep '^Subject:' read-messages | cut -c10-80 Re: Linux suitable for mission-critical apps? MAKE MILLIONS WORKING AT HOME!!! Spam complaint Re: Spam complaint

Using cut to parse a file:
# List all the users in /etc/passwd. FILENAME=/etc/passwd for user in $(cut -d: -f1 $FILENAME) do echo $user done # Thanks, Oleg Philon for suggesting this.

cut -d ' ' -f2,3 filename is equivalent to awk -F'[ ]' '{ print $2, $3 }' filename

It is even possible to specify a linefeed as a delimiter. The trick is to actually embed a linefeed (RETURN) in the command sequence.

bash$ cut -d' ' -f3,7,19 testfile This is line 3 of testfile. This is line 7 of testfile. This is line 19 of testfile.

Thank you, Jaka Kranjc, for pointing this out.

See also Example 12-35.

tail

lists the end of a file to stdout (the default is 10 lines). Commonly used to keep track of changes to a system logfile, using the -f option, which outputs lines appended to the file.

Example 12-14. Using tail to monitor the system log

#!/bin/bash

filename=sys.log

cat /dev/null > $filename; echo "Creating / cleaning out file."
#  Creates file if it does not already exist,
#+ and truncates it to zero length if it does.
#  : > filename   and   > filename also work.

tail /var/log/messages > $filename  
# /var/log/messages must have world read permission for this to work.

echo "$filename contains tail end of system log."

exit 0

To list a specific line of a text file, pipe the output of head to tail -1. For example head -8 database.txt | tail -1 lists the 8th line of the file database.txt.

To set a variable to a given block of a text file:
var=$(head -$m $filename | tail -$n) # filename = name of file # m = from beginning of file, number of lines to end of block # n = number of lines to set variable to (trim from end of block)

See also Example 12-5, Example 12-35 and Example 29-6.

grep

A multi-purpose file search tool that uses Regular Expressions. It was originally a command/filter in the venerable ed line editor: g/re/p -- global - regular expression - print.

grep pattern [file...]

Search the target file(s) for occurrences of pattern, where pattern may be literal text or a Regular Expression.

bash$ grep '[rst]ystem.$' osinfo.txt The GPL governs the distribution of the Linux operating system.

If no target file(s) specified, grep works as a filter on stdout, as in a pipe.

bash$ ps ax | grep clock 765 tty1 S 0:00 xclock 901 pts/1 S 0:00 grep clock

The -i option causes a case-insensitive search.

The -w option matches only whole words.

The -l option lists only the files in which matches were found, but not the matching lines.

The -r (recursive) option searches files in the current working directory and all subdirectories below it.

The -n option lists the matching lines, together with line numbers.

bash$ grep -n Linux osinfo.txt 2:This is a file containing information about Linux. 6:The GPL governs the distribution of the Linux operating system.

The -v (or --invert-match) option filters out matches.
grep pattern1 *.txt | grep -v pattern2 # Matches all lines in "*.txt" files containing "pattern1", # but ***not*** "pattern2".

The -c (--count) option gives a numerical count of matches, rather than actually listing the matches.
grep -c txt *.sgml # (number of occurrences of "txt" in "*.sgml" files) # grep -cz . # ^ dot # means count (-c) zero-separated (-z) items matching "." # that is, non-empty ones (containing at least 1 character). # printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz . # 3 printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$' # 5 printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^' # 5 # printf 'a b\nc d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$' # 9 # By default, newline chars (\n) separate items to match. # Note that the -z option is GNU "grep" specific. # Thanks, S.C.

When invoked with more than one target file given, grep specifies which file contains matches.

bash$ grep Linux osinfo.txt misc.txt osinfo.txt:This is a file containing information about Linux. osinfo.txt:The GPL governs the distribution of the Linux operating system. misc.txt:The Linux operating system is steadily gaining in popularity.

To force grep to show the filename when searching only one target file, simply give /dev/null as the second file.

bash$ grep Linux osinfo.txt /dev/null osinfo.txt:This is a file containing information about Linux. osinfo.txt:The GPL governs the distribution of the Linux operating system.

If there is a successful match, grep returns an exit status of 0, which makes it useful in a condition test in a script, especially in combination with the -q option to suppress output.
SUCCESS=0 # if grep lookup succeeds word=Linux filename=data.file grep -q "$word" "$filename" # The "-q" option causes nothing to echo to stdout. if [ $? -eq $SUCCESS ] # if grep -q "$word" "$filename" can replace lines 5 - 7. then echo "$word found in $filename" else echo "$word not found in $filename" fi

Example 29-6 demonstrates how to use grep to search for a word pattern in a system logfile.

Example 12-15. Emulating "grep" in a script

#!/bin/bash
# grp.sh: Very crude reimplementation of 'grep'.

E_BADARGS=65

if [ -z "$1" ]    # Check for argument to script.
then
  echo "Usage: `basename $0` pattern"
  exit $E_BADARGS
fi  

echo

for file in *     # Traverse all files in $PWD.
do
  output=$(sed -n /"$1"/p $file)  # Command substitution.

  if [ ! -z "$output" ]           # What happens if "$output" is not quoted?
  then
    echo -n "$file: "
    echo $output
  fi              #  sed -ne "/$1/s|^|${file}: |p"  is equivalent to above.

  echo
done  

echo

exit 0

# Exercises:
# ---------
# 1) Add newlines to output, if more than one match in any given file.
# 2) Add features.

How can grep search for two (or more) separate patterns? What if you want grep to display all lines in a file or files that contain both "pattern1" and "pattern2"?

One method is to pipe the result of grep pattern1 to grep pattern2.

For example, given the following file:

# Filename: tstfile This is a sample file. This is an ordinary text file. This file does not contain any unusual text. This file is not unusual. Here is some text.

Now, let's search this file for lines containing both "file" and "text" . . .

bash$ grep file tstfile
# Filename: tstfile
 This is a sample file.
 This is an ordinary text file.
 This file does not contain any unusual text.
 This file is not unusual.

bash$ grep file tstfile | grep text
This is an ordinary text file.
 This file does not contain any unusual text.

egrep - extended grep - is the same as grep -E. This uses a somewhat different, extended set of Regular Expressions, which can make the search a bit more flexible.

fgrep - fast grep - is the same as grep -F. It does a literal string search (no Regular Expressions), which usually speeds things up a bit.

On some Linux distros, egrep and fgrep are symbolic links to, or aliases for grep, but invoked with the -E and -F options, respectively.

Example 12-16. Looking up definitions in Webster's 1913 Dictionary

#!/bin/bash
# dict-lookup.sh

#  This script looks up definitions in the 1913 Webster's Dictionary.
#  This Public Domain dictionary is available for download
#+ from various sites, including
#+ Project Gutenberg (https://www.gutenberg.org/etext/247).
#
#  Convert it from DOS to UNIX format (only LF at end of line)
#+ before using it with this script.
#  Store the file in plain, uncompressed ASCII.
#  Set DEFAULT_DICTFILE variable below to path/filename.


E_BADARGS=65
MAXCONTEXTLINES=50                        # Maximum number of lines to show.
DEFAULT_DICTFILE="/usr/share/dict/webster1913-dict.txt"
                                          # Default dictionary file pathname.
                                          # Change this as necessary.
#  Note:
#  ----
#  This particular edition of the 1913 Webster's
#+ begins each entry with an uppercase letter
#+ (lowercase for the remaining characters).
#  Only the *very first line* of an entry begins this way,
#+ and that's why the search algorithm below works.



if [[ -z $(echo "$1" | sed -n '/^[A-Z]/p') ]]
#  Must at least specify word to look up, and
#+ it must start with an uppercase letter.
then
  echo "Usage: `basename $0` Word-to-define [dictionary-file]"
  echo
  echo "Note: Word to look up must start with capital letter,"
  echo "with the rest of the word in lowercase."
  echo "--------------------------------------------"
  echo "Examples: Abandon, Dictionary, Marking, etc."
  exit $E_BADARGS
fi


if [ -z "$2" ]                            #  May specify different dictionary
                                          #+ as an argument to this script.
then
  dictfile=$DEFAULT_DICTFILE
else
  dictfile="$2"
fi

# ---------------------------------------------------------
Definition=$(fgrep -A $MAXCONTEXTLINES "$1 \\" "$dictfile")
#                  Definitions in form "Word \..."
#
#  And, yes, "fgrep" is fast enough
#+ to search even a very large text file.


# Now, snip out just the definition block.

echo "$Definition" |
sed -n '1,/^[A-Z]/p' |
#  Print from first line of output
#+ to the first line of the next entry.
sed '$d' | sed '$d'
#  Delete last two lines of output
#+ (blank line and first line of next entry).
# ---------------------------------------------------------

exit 0

# Exercises:
# ---------
# 1)  Modify the script to accept any type of alphabetic input
#   + (uppercase, lowercase, mixed case), and convert it
#   + to an acceptable format for processing.
#
# 2)  Convert the script to a GUI application,
#   + using something like "gdialog" . . .
#     The script will then no longer take its argument(s)
#   + from the command line.
#
# 3)  Modify the script to parse one of the other available
#   + Public Domain Dictionaries, such as the U.S. Census Bureau Gazetteer.

agrep (approximate grep) extends the capabilities of grep to approximate matching. The search string may differ by a specified number of characters from the resulting matches. This utility is not part of the core Linux distribution.

To search compressed files, use zgrep, zegrep, or zfgrep. These also work on non-compressed files, though slower than plain grep, egrep, fgrep. They are handy for searching through a mixed set of files, some compressed, some not.

To search bzipped files, use bzgrep.

look

The command look works like grep, but does a lookup on a "dictionary", a sorted word list. By default, look searches for a match in /usr/dict/words, but a different dictionary file may be specified.

Example 12-17. Checking words in a list for validity

#!/bin/bash
# lookup: Does a dictionary lookup on each word in a data file.

file=words.data  # Data file from which to read words to test.

echo

while [ "$word" != end ]  # Last word in data file.
do
  read word      # From data file, because of redirection at end of loop.
  look $word > /dev/null  # Don't want to display lines in dictionary file.
  lookup=$?      # Exit status of 'look' command.

  if [ "$lookup" -eq 0 ]
  then
    echo "\"$word\" is valid."
  else
    echo "\"$word\" is invalid."
  fi  

done <"$file"    # Redirects stdin to $file, so "reads" come from there.

echo

exit 0

# ----------------------------------------------------------------
# Code below line will not execute because of "exit" command above.


# Stephane Chazelas proposes the following, more concise alternative:

while read word && [[ $word != end ]]
do if look "$word" > /dev/null
   then echo "\"$word\" is valid."
   else echo "\"$word\" is invalid."
   fi
done <"$file"

exit 0

sed, awk

Scripting languages especially suited for parsing text files and command output. May be embedded singly or in combination in pipes and shell scripts.

sed

Non-interactive "stream editor", permits using many ex commands in batch mode. It finds many uses in shell scripts.

awk

Programmable file extractor and formatter, good for manipulating and/or extracting fields (columns) in structured text files. Its syntax is similar to C.

wc

wc gives a "word count" on a file or I/O stream:
bash $ wc /usr/share/doc/sed-4.1.2/README 13 70 447 README [13 lines 70 words 447 characters]

wc -w gives only the word count.

wc -l gives only the line count.

wc -c gives only the byte count.

wc -m gives only the character count.

wc -L gives only the length of the longest line.

Using wc to count how many .txt files are in current working directory:
$ ls *.txt | wc -l # Will work as long as none of the "*.txt" files have a linefeed in their name. # Alternative ways of doing this are: # find . -maxdepth 1 -name \*.txt -print0 | grep -cz . # (shopt -s nullglob; set -- *.txt; echo $#) # Thanks, S.C.

Using wc to total up the size of all the files whose names begin with letters in the range d - h
bash$ wc [d-h]* | grep total | awk '{print $3}' 71832

Using wc to count the instances of the word "Linux" in the main source file for this book.
bash$ grep Linux abs-book.sgml | wc -l 50

See also Example 12-35 and Example 16-8.

Certain commands include some of the functionality of wc as options.
... | grep foo | wc -l # This frequently used construct can be more concisely rendered. ... | grep -c foo # Just use the "-c" (or "--count") option of grep. # Thanks, S.C.

tr

character translation filter.

Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.

Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.

The -d option deletes a range of characters.
echo "abcdef" # abcdef echo "abcdef" | tr -d b-d # aef tr -d 0-9 <filename # Deletes all digits from the file "filename".

The --squeeze-repeats (or -s) option deletes all but the first instance of a string of consecutive characters. This option is useful for removing excess whitespace.
bash$ echo "XXXXX" | tr --squeeze-repeats 'X' X

The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.

bash$ echo "acfdeb123" | tr -c b-d + +c+d+b++++

Note that tr recognizes POSIX character classes. [1]

bash$ echo "abcd2ef1" | tr '[:alpha:]' - ----2--1

Example 12-18. toupper: Transforms a file to all uppercase.

#!/bin/bash
# Changes a file to all uppercase.

E_BADARGS=65

if [ -z "$1" ]  # Standard check for command line arg.
then
  echo "Usage: `basename $0` filename"
  exit $E_BADARGS
fi  

tr a-z A-Z <"$1"

# Same effect as above, but using POSIX character set notation:
#        tr '[:lower:]' '[:upper:]' <"$1"
# Thanks, S.C.

exit 0

#  Exercise:
#  Rewrite this script to give the option of changing a file
#+ to *either* upper or lowercase.

Example 12-19. lowercase: Changes all filenames in working directory to lowercase.

#!/bin/bash
#
#  Changes every filename in working directory to all lowercase.
#
#  Inspired by a script of John Dubois,
#+ which was translated into Bash by Chet Ramey,
#+ and considerably simplified by the author of the ABS Guide.


for filename in *                # Traverse all files in directory.
do
   fname=`basename $filename`
   n=`echo $fname | tr A-Z a-z`  # Change name to lowercase.
   if [ "$fname" != "$n" ]       # Rename only files not already lowercase.
   then
     mv $fname $n
   fi  
done   

exit $?


# Code below this line will not execute because of "exit".
#--------------------------------------------------------#
# To run it, delete script above line.

# The above script will not work on filenames containing blanks or newlines.
# Stephane Chazelas therefore suggests the following alternative:


for filename in *    # Not necessary to use basename,
                     # since "*" won't return any file containing "/".
do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
#                             POSIX char set notation.
#                    Slash added so that trailing newlines are not
#                    removed by command substitution.
   # Variable substitution:
   n=${n%/}          # Removes trailing slash, added above, from filename.
   [[ $filename == $n ]] || mv "$filename" "$n"
                     # Checks if filename already lowercase.
done

exit $?

Example 12-20. Du: DOS to UNIX text file conversion.

#!/bin/bash
# Du.sh: DOS to UNIX text file converter.

E_WRONGARGS=65

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename-to-convert"
  exit $E_WRONGARGS
fi

NEWFILENAME=$1.unx

CR='\015'  # Carriage return.
           # 015 is octal ASCII code for CR.
           # Lines in a DOS text file end in CR-LF.
           # Lines in a UNIX text file end in LF only.

tr -d $CR < $1 > $NEWFILENAME
# Delete CR's and write to new file.

echo "Original DOS text file is \"$1\"."
echo "Converted UNIX text file is \"$NEWFILENAME\"."

exit 0

# Exercise:
# --------
# Change the above script to convert from UNIX to DOS.

Example 12-21. rot13: rot13, ultra-weak encryption.

#!/bin/bash
# rot13.sh: Classic rot13 algorithm,
#           encryption that might fool a 3-year old.

# Usage: ./rot13.sh filename
# or     ./rot13.sh <filename
# or     ./rot13.sh and supply keyboard input (stdin)

cat "$@" | tr 'a-zA-Z' 'n-za-mN-ZA-M'   # "a" goes to "n", "b" to "o", etc.
#  The 'cat "$@"' construction
#+ permits getting input either from stdin or from files.

exit 0

Example 12-22. Generating "Crypto-Quote" Puzzles

#!/bin/bash
# crypto-quote.sh: Encrypt quotes

#  Will encrypt famous quotes in a simple monoalphabetic substitution.
#  The result is similar to the "Crypto Quote" puzzles
#+ seen in the Op Ed pages of the Sunday paper.


key=ETAOINSHRDLUBCFGJMQPVWZYXK
# The "key" is nothing more than a scrambled alphabet.
# Changing the "key" changes the encryption.

# The 'cat "$@"' construction gets input either from stdin or from files.
# If using stdin, terminate input with a Control-D.
# Otherwise, specify filename as command-line parameter.

cat "$@" | tr "a-z" "A-Z" | tr "A-Z" "$key"
#        |  to uppercase  |     encrypt       
# Will work on lowercase, uppercase, or mixed-case quotes.
# Passes non-alphabetic characters through unchanged.


# Try this script with something like:
# "Nothing so needs reforming as other people's habits."
# --Mark Twain
#
# Output is:
# "CFPHRCS QF CIIOQ MINFMBRCS EQ FPHIM GIFGUI'Q HETRPQ."
# --BEML PZERC

# To reverse the encryption:
# cat "$@" | tr "$key" "A-Z"


#  This simple-minded cipher can be broken by an average 12-year old
#+ using only pencil and paper.

exit 0

#  Exercise:
#  --------
#  Modify the script so that it will either encrypt or decrypt,
#+ depending on command-line argument(s).

fold

A filter that wraps lines of input to a specified width. This is especially useful with the -s option, which breaks lines at word spaces (see Example 12-23 and Example A-1).

fmt

Simple-minded file formatter, used as a filter in a pipe to "wrap" long lines of text output.

Example 12-23. Formatted file listing.

#!/bin/bash

WIDTH=40                    # 40 columns wide.

b=`ls /usr/local/bin`       # Get a file listing...

echo $b | fmt -w $WIDTH

# Could also have been done by
#    echo $b | fold - -s -w $WIDTH
 
exit 0

Notes

[1]	This is only true of the GNU version of tr, not the generic version often found on commercial UNIX systems.

Prev	Home	Next
Time / Date Commands	Up	File and Archiving Commands