Introducing UNIX and Linux

Awk


Field and record separators

The fields in a record are normally separated by whitespace. This is not always convenient. Suppose a file (ages, say) contains a list of people's names and their ages:

John 13
Sue 12
James Smith 15
James Jones 14

The number of fields on each line varies. This is a potential problem. Let us suppose we wish to write a simple Awk script to display

John is 13 years old
Sue is 12 years old
James Smith is 15 years old
James Jones is 14 years old

There are several possible solutions. One that you will already be able to find checks the number of fields and performs a separate action each time:

NF == 2 { printf "%s is %d years old\n", $1, $2 }
NF == 3 { printf "%s %s is %d years old\n", $1, $2, $3 }

This solution is fine if you know how many names a person is likely to have, but it is not elegant, since there is a lot of duplication in the Awk script. If you were to allow persons with many forenames to appear in the list, the Awk script would become unmanageable. Loops, such as for and while loops, are provided in Awk, and although we do not discuss them here, they could be used to 'count over' the first few fields. However, the solution begins to get moderately complex if that method is adopted.

The reason the Awk scripts to perform this apparently simple task are less straightforward than you might expect is that the data has been coded unwisely. The fields are separated by characters which themselves appear in one of the fields, namely blanks. If the data had been

John:13
Sue:12
James Smith:15
James Jones:14

so that a colon (say) was used to separate the names from the numbers, then each line would have precisely two fields, and the spaces in the names would not matter. We can instruct Awk to use a different field separator to the usual whitespace by resetting the value of the variable FS; this should be done at the very start of the Awk script. Create a file called ages with the above names and ages in the 'colon-separated' format, and run the following Awk script:

BEGIN { FS=":" }
{ printf "%s is %d years old\n", $1, $2 }
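
If you would rather try the script without creating any files, the data can be supplied from the shell. The following is a minimal sketch: the function name show_ages is hypothetical (it does not appear elsewhere in this chapter), and the data is fed to awk on standard input with a here-document instead of being read from the file ages.

```shell
# show_ages is a hypothetical wrapper used only for this illustration.
# It reads colon-separated "name:age" lines from standard input.
show_ages() {
    awk '
        BEGIN { FS = ":" }                          # split each line at the colon
        { printf "%s is %d years old\n", $1, $2 }   # $1 = full name, $2 = age
    '
}

# Supply the sample data with a here-document.
show_ages <<'EOF'
John:13
Sue:12
James Smith:15
James Jones:14
EOF
```

Since every line now has exactly two fields, a single action is enough, however many names a person has.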

The field separator can be any ERE, and can also be changed by giving awk the option -F followed by that ERE. For instance, to allow a sequence of one or more blanks, commas and colons to separate fields, you might have

awk -F "[ ,:]+"
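
As a small sketch of how such a separator behaves, the following passes two invented lines through awk and prints the number of fields on each line, followed by its first and last fields. The function name split_fields and the sample input are assumptions made for the illustration.

```shell
# split_fields is a hypothetical helper for illustration only.
# Any run of blanks, commas and colons counts as ONE field separator.
split_fields() {
    awk -F "[ ,:]+" '{ print NF, $1, $NF }'
}

# In the first line, ", " is a single separator, so the line has three fields.
printf '%s\n' 'red, green:blue' 'one:two' | split_fields
```

The first line should be reported as having three fields and the second as having two.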

On your UNIX system there should be a file called /etc/passwd which contains information about users on your system. This file consists of a sequence of lines which look like:

chris:hi64MH4uhJiq2:1623:103:Chris Cringle:/home/ugrad/chris:
sam:a8PyPVSiPVXT6:1628:103:Sam Smith:/home/ugrad/sam:/bin/sh
jo:9gqrX4IOig7qs:1631:103:Jo Jones:/home/ugrad/jo:/bin/sh
geo:58esMw4xFsZ9I:1422:97:George Green:/home/staff/geo:/bin/sh
   ...

Each line contains seven colon-separated fields; these represent the following:

A user's username (e.g. chris)
That user's encrypted password (e.g. hi64MH4uhJiq2). Passwords are usually stored in a coded form; if you know a password, it's easy to encrypt it, but virtually impossible to take an encrypted password and decode it. So it's safe for the encrypted passwords to be accessible by everyone. Having said this, some UNIX implementations - especially networked systems - impose a higher degree of security and do not allow the encrypted passwords to be accessed. In that case, the second field will be replaced by some other value.
The user's user-id.
The user's group-id.
The user's 'real' name; sometimes this field will also include other information, such as the user's office phone number or course of study.
The user's home directory.
The user's login shell (if empty, defaults to /bin/sh).
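
The field numbers above map directly onto $1 to $7 once the field separator is a colon. The following sketch, which assumes password data in the format just described arriving on standard input, reports each user's ids and login shell, substituting /bin/sh when the seventh field is empty. The function name list_shells is hypothetical.

```shell
# list_shells is a hypothetical helper; it reads passwd-format lines on stdin.
list_shells() {
    awk '
        BEGIN { FS = ":" }
        {
            shell = $7
            if (shell == "") shell = "/bin/sh"   # empty seventh field: default shell
            printf "%s (uid %d, gid %d) uses %s\n", $1, $3, $4, shell
        }
    '
}

# Two of the sample lines shown above; the entry for chris has an empty seventh field.
list_shells <<'EOF'
chris:hi64MH4uhJiq2:1623:103:Chris Cringle:/home/ugrad/chris:
sam:a8PyPVSiPVXT6:1628:103:Sam Smith:/home/ugrad/sam:/bin/sh
EOF
```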

Some systems which 'hide' the encrypted passwords will also have another mechanism for storing the data normally in /etc/passwd. If you find that this file either does not exist, or does not contain the information just described, then it is likely to be available using a special command. A common method of organising users' data over a network uses a system called NIS. To display the password file using NIS you should type

ypcat passwd

and the data will be sent to standard output.

Worked example

Using Awk and /etc/passwd write a shell script findname to take an argument, which is a usercode, and display the name of the user who owns that usercode.
Solution: We need to look at fields 1 and 5 of the password file; if field 1 is the shell script argument we display field 5.

# As usual, make sure the script has one argument ...
if   [ $# -ne 1 ]
then echo "findname requires one argument"
     exit 1
fi

awk '
      # Set field separator to :
      BEGIN { FS=":" }
      {
        # Is the first field the usercode?
        if ($1 == usercode)

          # If yes, print out field 5, the user's name
          printf "%s\n", $5 }
    
    ' usercode="$1" < /etc/passwd
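
The same lookup can be written more compactly using a pattern instead of an if statement, with the standard awk options -F (to set the field separator) and -v (to assign a variable before the script starts). This is offered as an alternative sketch, not as a replacement for the solution above; the function name lookup_name is hypothetical, and it reads the password data from standard input so that it can be tried on sample lines.

```shell
# lookup_name is a hypothetical variant of findname: $1 is the usercode,
# and passwd-format lines are read from standard input.
lookup_name() {
    awk -F: -v usercode="$1" '$1 == usercode { print $5 }'
}

# Try it on sample data; this should display the name belonging to "sam".
lookup_name sam <<'EOF'
chris:hi64MH4uhJiq2:1623:103:Chris Cringle:/home/ugrad/chris:
sam:a8PyPVSiPVXT6:1628:103:Sam Smith:/home/ugrad/sam:/bin/sh
EOF
```

On a real system the here-document would be replaced by a redirection such as < /etc/passwd.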

Just as we can specify what should separate fields within a record, so we can specify what should separate records. Unless otherwise specified, a record is a line of input, so the record separator is the Newline character. The special variable used to change this is RS.

Worked example

Write an Awk script to read standard input containing a list of company names and phone numbers, together with other information. All companies in the input with the keyword Anytown as part of their data should be displayed. In the input, the data for each company is separated from the next by a line containing a single % symbol:

Toytown Telecom
Birmingham
0121 123 4567
Sells phones and answering machines
%
Sue, Grabbit and Runne
Solicitors
London
020 7999 9999
%
Chopham, Sliceham and Son
Anytown 234
family butchers

So with this data, the output would be:

Chopham, Sliceham and Son
Anytown 234
family butchers

Solution: Set the record separator to a %.

BEGIN     { RS="%" }     # Set RS
/Anytown/ { print $0 }   # Print records matching "Anytown"
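
Because the % symbols sit on lines of their own, a record selected in this way begins and ends with a Newline, so printing $0 unchanged also displays a blank line before and after the company's data. The following sketch, which assumes the input format described above, removes those Newlines with the standard function sub before printing. The function name find_anytown is hypothetical.

```shell
# find_anytown is a hypothetical wrapper around the worked example.
find_anytown() {
    awk '
        BEGIN { RS = "%" }
        /Anytown/ {
            sub(/^\n+/, "")   # drop the Newline left over from the preceding % line
            sub(/\n+$/, "")   # drop the Newline before the next % (or end of input)
            print
        }
    '
}

find_anytown <<'EOF'
Toytown Telecom
Birmingham
0121 123 4567
Sells phones and answering machines
%
Sue, Grabbit and Runne
Solicitors
London
020 7999 9999
%
Chopham, Sliceham and Son
Anytown 234
family butchers
EOF
```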

You must be very careful if you reset the record separator. If the Newline character is no longer the record separator, any Newlines will be a part of the record. Unless the field separator is an ERE which allows a Newline, each such Newline will also be part of one of the fields. You will seldom need to reset the record separator.
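
The effect on fields can be seen by counting them. In the following sketch the input (invented for the illustration) consists of two records separated by a %, the first of which spans two lines. The function name count_fields is hypothetical; its argument is the field separator to try.

```shell
# count_fields is a hypothetical helper: $1 is the field separator to use.
# It prints the number of fields in each %-separated record.
count_fields() {
    awk -F "$1" 'BEGIN { RS = "%" } { print NF }'
}

# printf needs %% to produce a literal %.
printf 'a:b\nc:d%%e:f' | count_fields ':'       # the Newline stays inside a field
printf 'a:b\nc:d%%e:f' | count_fields '[:\n]'   # the Newline also separates fields
```

With a colon alone as the field separator, the first record has three fields, the second of which contains a Newline. With the ERE that also allows a Newline, it has four.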

Although the function print has been mentioned briefly, we have so far used the function printf as the usual means of displaying output from awk. This is because printf is very flexible. For simple output, print can be 'tailored' to individual requirements by use of the output field and output record separators OFS and ORS. When print takes several arguments, they will be printed out separated by the value of OFS (normally Space), and each record will be terminated by ORS (normally Newline).
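
A short sketch may make the roles of OFS and ORS clearer. The separator values chosen here, and the function name join_fields, are assumptions made purely for illustration. Note that OFS is inserted only between arguments separated by commas; arguments written next to each other without commas are simply joined together.

```shell
# join_fields is a hypothetical helper for illustration only.
join_fields() {
    awk 'BEGIN { OFS = "-"; ORS = ";\n" }
         {
             print $1, $2, $3   # three arguments: OFS goes between them
             print $1 $2 $3     # one argument: the fields are concatenated
         }'
}

echo 'a b c' | join_fields
```

The first print should display a-b-c; and the second abc; since each is terminated by the value of ORS.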

Worked example

Write an Awk script to read in the password file and display users' names and home directories, in the following format:

Chris Cringle has home directory /home/ugrad/chris.
Sam Smith has home directory /home/ugrad/sam.
  ...

Solution: Use print to display the fifth and sixth fields of /etc/passwd. Set the input field separator to a colon, the output field separator to

has home directory

and the output record separator to a full stop followed by Newline.

awk ' BEGIN { FS=":"
              OFS=" has home directory "
              ORS=".\n" }
 { print $5,$6 }' </etc/passwd
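
For comparison, the same output format, including the full stop at the end of each line, can be produced with printf, which needs neither OFS nor ORS. This is a sketch only: the function name home_dirs is hypothetical, and it reads password data from standard input so that it can be tried on the sample lines from earlier in this section.

```shell
# home_dirs is a hypothetical helper; it reads passwd-format lines on stdin.
home_dirs() {
    awk 'BEGIN { FS = ":" }
         { printf "%s has home directory %s.\n", $5, $6 }'
}

home_dirs <<'EOF'
chris:hi64MH4uhJiq2:1623:103:Chris Cringle:/home/ugrad/chris:
sam:a8PyPVSiPVXT6:1628:103:Sam Smith:/home/ugrad/sam:/bin/sh
EOF
```

Which form to prefer is largely a matter of taste: print with OFS and ORS keeps the action short, whereas printf keeps the whole output format in one place.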