
Filtering files

Text files - especially ones containing 'raw data' - often contain repeated lines. It is sometimes useful either to know how often this occurs or to filter out the repeated occurrences. The command uniq ('unique') is provided for this purpose. For instance, suppose file A contains

aaa
bbb
bbb
bbb
bbb
bbb
ccc
ccc
aaa
ddd

then the following dialogue might take place:

uniq A
aaa
bbb
ccc
aaa
ddd
uniq -c A
      1 aaa
      5 bbb
      2 ccc
      1 aaa
      1 ddd

With no options, uniq simply filters out consecutive repeated lines; option -c ('count') prepends each line of output with a count of the number of times that line occurred consecutively. Option -d ('duplicate') causes uniq to write out only one copy of each line that is repeated consecutively, and -u ('unique') to write out only those lines that are not repeated consecutively. Thus:

uniq -d A
bbb
ccc
uniq -u A
aaa
aaa
ddd

Another common situation arises when you have two or more files containing what can be thought of as the columns of a table, and you require corresponding lines from the files to be concatenated so as to produce that table. The command paste achieves this - corresponding lines of its arguments are joined together, separated by a single TAB character. For example, suppose file A contains

hello
Chris

and file B contains

there
how are you?

then the following dialogue can take place:

paste A B
hello there
Chris how are you?
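
On most systems paste also accepts option -d followed by the delimiter to use in place of the TAB; for instance, with the same two files, a comma could be requested thus:

paste -d, A B
hello,there
Chris,how are you?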

Both paste and uniq, though of use only in limited situations, can save a great deal of time editing files when they do apply.

Sometimes, when dealing with files that are presented in a rigid format, you may wish to select character columns from such a file. The utility cut is a very simple method for extracting columns. Suppose we have a file myfile containing the following data (dates of birth and names):

17.04.61 Smith Fred
22.01.63 Jones Susan
03.11.62 Bloggs Zach

We can choose the years from each line by selecting character columns 7 to 8, thus:

cut -c7-8 myfile
61
63
62

This command can also distinguish between fields (where a line is thought of as divided into fields separated by a known delimiter). To select the family names from myfile (Smith, Jones and Bloggs), we could use cut -f2 -d' ' myfile, which specifies that we select field number 2 where the delimiter (option -d) is the space character:

cut -f2 -d' ' myfile
Smith
Jones
Bloggs
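
In the same way, the first names (Fred, Susan and Zach) could be selected by choosing field number 3:

cut -f3 -d' ' myfile
Fred
Susan
Zach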

Related to cut is fold; cut assumes that you want the same number of lines in the output as in the input, but wish to select only part of each input line. On the other hand, fold assumes that you want all of your input, but that your output needs to fit within lines of some maximum width - for example, if you had a file containing some very long lines that needed printing on a fairly narrow printer. The action performed by fold is to copy its standard input, or the files named as arguments, to standard output, but whenever a line longer than a certain number of characters (by default 80) is met, a Newline character is inserted at that point. With option -w ('width') followed by a number, that number is taken as the maximum length of output lines instead of 80. Try the following:

fold -w 15 <<END
Let's start
with three
short lines
and finish with an extremely long one with lots of words
END
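
The three short lines should pass through unchanged, while the long final line is broken every 15 characters, giving output along these lines:

Let's start
with three
short lines
and finish with
 an extremely l
ong one with lo
ts of words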

For more sophisticated processing of files divided into records and fields we can use Awk (see the chapter on Awk later).

Another exceptionally useful command is sort, which sorts its input into alphabetical order line by line. It has many options, and can sort on a specified field of the input rather than the first, or numerically (using option -n, 'numerical') rather than alphabetically. So using file A above, we could have:

sort A
aaa
aaa
bbb
bbb
bbb
bbb
bbb
ccc
ccc
ddd
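
To sort on a field other than the first, most modern versions of sort accept option -k followed by a field number; for example, myfile above could be ordered by family name (field 2) with:

sort -k2 myfile
03.11.62 Bloggs Zach
22.01.63 Jones Susan
17.04.61 Smith Fred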

A feature of uniq is that it will only filter out repeated lines if they are consecutive; if we wish to display each line that occurs in a file once and only once, we can first sort the file into order and then use uniq:

sort A | uniq
aaa
bbb
ccc
ddd

This has the same effect as using sort with option -u, which we have already mentioned.
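
Combining sort and uniq -c in the same way gives a quick frequency count of the lines in a file, and a final numeric sort then orders the lines from least to most common. A small sketch using file A (the exact spacing of the counts may differ between systems):

sort A | uniq -c | sort -n
      1 ddd
      2 aaa
      2 ccc
      5 bbb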

Worked example

Find out how many separate inodes are represented by the files (excluding 'dot' files) in the current directory.
Solution: Using ls -i1 we can list the files, one per line, each preceded by its inode number. Piping the output into cut, we can isolate the first six character columns, which contain the inode number, and then sort with option -u will sort these into order and remove all duplicates. Finally, wc -l counts the number of lines of output, and hence the number of distinct inodes:

ls -i1 | cut -c1-6 | sort -u | wc -l


Copyright © 2002 Mike Joy, Stephen Jarvis and Michael Luck