Introducing UNIX and Linux


Grep

We have defined regular expressions; to use them, we begin with a utility called Grep. Grep selects lines from its input (either standard input or files named as arguments) that match a BRE, normally given as the first argument to grep. The BRE is known as the pattern. Lines of input that match the BRE are copied to standard output. For instance, to print all words ending in ise or ize from /usr/dict/words, you could use:

grep 'i[sz]e$' /usr/dict/words

Note that single quotes are needed here because the BRE contains characters ($ and the square brackets) that are special to the shell.
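To try the selection on a small scale without relying on a system word list, the following sketch builds a short file and runs the same BRE over it. The file name words.txt and the five words in it are invented for illustration.

```shell
# Build a small word list in a temporary directory (contents are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' realise size sizes prize wiser > "$tmpdir/words.txt"

# Select only the lines that end in "ise" or "ize".
matches=$(grep 'i[sz]e$' "$tmpdir/words.txt")
printf '%s\n' "$matches"

rm -r "$tmpdir"
```

Here sizes and wiser are rejected because the $ anchors the match to the end of the line.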

With option -E, grep uses EREs instead of BREs. With option -F, grep uses only fixed strings: no regular expressions are involved, and the string given as an argument to grep is matched against the input exactly as it appears. With option -c, a count of the number of matched lines is displayed instead of the matched lines themselves.
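The difference between these three options can be seen in a minimal sketch, assuming a four-line sample file whose name (sample.txt) and contents are made up for the purpose.

```shell
# Four sample lines, chosen so that each option selects a different set.
tmpdir=$(mktemp -d)
printf '%s\n' 'a+b' 'aab' 'a.b' 'axb' > "$tmpdir/sample.txt"

# ERE: + means "one or more of the preceding item", so only aab matches.
ere=$(grep -E 'a+b' "$tmpdir/sample.txt")

# Fixed string: the dot is taken literally, so only a.b matches.
fixed=$(grep -F 'a.b' "$tmpdir/sample.txt")

# Count: as a BRE the dot matches any single character, so all four lines match.
count=$(grep -c 'a.b' "$tmpdir/sample.txt")

printf 'ERE: %s\nfixed: %s\ncount: %s\n' "$ere" "$fixed" "$count"
rm -r "$tmpdir"
```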

Worked example

How many words in /usr/dict/words begin with a vowel?
Solution: Use grep with option -c to select and then count lines beginning with upper-case or lower-case vowels. The BRE contains a list of all such vowels, preceded by a ^ to indicate that the vowel must be at the start of the line, and so at the start of the word:

grep -c '^[AEIOUaeiou]' /usr/dict/words

On some systems separate commands egrep ('Extended GREP') and fgrep ('Fixed GREP') are used instead of grep -E and grep -F.

Option -i ('insensitive') causes grep to ignore the case of letters when checking for matches, and overrides any explicit specification regarding upper-case and lower-case letters in the regular expression. Thus a solution to the previous worked example could be:

grep -ci '^[aeiou]' /usr/dict/words
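That the two forms give the same count can be checked on a small file. This is a sketch only: the six words and the file name words.txt are assumptions, standing in for /usr/dict/words.

```shell
# A short stand-in for the system word list (contents are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' Apple banana Egg igloo Zebra under > "$tmpdir/words.txt"

# Count initial vowels by listing both cases explicitly ...
explicit=$(grep -c '^[AEIOUaeiou]' "$tmpdir/words.txt")

# ... and again by listing one case and ignoring case with -i.
insensitive=$(grep -ci '^[aeiou]' "$tmpdir/words.txt")

printf 'explicit: %s, with -i: %s\n' "$explicit" "$insensitive"
rm -r "$tmpdir"
```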

With option -f ('file') followed by a filename, the regular expressions contained in that file (one per line) are used instead of one given as an argument to grep. If the file contains more than one regular expression, then Grep selects lines that match any of the REs in the file. This is the preferred way to have Grep select lines when there is a choice of matching specifications.
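As a minimal sketch of option -f, the following puts two BREs in a file and selects the lines that match either of them. The file names patterns.txt and words.txt, and their contents, are hypothetical.

```shell
tmpdir=$(mktemp -d)

# One BRE per line: lines starting with a, or lines ending with z.
printf '%s\n' '^a' 'z$' > "$tmpdir/patterns.txt"
printf '%s\n' apple banana jazz quiz cherry > "$tmpdir/words.txt"

# A line is selected if it matches any of the BREs in patterns.txt.
selected=$(grep -f "$tmpdir/patterns.txt" "$tmpdir/words.txt")
printf '%s\n' "$selected"

rm -r "$tmpdir"
```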

The 'reverse behaviour', namely displaying those lines that do not match the RE specified, can be enabled with option -v ('inVert'). This is often simpler than constructing a new regular expression. It might be useful, for example, to a FORTRAN programmer. The computer language FORTRAN treats any line starting with a C as a comment; if you were examining a FORTRAN program and wished to search for lines of code containing some identifier, without being shown the comment lines, you might use

grep -v '^C'

to strip out the comments to begin with.
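One way to combine the two steps is a pipeline in which grep -v removes the comments and a second grep searches what is left. The sketch below assumes a small hypothetical program file prog.f and an identifier TOTAL, both invented for illustration.

```shell
tmpdir=$(mktemp -d)

# A tiny fixed-form FORTRAN fragment (illustrative only).
cat > "$tmpdir/prog.f" <<'EOF'
C     TOTAL holds the running sum
      TOTAL = 0
      DO 10 I = 1, 5
      TOTAL = TOTAL + I
   10 CONTINUE
EOF

# First discard comment lines, then look for the identifier in the code.
code=$(grep -v '^C' "$tmpdir/prog.f" | grep 'TOTAL')
printf '%s\n' "$code"

rm -r "$tmpdir"
```

The comment line mentions TOTAL but is not displayed, because it has been removed before the second grep sees it.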

If grep is given several files as arguments, option -l ('list') displays a list of those files containing a matching line, rather than those lines themselves.
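A short sketch of option -l, assuming three hypothetical files created just for the demonstration:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' alpha beta  > "$tmpdir/one.txt"
printf '%s\n' gamma delta > "$tmpdir/two.txt"
printf '%s\n' beta beta   > "$tmpdir/three.txt"

# Work inside the directory so that the listed names are short.
# Each file with at least one matching line is named once.
listed=$(cd "$tmpdir" && grep -l 'beta' one.txt two.txt three.txt)
printf '%s\n' "$listed"

rm -r "$tmpdir"
```

Note that three.txt is listed only once even though two of its lines match.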

Worked example

Suppose you have saved many mail messages in files in the current directory, and you want to check which file or files contain messages whose subject is something to do with 'examinations'. Each mail message contains a line beginning with the string Subject: followed by the subject of the message (if any).
Solution: We require grep -l followed by a BRE followed by * to list the filenames. The following lines might occur as the 'subject' lines of the messages:

Subject: Examinations
Subject: examinations
Subject: NEXT MONTH'S EXAMS
Subject: Exams

These all have a common string, namely exam, in upper-case or lower-case (or a mixture of cases). So, to match these lines, a BRE is required to recognise Subject: at the start of the line, followed by some characters (possibly none), followed by exam in any mixture of cases. The Subject: at the start of the line is matched by ^Subject: and .* matches the characters between that and exam. To ensure that the cases of the letters in exam do not matter, you can either explicitly match them with [Ee][Xx][Aa][Mm], or you can instruct grep to be 'case-insensitive' with option -i. Either of the following two solutions would be acceptable:

grep -l '^Subject: .*[Ee][Xx][Aa][Mm]' *
grep -li '^Subject: .*exam' *

Note that this is not an infallible solution. It will also select files with subjects related to counterexamples and hexameters, and will not find a file whose subject misspells the word examinations. When using UNIX tools to process data from electronic mail or other documents containing English text, you must be conscious of human fallibility. Some solutions will of necessity be approximate.
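The over-matching can be reproduced with a few made-up messages. In this sketch the file names and message contents are assumptions; only the grep command is taken from the worked example.

```shell
tmpdir=$(mktemp -d)

# Three hypothetical saved messages.
printf '%s\n' "Subject: NEXT MONTH'S EXAMS" 'Please revise.' > "$tmpdir/exams.msg"
printf '%s\n' 'Subject: Lunch on Friday' 'The exam room is booked.' > "$tmpdir/lunch.msg"
printf '%s\n' 'Subject: A counterexample to your claim' > "$tmpdir/proof.msg"

# List the files whose Subject: line contains exam in any mixture of cases.
found=$(cd "$tmpdir" && grep -li '^Subject: .*exam' *)
printf '%s\n' "$found"

rm -r "$tmpdir"
```

The file lunch.msg is correctly skipped, since exam appears only in the body, but proof.msg is listed alongside exams.msg because counterexample contains the string exam.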


Copyright © 2002 Mike Joy, Stephen Jarvis and Michael Luck