Introducing UNIX and Linux


Basic regular expressions

The general idea is just like pattern matching: a BRE consists of a sequence of characters, some of which have a special meaning. The BRE is said to match a string if

  • each part of the BRE with special meaning corresponds to a part of the string, and
  • the other individual characters in the BRE and the string correspond.

In order to check whether a BRE matches a string, the two strings are examined working from left to right. Each time a match is found, the corresponding parts of the BRE and the string are discarded and the process continues immediately after them.

First of all, we consider how to specify a match for a single character. For this we use a BRE called a bracket expression, which is an expression enclosed in square brackets ([]). The expression enclosed by the brackets is either a matching list or a nonmatching list. A matching list consists of a sequence of:

  • single characters (escaped, if necessary),
  • ranges (as described above for tr),
  • character classes (as for tr),

and a character matches a matching list if it matches any of the patterns that make up that sequence. The following BRE matches the letters a, x, y, z and any digit.

[ax-z[:digit:]]
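
One way to try this bracket expression is with grep, which treats its pattern as a BRE by default. The following is only a small sketch: the candidate characters are made up for illustration, and LC_ALL=C is set as an assumption so that the range x-z follows plain ASCII order rather than the collating sequence of whatever locale is in use.

```shell
# Assumption: use the C locale so that x-z means exactly x, y and z.
export LC_ALL=C

# Feed one candidate character per line; grep prints each line that
# contains a character from the matching list.
matched=$(printf '%s\n' a b x z 7 Q '#' | grep '[ax-z[:digit:]]')
echo "$matched"
```

Only the lines a, x, z and 7 are printed; b, Q and # are not in the matching list.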

If a matching list is preceded by a circumflex (^), it becomes a nonmatching list and matches any single character not specified in that list. So ^ corresponds to ! in pattern matching. For example,

[^[:upper:]#]

will match any character that is neither an upper-case letter nor the symbol #. If you wish to include a literal hyphen in a bracket expression, you must place it as either the first or the last character in the list, so that

[-xyz]

will match x, y, z or -. A dot (.), when not enclosed in square brackets, matches any single character. To match a string containing more than one character, you can concatenate single characters, dots and bracket expressions. So,

[Cc]hris

will match Chris or chris, and no other string;

[[:alpha:]]..

will match any three-character string commencing with a letter. More generally, if you follow a bracket expression (or a single character or a dot) with an asterisk (*), that expression together with the * will match zero or more consecutive occurrences of whatever the expression matches. So

[[:digit:]][[:digit:]][[:digit:]]*

will match any string consisting of two or more digits. When used at the beginning and end of a BRE, the two characters ^ and $ indicate the start and end of a string respectively, so

^A.*E$

will match any string commencing with A and terminating with E, including ANGLE and AbbreviatE but not DALE or Alpha.
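
The BREs above can be checked with grep. This is a sketch under stated assumptions: whole_match and report are hypothetical helper functions written for this illustration, the test strings are made up, and LC_ALL=C is set so that character classes and ranges behave as in plain ASCII. The helper uses grep -x, which requires the BRE to match the entire line, mirroring the sense of "match a string" used in the text.

```shell
# Assumption: use the C locale for predictable classes and ranges.
export LC_ALL=C

# Hypothetical helper: succeeds if BRE $1 matches the whole of string $2.
whole_match() {
    printf '%s\n' "$2" | grep -q -x -e "$1"
}

# Hypothetical helper: print the outcome of whole_match in words.
report() {
    if whole_match "$1" "$2"; then
        echo "'$1' matches '$2'"
    else
        echo "'$1' does not match '$2'"
    fi
}

report '[^[:upper:]#]' 'q'       # nonmatching list
report '[^[:upper:]#]' 'Q'
report '[-xyz]' '-'              # hyphen placed first in the list
report '[Cc]hris' 'chris'        # concatenation
report '[Cc]hris' 'Christine'
report '[[:alpha:]]..' 'a1!'     # a letter followed by two dots
report '[[:digit:]][[:digit:]][[:digit:]]*' '2002'
report '[[:digit:]][[:digit:]][[:digit:]]*' '7'

# The anchors ^ and $ need no -x; plain grep selects the matching lines.
anchored=$(printf '%s\n' ANGLE AbbreviatE DALE Alpha | grep '^A.*E$')
echo "$anchored"
```

The final command prints ANGLE and AbbreviatE only, as described above.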

Worked example

What BRE will match a string that is just a sequence of digits?
Solution: One digit is matched by [[:digit:]], zero or more digits are matched by [[:digit:]]*, and so [[:digit:]][[:digit:]]* will match one or more. The BRE will commence with ^ and end with $, to indicate that this is exactly what the string will contain, and will not have other characters at the start or at the end. The answer is therefore

^[[:digit:]][[:digit:]]*$
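
To see why the anchors matter, the answer can be compared with the same BRE without ^ and $. This is a minimal sketch using grep; the sample lines are invented for illustration, and LC_ALL=C is an assumption made so that [:digit:] means the ASCII digits 0 to 9.

```shell
# Assumption: use the C locale so [:digit:] is exactly 0-9.
export LC_ALL=C

# Anchored: only lines made up entirely of digits are selected.
all_digits=$(printf '%s\n' 2002 7 '' 12a ' 42' 007 |
    grep '^[[:digit:]][[:digit:]]*$')
echo "$all_digits"

# Unanchored: any line containing at least one digit is selected,
# so count how many of the same lines get through.
unanchored_count=$(printf '%s\n' 2002 7 '' 12a ' 42' 007 |
    grep -c '[[:digit:]][[:digit:]]*')
echo "$unanchored_count"
```

The anchored BRE selects 2002, 7 and 007. Without the anchors, 12a and the line with a leading space are selected as well, giving five lines in total; the empty line is rejected in both cases because at least one digit is required.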


Copyright © 2002 Mike Joy, Stephen Jarvis and Michael Luck