Introducing UNIX and Linux


Collating sequence

Before considering these filters we must digress with some remarks about characters. Specifically, we must ask the question: 'how are they ordered?' We have already remarked that each character is assigned a code (normally its ASCII representation), and that the ordering of characters corresponds to the numerical order of the codes. So, for instance, the code for b is one greater than the code for a. There are two possible problems with this: first, ASCII is not necessarily the encoding in use, and secondly, the set of characters, and the order in which they are expected to appear, differs from one language to another. Although most UNIX systems use standard English/American and a standard keyboard, POSIX allows for user interfaces consistent with other languages and equipment. Where, for instance, do accented letters fit in the alphabet, or completely different letters such as Greek? We therefore have a concept called a collating sequence, which is a specification of the logical ordering for the character set you are using. In practice, this ordering matters mainly for letters and digits, although it is defined for the whole character set. The collating sequence can be changed in POSIX by amending the locale.
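As a small illustration of how the locale affects ordering, the following sketch sorts three words in the POSIX ("C") locale, where characters are ordered by their numeric codes. The use of sort and of the LC_ALL variable is an assumption made for this example and is not taken from the text above; the three words are arbitrary.

```shell
# Sort three words in the POSIX ("C") locale, where the collating
# sequence follows the numeric character codes. In ASCII every
# uppercase letter has a lower code than every lowercase letter,
# so "Cherry" comes first.
sorted=$(printf '%s\n' banana Cherry apple | LC_ALL=C sort)
printf '%s\n' "$sorted"
# prints:
# Cherry
# apple
# banana
```

Under a different locale (for instance an English one, if it is installed on your system) the same command would typically ignore case when ordering and list apple first. The exact result depends on the locales available.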

In the following discussion we will refer to ranges, which are collections of characters that are consecutive within the collating sequence. A range is specified by a first and by a last character, separated by a hyphen. For instance,

b-z

refers to the characters between b and z, inclusive, in the current collating sequence.

Characters come in various familiar flavours: there are letters, digits, punctuation marks, and so on. These are character classes, and some utilities use a special notation to refer to them. A class is written as its name enclosed between [: and :]; for instance, [:alpha:] denotes the letters and [:digit:] the digits.
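To make ranges and character classes concrete, here is a minimal sketch using the tr utility, which accepts both notations. The choice of tr, the sample input strings, and the variable names are assumptions made for illustration. The locale is fixed to "C" so that the range b-z means exactly the lowercase ASCII letters from b to z.

```shell
# Delete every character that is NOT in the range b-z
# (-c complements the set, -d deletes). The letter a and the
# trailing newline are removed.
in_range=$(printf 'abcxyz\n' | LC_ALL=C tr -cd 'b-z')
printf '%s\n' "$in_range"    # prints: bcxyz

# Map each member of the class [:lower:] to the corresponding
# member of [:upper:]; other characters pass through unchanged.
upper=$(printf 'hello, world 42\n' | LC_ALL=C tr '[:lower:]' '[:upper:]')
printf '%s\n' "$upper"       # prints: HELLO, WORLD 42

# Keep only the characters in the class [:digit:].
digits=$(printf 'hello, world 42\n' | LC_ALL=C tr -cd '[:digit:]')
printf '%s\n' "$digits"      # prints: 42
```

A class name is usually safer than a range when the locale may vary, since [:lower:] always means "the lowercase letters of the current locale", whereas the members of a range such as b-z depend on the collating sequence.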


Copyright © 2002 Mike Joy, Stephen Jarvis and Michael Luck