
UNIX: Data Tools

Data and Text Manipulation in the UNIX Environment
Regular expressions
Using grep to Search Text Files
Line Editing Files with sed and tr
Parsing Files With awk
Comparing Files with diff
Cutting and Pasting
Using join to Merge Files
Controlling Flow with Loops
A Final Note
 Data and Text Manipulation in the UNIX Environment
This document is intended for relatively advanced UNIX users interested in manipulating text and data sets. The document assumes that users are working in the ksh, though the topics covered apply to other UNIX environments as well.
 Regular Expressions
Each of the tools examined in this document utilizes regular expressions to perform its task. Without them, data and text manipulation would be far more difficult. Constructing and using regular expressions can be a daunting task. This section should make them better understood and more manageable.
What is a regular expression?
A regular expression is a text pattern that combines normal text characters and special characters, called metacharacters, to create a single entity that can have various interpretations. What does this mean? Well, it means that we can create a single text pattern that can have many fixed strings that match it. Okay, now what does this mean? It means that we can use a regular expression to identify all of the occurrences of the text pattern, based on some common feature (though the occurrences can be quite different). Let's look at an example.
Examples of regular expressions
Let's assume that we have a text file and we want to search through it for all occurrences of the words "pack," "puck," and "pick." We can do this with the grep utility (to be fully explained later) by searching for each of the fixed strings as follows:
grep 'pack' filename
grep 'puck' filename
grep 'pick' filename
By searching for each of these fixed strings, we will find all occurrences of the three words in the file specified as filename. This task, however, can be performed more efficiently with a regular expression, as demonstrated below:
grep 'p[aui]ck' filename
The open and close brackets are special characters (again, metacharacters) interpreted by UNIX programs to mean that any one of the characters listed in the brackets will fulfill the search requirements. Thus, here we complete our task in one command using a regular expression, where before it took us three commands to do the same thing using fixed strings.
Filename expansion versus pattern matching
Metacharacters are interpreted differently by the UNIX shell (in this case, the ksh) than they are by pattern-matching utilities like grep, awk, and sed. As a result, the order in which metacharacters are interpreted is an important consideration. When commands are issued at the UNIX prompt, metacharacters are first seen by the shell and then by the program (grep, etc.). Here's an example:
grep [a-z]* somefile
The shell will read the above regular expression for filename expansion. It will look for any file in the current directory whose name starts with a lowercase letter from a to z, followed by any string of characters (the shell interprets the * symbol as any string of characters, including the empty string). So, assuming these files exist, this command could be expanded to:
grep alpha.txt beta.txt lambda.txt omega.c somefile
The grep utility would then try to find the pattern alpha.txt in the files beta.txt, lambda.txt, omega.c and somefile. This is not what was intended. To keep the shell from erroneously interpreting metacharacters for filename expansion, enclose your regular expressions in quotes (double quotes suffice in most cases, but single quotes are best). The command should look like this:
grep '[a-z]*' somefile
Some common metacharacters and the programs that use them
Some metacharacters are valid in one program but unsupported in another. To make these metacharacters more readily accessible, a brief table of the more common metacharacters appears below, along with their descriptions and the programs that support them.
Operator  Usage         Meaning                            grep  sed  awk
   .      .             Matches any single character        y     y    y
                        other than a newline
   *      char*         Matches any number >= 0 of the      y     y    y
                        preceding character
   ^      ^string       string must occur at beginning      y     y    y
                        of a line
   $      string$       string must occur at end of a       y     y    y
                        line
   []     [abc]         Matches any single character        y     y    y
                        from the list
  \{\}    char\{n,m\}   Matches from n to m occurrences     n     y    n
                        of the preceding char,
                        n, m >= 0 and <= 256
   \<     \<string      string must occur at beginning      n     n    n
                        of a word
   \>     string\>      string must occur at end of a       n     n    n
                        word
   +      string+       Matches one or more of the          n     n    y
                        preceding string
   ?      string?       Matches zero or one of the          n     n    y
                        preceding string
   |      string1       Matches either of the separated     n     n    y
          |string2      character strings

To learn more about using metacharacters for data management, read the manual pages for these tools or consult a general UNIX reference.
Note: Remember, it is not necessary to include a metacharacter in a regular expression. If a fixed string serves the purpose, use it! Don't make things harder than they need to be.
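As a quick illustration, here are a few commands that combine several of the metacharacters described above (the file names are hypothetical):

grep '^From' mailbox
grep 'done$' notes.txt
grep 'b[aeiou]g' somefile

The first command prints every line of mailbox that begins with "From", the second prints every line of notes.txt that ends with "done", and the third prints every line of somefile containing "bag," "beg," "big," "bog," or "bug."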
 Using grep to Search Text Files
The grep utility in UNIX is a powerful search tool that can examine the contents of files (or standard input) for a regular expression and print the line(s) in which the expression occurs. The basic syntax for grep is:
grep [-flag(s)] 'regexp' [filename1] [filename2...]
If no files are specified on the command line, grep will search standard input for the expression. Since many of the metacharacters used for pattern-matching (*, ?, etc.) are recognized as special characters by the shell, it is a good practice to enclose the regular expression in single quotes. The quotes force the shell to interpret the regular expression literally.
Let's look at a couple of examples of the grep utility. Say you have a file, smilies.txt, and you want to search the file for occurrences of the word "basic." You would construct the command like this:
grep 'basic' smilies.txt
From time to time, you may want to search for a regular expression in UNIX standard input rather than a file. For instance, you may want to search all of the active processes on a machine for the number of pine email sessions currently running. To do so, you can use the ps command to list the current processes and pipe the output to the grep utility like so:
ps -ef | grep 'pine'
Note: Both of the above uses of grep will print the entire line containing the regular expression, not just the expression itself.
Flags for grep
There are a number of useful flags that can be used with grep to modify the output that it returns. Here are a few of the more commonly used flags:
  Flag     Effect                                                            
   -c      print only the count of matching lines                            
   -h      print matched lines but not filenames                             
   -i      ignore uppercase and lowercase distinctions                       
   -l      print only the names of files containing matching lines when      
           searching multiple files                                          
   -n      precede each line with its line number within file                
   -v      print all lines not containing pattern                            
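
To see how these flags behave, consider a few hypothetical variations on the smilies.txt search from above:

grep -in 'basic' smilies.txt
grep -c 'basic' smilies.txt
grep -l 'basic' *.txt

The first command prints each matching line preceded by its line number, ignoring case (so "basic," "Basic," and "BASIC" all match). The second prints only the number of matching lines. The third searches every .txt file in the current directory and prints only the names of the files that contain a match.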

 Line Editing Files with sed and tr
The sed program in UNIX is a command-line text editor. It differs from other UNIX text editors like pico or vi because it is a non-interactive editor. This essentially means that the contents of a file can be edited without making any changes to the original file. This is possible because sed is a stream-oriented editor. Sed operates by executing a script on a stream of data (typically the contents of a file) as it passes through the program. By default, the program's output goes directly to the screen but, should you want to save your changes, the output can also be directed to a file. The sed program is typically used in one of two ways.
The first example below shows sed executing a command on the input specified by the file name and returning the edited output to the screen.
sed [-flags] 'command' file(s)
The second example shows sed executing a series of commands, as specified in the scriptfile, and redirecting the output to a new file.
sed [-flags] -f scriptfile file(s) > newfile
Let's look at some of the sed commands.
sed Commands
There are numerous editing commands available to the sed program. The following table provides a brief description of some of the more common ones. A more complete description of the sed editing commands is available from the sed manual pages.
 Command   Usage                                Action
    =      [/regexp/]=                          prints the line number of
                                                each line containing regexp
    a      [address]a\ text                     appends text following
                                                address.  The address value
                                                can be a line number, the $
                                                symbol (for the last line),
                                                or a regular expression
                                                enclosed in slashes
                                                (/pattern/)
    i      [address]i\ text                     inserts text before address
    c      [address][,address2]c\ text          replaces the addressed
                                                block with text
    d      [address][,address2]d                deletes the addressed lines
    s      [address][,address2]s/regexp         substitutes regexp2 for
           /regexp2/[flags]                     regexp on the addressed
                                                lines

Some sed Examples
One of the most useful applications of the sed command is its ability to find one pattern and replace it with another. Let's say we have a file with multiple occurrences of the string "smilie" and we want to replace them with the string "smiley." We can use the following command to accomplish this task:
sed 's/smilie/smiley/g' smilies.txt > smileys.txt
Let's examine the section in single quotes first. The initial "s" tells sed to perform the substitute command. The first expression between the slashes is the pattern to be replaced. The second expression between the slashes is the replacement text. The remaining information on the line tells sed which file to edit (smilies.txt) and redirects (>) the output to a file called smileys.txt.
Controlling what gets substituted
By default, the s editing command replaces only the first occurrence of the specified regular expression on each line. Additional flags can be appended to tell sed what to substitute. In the above example, the trailing g inside the single quotes tells sed to perform the substitution for every occurrence of the pattern on each line. A number n (any value from 1 through 512) can also be used in place of the g to have sed substitute only the nth occurrence of the pattern on each line.
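For example, the following command (a hypothetical variation on the substitution above) replaces only the second occurrence of "smilie" on each line, leaving the first occurrence on each line untouched:

sed 's/smilie/smiley/2' smilies.txt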
Using a scriptfile with sed
Normally the sed command expects to see a single edit command specified on the command line (as above). If you wish to perform a series of edits on a particular stream of data, you can automate the series of edits by creating a script. A sed script is a text file with a sed editing command listed on each line of the file. If you specify the -f flag, followed by the name of the scriptfile, sed will edit the file one line at a time, executing each command in the scriptfile where necessary. Suppose you want to delete the first fifteen lines of a file and print the phrase "first paragraph deleted" on the sixteenth line. To do so you could create a scriptfile (we'll call our file script) with the following information stored in it:
1,15d
16i\
first paragraph deleted

Then issue the following command:
sed -f script smilies.txt > newsmilies.txt
This command will take the input file, smilies.txt, and perform the set of commands listed in the scriptfile called script. The output is then saved in a file named newsmilies.txt.
Stream-editing with tr
There is another stream-editing command besides sed called tr. This command copies standard input to standard output, translating or deleting selected characters along the way.
tr Syntax and Flags
The basic syntax for the tr command is as follows:
tr [-flag(s)] [string1 [string2]]
It is important to remember, however, that tr reads from stdin and writes to stdout. Thus, if you want to use it to manipulate files you will need to use pipes or redirection. There are not many flags for tr; the two important ones are -d, which deletes the characters in string1 from the output, and -s, which "squeezes" each run of repeated characters down to a single character. Consider some examples.
Examples of tr Usage
One common use of tr is to change case, i.e. make all uppercase characters into lowercase or vice versa. One way to change uppercase to lowercase in a file would be as follows:
cat filename | tr '[A-Z]' '[a-z]'
To save the output you would of course need to redirect from stdout. Another useful example is simply deleting all the occurrences of a given character from a file. Suppose you want to delete all the quote characters from a passage in a file called "excerpt" and save the output to "excerpt.revised". You would use the following:
tr -d \" < excerpt > excerpt.revised
Using tr with od
One utility that works well with tr is od (octal dump). With od, you can display the octal value of any character in a given file, including nonprinting characters. This information can then be used with tr. For example, the following file (we'll call it double) is formatted incorrectly:
This is the first line

This is the second line

With od, we can see that the problem is that there are consecutive line feeds (\n \n). The following od command gives us the octal values for these lines:
od -bc double
0000000  124 150 151 163 040 151 163 040 164 150 145 040 146 151 162 163
           T   h   i   s        i   s      t   h   e       f   i   r  s

0000020  164 040 154 151 156 145 012 012 124 150 151 163 040 151 163 040
           t       l   i   n   e  \n  \n   T   h   i   s       i   s

0000040  164 150 145 040 163 145 143 157 156 144 040 154 151 156 145 012
           t   h   e       s   e   c   o   n   d       l   i   n   e \n

0000060  012 124 150 151 163 040 151 163 040 164 150 145 040 164 150 151
          \n
The flags used in the previous command are two of the more common flags used with the od command. The -b flag tells the od command to display the input bytes in octal format and the -c flag tells the command to display the bytes in ASCII. The two character formats can then be compared to determine the ASCII character and its corresponding octal value. With this information, the following tr command can be used to reformat the text:
tr -s "\012" < double
This command will "squeeze" any repeated appearances of the line break (octal value 012) down to a single one. When -s is used with only one string, tr performs no translation; it simply collapses runs of the listed character. The output looks like this:
This is the first line
This is the second line
 Parsing Files With awk
Awk is a pattern-matching program designed for parsing and manipulating files, especially when the files are databases. There are multiple versions of awk available: the original awk (awk), a new version of awk (nawk) with some added functionality, and the GNU version (gawk), which is essentially the same as nawk. Awk allows you to produce formatted reports from databases, use variables to modify the database, perform arithmetic and string operations, and more.
Awk reads input files, one line at a time, by dividing lines into a series of separate fields. By default, awk defines a field as a sequence of characters that does not contain a space or a tab. Different field separators can be specified by including the -Fc flag where the character c denotes the field separator. Thus, a line with fields separated by colons (:) can be parsed in awk using the -F: flag. Once fields have been defined for a line of input, awk identifies the fields by assigning variable names to them. The first field in a line of input is called $1, the second $2, and so on. The entire line is named $0. The number of fields can vary line by line depending on the data.
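For example, the system file /etc/passwd uses colons to separate its fields, with the username in the first field. A simple awk command to print every username might look like this:

awk -F: '{print $1}' /etc/passwd

Here the -F: flag tells awk to split each line on colons, and the print statement displays the first field ($1) of each line.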
Syntax for awk
The awk utility is similar to sed in that it can be invoked in one of two ways. The basic syntax for all versions of awk is as follows:
awk [-flag(s)] 'script' var=value file(s)
awk [-flag(s)] -f scriptfile var=value file(s)
Awk, just as with sed, can be executed by specifying a command directly on the command line or by specifying (with the -f flag) a set of commands in a particular script file.
Some awk Examples
It isn't possible to give detailed descriptions of the awk variables and commands here because both sets are too large to cover adequately in this document. Instead, a few examples of data manipulation with awk will be provided to demonstrate its functionality. Look at the manual pages for additional information on the different awk utilities.
Let's say you want to find out which of the files in the current directory is larger than 50000 bytes and, once this has been determined, you want the owner name, file name, and file size (with the word "bytes" appended) printed to the screen.
ls -al | awk '$5 > 50000 {print $3, $9, $5, "bytes"}'
Here we're listing all of the files in the current directory and piping the output to awk. Awk first looks at the fifth field in each line of input to determine if the value is larger than 50000. If so, awk prints the value in the third field, the value in the ninth field, the value in the fifth field, and the word "bytes" to the screen.
Another example involves validating the values or format of a data set. For instance, let's say that you have a data set (let's call it "data") that should have six fields of data per line and the value in the third field should never be equal to or below zero. You can create an error reporting script (let's call it error.script) that has these commands:
NF != 6 {print NR, "Number of fields is not equal to 6"}
$3 <= 0 {print NR, "Invalid value in field 3"}
The first of these lines in the script counts the number of fields per line, returning the line number (NR is the awk system variable for record number) and an error message if the value is not equal to 6. The second line looks at the value in the third field and returns the line number and an error message if the value is equal to or less than 0. If there are no errors, awk returns no output. The command should read the awk procedures from the error.script file and apply them to the data file. It should look like this:
awk -f error.script data
Awk can also be used to compute values and more precisely control the appearance of the output. Let's say that we have a class roll with the student name appearing in the first field ($1), and the scores from two tests in the next two fields ($2 and $3). Awk can add the scores in each column and compute the class average. In this example, the printf command will be used to format awk's output, displaying each average with two digits after the decimal point. The command should look like this:
awk -f average school
where "school" is the name of the data set and "average" is the name of the script file with the following commands:
{ test1=test1 + $2 }
{ test2=test2 + $3 }
END { printf("the class average for test1 is %3.2f\n", test1/NR)
printf("the class average for test2 is %3.2f\n", test2/NR) }
The first two bracketed lines tell awk to add up the values in the second and third fields and store them in variables named "test1" and "test2" respectively. The END pattern tells awk to run the procedures that follow only after the last line of input has been read. The two print lines display text with a placeholder (%3.2f) telling awk how to format the number that will appear there. Both of the placeholders are followed by the newline character (\n) to display the print statements on separate lines. Finally, we see the two variables divided by the NR system variable, which holds the total number of lines read.
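To make this concrete, suppose the school file contained these three hypothetical lines:

Alice 80 90
Bob 90 70
Carol 100 80

The test1 scores sum to 270 and the test2 scores sum to 240, and NR is 3 once the last line has been read, so running awk -f average school would print:

the class average for test1 is 90.00
the class average for test2 is 80.00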
Again, these are just a few of the many operations that can be performed by awk. To learn more, read the manual pages for awk or consult a general UNIX utilities resource for additional uses.
 Comparing Files with diff
The diff utility compares two files and reports differences that exist between the two files. Diff prints out differing pairs of lines in the two files and provides codes to identify the changes necessary to make the lines identical.
Basic Syntax for diff
The diff utility adheres to the following syntax:
diff [-flag(s)] file1 file2
When discrepancies are found, diff prints the lines from each file, flagging the file1 line with the < symbol and the file2 line with the > symbol. A good example of the diff utility in action can be seen in comparing the two files (smilies.txt and smileys.txt) we used with the sed command. These files are identical in all ways except for the use of "smiley" in one file and "smilie" in the other. Comparing these two files with diff should generate numerous occurrences of differing content. Here's an example of the output:
9c9
< :-) Your Basic smilie. This smilie is used to inflect a
---
> :-) Your Basic smiley. This smiley is used to inflect a
The first line from this sample of the output, 9c9, tells us that line 9 of file1 must be changed (c) to match line 9 of file2. Again, the line with the initial < symbol is from file1 and the one with the initial > symbol is from file2. The two lines are separated by a line with three dashes.
Normal flags recognized by diff include the following:
 Flag    Effect                                                           
  -b     ignore sequences of spaces (treat as one space) and spaces at    
         the end of lines                                                 
  -e     produces an ed script to convert file1  to file2                 
  -i     ignore upper/lowercase distinctions                              
  -w     similar to -b; ignores all space and tab characters              

Typically, the diff utility is used to compare files but it can also be used to compare directories or files and directories. To do so, one would simply need to supply the names of directories in place of file names. Two useful flags when comparing directories are -r to run diff recursively, comparing files in any common subdirectories, and -s to report files that are identical in the two directories. Finally, if you give diff the name of a directory and the name of a file as the two arguments to be compared, diff will look in the directory for a file that corresponds with the other argument. Thus, executing the command diff newdir file is the same as diff newdir/file file.
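For example, to compare two hypothetical directories named olddir and newdir, descending into any common subdirectories and reporting identical files as well as differing ones, you could combine these flags:

diff -rs olddir newdir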
 Cutting and Pasting in UNIX
Users with experience on microcomputers are probably familiar with the concepts of cutting and pasting. Under UNIX, however, cut and paste are column-oriented file utilities, and they work differently from the clipboard operations found in the Windows and MacOS operating systems.
The cut Command
The cut command allows a user to extract columns from one or more files. The columns to be extracted can be specified as either character-width columns or as delimited fields. The basic syntax for cut is this:
cut -ccolumnlist | -ffieldlist [flag(s)] file1 [file2...]
The columnlist and fieldlist specifications (only one of the two may be used) are lists of integers specifying which columns or fields to extract. Lists of nonconsecutive numbers must be separated by commas, and sequences can be represented by a hyphen. In addition, cut recognizes two other flags. To specify a field delimiter other than the default tab, use the -d flag followed by the delimiter character. For example, if the fields in the input line were delimited by colons, you would want to use the cut utility with the -d: flag. If there are lines which do not contain the delimiter (whether user-specified or the default), you can suppress their output by using the -s flag.
Let's assume that we have a database file, named "phone_list," with three fields delimited by colons rather than tabs. The three fields contain a name, a city, and a phone number. Each line in the database looks like this:
Louis Prima:San Antonio:514.8814
To cut the name and phone number fields from this file and redirect the output into a new file, try this:
cut -d: -f1,3 phone_list > newlist
A line in the new file would look like this:
Louis Prima:514.8814
Note: If the fields in the previous example were delimited by spaces rather than colons, single quotes around a blank space would have to be used with the -d flag, like this: cut -d' ' -f1,3 phone_list > newlist. Without the single quotes, the cut utility does not recognize the space as a field delimiter.
The Paste Command
The paste command is used to merge files into columns. For example, if you had two files, one a list of names and the other a list of addresses, both in the same order, you could use the paste command to create a third file containing two columns. The default character used to separate the columns is a tab.
paste Syntax
The basic syntax for paste is as follows:
paste [flag(s)] file [file2...]
Output is sent to standard output, so saving will require redirection. Each file named on the command line becomes a column in the output. The hyphen (-) can also be used as a filename to denote the standard input. paste recognizes the following flags:
 Flag    Effect
  -d     using -d'characters' will separate columns with the given
         character instead of a tab.  If more than one character is
         specified, the first character will be used between the first
         and second columns, the second character between the second
         and third columns, and so on.  Use the escape sequences \n
         for a newline and \t for a tab
  -s     pastes the lines of each file serially, merging subsequent
         lines from the same file into a single output line
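
For example, given two hypothetical files, names and addresses, each containing one entry per line in the same order, the first command below merges them into two tab-separated columns, while the second separates the columns with a colon instead:

paste names addresses > merged
paste -d: names addresses > merged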

 Using join to Merge Files
Another useful utility for merging files is join. This command lets a user take two files and merge records. This assumes that both files contain records composed of delimited fields and that both files are sorted by the first field. By default, join examines lines and compares the first field. When a match is found, join prints out the common field and the remainder of each record in the order in which the files were specified on the command line. Suppose you have two files called "numbers" and "cities". The first file contains:
Henry 877.1959
Jay 555.4435
Maria 555.2398
while the second file contains:
Henry Modesto
Jay Chicago
Maria Miami
The output of the command join numbers cities would be:
Henry 877.1959 Modesto
Jay 555.4435 Chicago
Maria 555.2398 Miami
Basic syntax for join
The syntax for join is:
join [flag(s)] file1 file2
The hyphen (-) can be used in place of file1 to read from standard input. The join command recognizes the following flags:
 Flag    Effect
 -a[n]   list any unpaired lines from file n (n should be either 1 or
         2), or from both files if n is not specified
 -es     replaces empty fields in the output with string s (e.g.
         "N/A" or "unknown")
 -j[n]m  use field m for matching lines.  If n (1 or 2) is specified,
         use field m of file n
 -olist  format output lines according to list.  Each item in list is
         of the form n.m, where n is 1 (for file1) or 2 (for file2)
         and m is the number of a field in file n.  The common field
         is suppressed unless specified in list
 -tc     set the default input and output field separator to the
         character c.  The default is whitespace (as in awk)

Examples of join usage
Using the two files we listed earlier, suppose we want to have the phone number appear first, followed by the city and then the name. We could use the -o option to specify the field order in the output as follows:
join -o 1.2 2.2 1.1 numbers cities
which gives us this output:
877.1959 Modesto Henry
555.4435 Chicago Jay
555.2398 Miami Maria
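The -a flag controls what happens to unpaired lines. If the numbers file contained an extra hypothetical entry, say "Zoe 555.1234", with no matching line in cities, the command

join -a1 numbers cities

would print the three paired lines as before, followed by the unpaired line "Zoe 555.1234" on its own.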
 Controlling Flow with Loops
The UNIX computing environment also supports the creation and use of shell scripts to automate repetitive tasks and to simplify the use of frequently used commands. A shell script is simply a text file containing commands that are interpreted by the shell.
Shell Interpreters
The shells in UNIX are similar, but each has its own specific language. This document focuses on the ksh, even though our default shell is tcsh. Most UNIX users agree that Bourne shell derivatives (sh, ksh, bash) are superior for scripting. Fortunately, this is not a problem: you can simply invoke ksh (or any other shell) in the first line of your shell script. To do so, specify the name of the shell and its path, preceded by the #! symbols. For instance, to write a shell script using the ksh interpreter, use the following syntax in the script's first line:
#!/usr/bin/ksh
To find out the location of the shells available on a UNIX system, use the command "cat /etc/shells" (which prints the contents of the file /etc/shells to stdout). Hint: most shell interpreters are located in the /usr/bin/ directory.
Running a shell script
Before you can run the script you created in your favorite UNIX text editor (pico, emacs, vi, etc.) you need to give the file executable permissions. To do so, you can use the chmod command in this fashion:
chmod +x filename
Once you have given the script the appropriate permissions, you need to decide how you want to run the script. You can run the script by simply typing the name of the script at the command line. This, of course, assumes that the script is located in a directory that is part of your command search path (i.e., your PATH variable). If not, the shell will return an error stating that it could not find the specified command (your script). To avoid this error, specify the absolute path to the script at the command line or, if the script is in your current directory, you can type ./scriptname to tell the shell to find your script in the current directory. With this general introduction to the creation of shell scripts, let's look at a couple of examples to better understand their use.
Using the "for" loop
The for loop is one of the simplest loops used in shell scripting. It allows you to repeat a series of commands for each item in a list, such as a set of files or fields. It's distinct, however, from for loops in some other programming languages because you don't specify how many times to iterate through the loop; the number of iterations is determined by the length of the list. Other constructs, such as while or if, are needed when you want to control the number of iterations directly.
Basic syntax
For loops in the ksh follow this basic syntax:
for name [in list]
do
statements to apply to $name
done
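Before moving to a larger example, here is a minimal sketch of this syntax in action, printing the name of every .txt file in the current directory:

for file in *.txt
do
echo $file
done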
An example of a "for" loop
Let's suppose we have a database file that we use to store all of the entries in our addressbook. For the sake of this example, let's say that each line in the database file contains a person's name, phone number, and email address, all separated by colons. The addressbook file looks like this:
Angus Young:933-1865:hank@hostname.usask.ca
Brian May:942-9823:maggie@somewhere.usask.ca
Eric Clapton:967-2257:skip@onemore.usask.ca
Now we want to write a shell script that will mail a message to each email address that we have in that file. The script needs to cycle through that file, strip out the address, and create a mail message for each address it finds. To do so, we can create a script like this:
#!/usr/bin/ksh

for user in $(cut -d: -f3 filename)
do
echo $user
mail $user < message
done
Now let's break this script down line by line to better understand it.
The first line establishes the interpreter to use for the script: the ksh. This declaration should list the absolute path to the shell interpreter, preceded by the "#!" characters.
The second line does a number of things. The first word, "for", initiates the loop. The next word, "user", is the variable name that will be used later in the loop's actions. The remainder of the line, "in $(cut -d: -f3 filename)", declares the list of items to be acted on by the loop. We use the cut utility, described earlier in this document, to retrieve the email addresses from the file and omit the extraneous information.
The third line of this shell script (the word "do") initiates the actions to be performed on our list of addresses.
The fourth line tells the shell to print each of the email addresses to the screen. The shell will do this because the "$user" notation stands for each instance of the variable in the list. The only reason to print the email addresses to the screen is to show you that the shell script is working its way through the loop.
The fifth line shows where the mail messages are actually created. Here, the shell invokes the mail command, creating a mail message for each address stored in $user and redirecting the contents of the file named message to serve as the body of the email.
The final line of the script (the word "done") signals that the actions of the loop are finished. Once the loop has iterated through the list (all of the instances of $user), the script is finished and you will see a UNIX command line prompt again.
Making the script more portable
We can make the above script more portable by replacing the filenames with command-line variables. Currently, this little mailer script that we've written will only use the files that are named in the script. We can add the ability to specify these file names at the command line by including positional parameters in the script. Positional parameters are denoted by the syntax $1, $2, $3, and so on. The shell assigns the first positional parameter to the first argument listed on the command line. The $0 positional parameter refers to the name used to invoke the script (usually this will be the name of the script itself). To add this functionality, two subtle changes need to be made to the script:
#!/usr/bin/ksh

for user in $(cut -d: -f3 $1)
do
echo $user
mail $user < $2
done
We've replaced the name of the addressbook file and the message with the $1 and $2 positional parameters, respectively. To execute this script, we would type the following at the command-line:
scriptname filename1 filename2
where scriptname is the name of the shell script, filename1 is the name of the addressbook file, and filename2 is the name of the message to be redirected into the mail message.
 A Final Note
Many of the commands described in this handout are documented more fully in the online manual pages. Use the man command to get more complete descriptions of them, especially for commands like sed and awk. In addition, there are several excellent books available which cover UNIX command syntax and usage.