The AWK Programming Language: An Introduction

A hands-on tutorial for learning AWK on Linux

Introduction

The AWK utility is known for running programs that perform text processing and data extraction, and is often pre-installed on Linux and UNIX based operating systems. This article adopts a hands-on approach to introduce the AWK programming language. Many sample programs are provided as new AWK concepts are discussed throughout. After completing this article, you will have a good understanding of how to write AWK programs to process, extract and report data on your system.

The first version of the AWK language was created at Bell Labs in 1977. The “AWK” acronym is derived from the surnames of its creators — Alfred Aho, Peter Weinberger, and Brian Kernighan — and shares the same pronunciation as the auk bird. The lowercase version, “awk”, refers to the program that runs scripts written in the AWK programming language. This type of program is known as an interpreter.

Since its initial release, AWK has been forked and developed further by multiple groups of people. As a result, several versions of AWK exist today, each with its own implementation. The programs presented in this article are written for the GNU Awk implementation, also known as Gawk.

Gawk is installed via the gawk package using your distribution’s package manager. It will probably be installed on your Linux distribution already and linked to the awk command. In other words, the awk and gawk commands will invoke the same program.

Use your distribution’s package manager to install Gawk if it’s not already on your system:

$ sudo pacman -S gawk       # Arch
$ sudo apt install gawk # Debian and Ubuntu
$ sudo dnf install gawk # Fedora and CentOS

We can also confirm the install location and the fact that awk is a symbolic link that points to the gawk binary:

# Output gawk location
$ which gawk
-| /usr/bin/gawk
# Output awk symbolic link location
$ which awk
-| /usr/bin/awk
# Confirm awk is a symbolic link
$ ls -l $(which awk)
-| lrwxrwxrwx 1 root root 4 Apr 15 2020 /usr/bin/awk -> gawk

Lastly, run awk with some basic command-line options to get acquainted with the tool:

# Output version (--version)
$ awk -V
-| GNU Awk 5.1.0 ...
# Output license (--copyright)
$ awk -C
-| Copyright (C) 1989, 1991-2020 Free Software Foundation. ...
# Output usage (--help)
$ awk -h
-| Usage: gawk [POSIX or GNU style options] -f progfile ...

Now that you have Gawk installed, let’s proceed to the next section, which discusses how to run AWK programs!

Running AWK Programs

An AWK program is data-driven, which means that it performs a set of actions on the textual data it receives. There are two methods for supplying data to a program: specifying a list of filenames, and piping output from another command. These approaches can be observed with some pseudo code:

# Providing input files
awk 'program' input-file1 input-file2
# Piping the output of a command
command | awk 'program'

In addition, the program portion can be either:

  • A one-shot throwaway program that is quick to type and specified on the command line.
  • A script defining a sophisticated program that performs more complex data processing.

The best approach to take will depend on your program’s overall complexity and whether you want to re-use it in the future. The following pseudo code illustrates how to provide an AWK program:

# Command line
awk 'program' input-file1 input-file2 ...
# Program file
awk -f program-file input-file1 input-file2 ...

Notice that a program specified directly on the command line is surrounded in single quotes. Alternatively, you can provide a program file with the -f flag.

It is also a good idea to append the .awk file extension to your scripts for a couple of reasons. It enables editors to render syntax highlighting for the AWK programming language correctly, and it immediately conveys the nature of the file when it's listed in a file manager or terminal.
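As a minimal sketch of the program-file approach, the snippet below writes a one-rule program to a file and runs it with the -f flag. The count.awk filename and the sample input are made up for illustration and are not part of the article's repository:

```shell
# Write a tiny AWK program to a file (count.awk is a hypothetical name)
cat > count.awk <<'EOF'
# Report how many input records were read
END { print "Total lines:", NR }
EOF

# Run the program file against some piped input
printf 'one\ntwo\nthree\n' | awk -f count.awk
# -| Total lines: 3

# Remove the example file again
rm count.awk
```

The same program could be typed directly between single quotes; moving it to a file simply makes it easier to edit and re-use.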

Here are some quick one-shot example programs invoked directly on the command line. Feel free to run each one in your terminal, and don’t worry if the syntax makes no sense at this point:

# Output total number of fonts available on the system
$ fc-list | awk 'END { print "Total fonts:", NR }'
# Output all (hidden) files starting with ".bash"
$ ls -la | awk '$NF ~ /^\.bash/ { print $NF }'
# Output all filenames that contain more than 10 characters
# (minus file extension)
$ ls | awk '{ i = index($0, "."); s = (i > 0 ? substr($0, 1, i-1) : $0);
if (length(s) > 10) print }'

Throw-away programs can also process text files:

# Create a text file listing system fonts
$ fc-list > font-list
# Provide the text file and output total number of fonts
$ awk 'END { print "Total fonts:", NR}' font-list

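Programs specified on the command line can span more than one line. In Bash and other POSIX shells, newlines inside single quotes are passed to awk unchanged, so each rule can simply start on its own line. A trailing backslash is only needed if you break a line in the middle of a statement:

$ ls -l | awk 'BEGIN { print "About to count files..." }
NR > 1 { ++count }
END { print "Total files:", count }'

The NR > 1 pattern skips the initial total line that ls -l prints, so only the file entries are counted.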

These types of programs are great for getting information quickly and for experimentation. However, writing programs in text files means there is no limit to the complexity or length of your program.

We will use a script provided in this article’s corresponding GitHub repository to discuss the concepts in this section. You can download the GitHub repository containing this article’s code examples using git on the command line:

$ git clone https://github.com/danebulat/awk-samples.git

The program discussed in this section is located at samples/filemod.awk. The script renames files in the current working directory by converting all the characters to either lowercase or uppercase. When no command-line options are specified, all filenames are converted to lowercase:

# Enter samples directory
$ cd awk-samples/samples
# Add executable permissions to the scripts
$ chmod +x *.awk
# Invoke script
$ ls -la | ./filemod.awk

After piping ls -la to ./filemod.awk you should see some formatted output informing you of any modifications to the filenames:

Running the samples/filemod.awk program.

Let’s discuss the commands that were necessary to run the script:

  • chmod +x *.awk
    In order to use the ./program.awk shorthand syntax, the scripts must have executable permissions.
  • ls -la | ./filemod.awk
    The program receives output from ls -la and disregards lines representing hidden files because dotfiles should not be renamed. The initial total line is also skipped.

If you want to invoke programs using the ./program.awk syntax, a shebang character sequence must also be specified at the top of a script pointing to your AWK binary, followed by the -f flag:

#! /usr/bin/awk -f

Alternatively, the longer invocation method is to specify the awk utility and pass a script’s filename to the -f flag explicitly:

$ ls -la | awk -f filemod.awk

Running a script this way doesn’t require it to have executable permissions, nor does the shebang character sequence have to be specified at the top of the file. However, I would recommend using the shorthand syntax because less typing is required.

Our program also accepts a variable assignment on the command line which instructs it to perform either lowercase or uppercase conversion:

$ ls -la | ./filemod.awk -v c=l    # lowercase conversion
$ ls -la | ./filemod.awk -v c=u # uppercase conversion

Variables passed to a program with the -v option are assigned before the program starts running, so they are already available in the BEGIN block. If a value other than l or u is assigned to the c variable, or we don't assign any variables, the program converts to lowercase by default.
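To see the -v option in isolation, here is a small sketch that is independent of filemod.awk. The one-line case switch is a hypothetical stand-in for the script's logic, not a copy of it:

```shell
# c is assigned before the program starts, so even a BEGIN rule can read it
awk -v c=u 'BEGIN { print "c is", c }'
# -| c is u

# A hypothetical case switch driven by c: anything other than "u" means lowercase
printf 'Hello World\n' | awk -v c=u '{ print (c == "u" ? toupper($0) : tolower($0)) }'
# -| HELLO WORLD

# With no assignment at all, c is empty and the lowercase branch is taken
printf 'Hello World\n' | awk '{ print (c == "u" ? toupper($0) : tolower($0)) }'
# -| hello world
```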

Patterns and Actions

An AWK program is made up of rules, and rules are made up of a pattern with associated actions. You therefore build an AWK program by defining a list of rules that have the following structure:

# AWK program
pattern { actions } # Rule
pattern { actions } # Rule

Each rule can specify one pattern in addition to a set of actions. The following subsections will discuss this workflow in more detail.

A rule doesn’t have to include any actions if a pattern is defined. Likewise, it doesn’t need to define a pattern if an action is defined. However, one of them is needed for the program to do some processing on the data it receives.
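The short sketch below, using made-up input, shows two rules in one program: one with a pattern and an action, and one with an action only. Note that a single record can trigger more than one rule:

```shell
printf 'alpha\nbeta\n' | awk '
/^a/ { print "starts with a:", $0 }   # pattern and action
     { print "seen:", $0 }            # action only: runs for every record
'
# -| starts with a: alpha
# -| seen: alpha
# -| seen: beta
```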

A pattern is specified before the action code, and can be either a conditional expression or regular expression:

# Regular expression examples
$ ls -l | awk '/(\.awk)$/'
$ ls -l | awk '/README/'
# Conditional expression examples
$ ls -l | awk '$NF == "README.md"'
$ ls -la | awk '$NF ~ /^\./ { print $NF }'

Let’s take a look at each pattern in the examples above:

  • /(\.awk)$/
    Forward slashes are used to denote regular expressions in AWK. Within them, you can input standard regular expression characters to define your pattern. In this example, all files with the .awk extension are matched.
  • /README/
    No special characters are used within this pattern, which means if an input record contains the text README, it will be matched and the corresponding action block will execute if one is defined.
  • $NF == "README.md"
    A pattern can also be a conditional expression using the same operators found in C and C++. For example, this pattern checks the value of $NF, which references the last field in the current input record (NF is a predefined AWK variable that holds the number of fields). If $NF contains the string README.md, the pattern matches and the associated action block is executed.
  • $NF ~ /^\./
    The regular expression /^\./ matches on strings that begin with a period character, representing a hidden file. The ~ operator performs a comparison on the string inside the $NF variable with the regular expression. Note that the !~ operator is also available.

A pattern needs to evaluate to “true” in order for the action block to run. AWK understands truthiness in the following way:

  • True is defined as any nonzero numeric value, or any nonempty string.
    E.g. "Hello, world!", 10, "Three" and "0" are all true (the string constant "0" is a nonempty string).
  • False is defined as zero or the null string.
    I.e. "" and 0 are false.
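These rules can be checked with a few if statements in a BEGIN block. This is just a quick sketch using constants; values read from input can behave differently when they look like numbers:

```shell
awk 'BEGIN {
    if (10)   print "10 is true"
    if ("0")  print "the string \"0\" is true"
    if (!0)   print "0 is false"
    if (!"")  print "the empty string is false"
}'
# -| 10 is true
# -| the string "0" is true
# -| 0 is false
# -| the empty string is false
```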

With the understanding that a rule is executed when its pattern matches the current input record, let’s proceed to discussing actions next.

AWK will always run the action block if a pattern is not provided. If a pattern is provided, it needs to return a true value in order for the action block to run.

An action is always defined inside curly braces {...} and consists of one or more statements. Either a semi-colon (;) or a newline separates statements in AWK. A one-shot program will use semi-colons to separate statements:

$ ls -l | awk '{ print "Filename:", $NF; print "\tOwner:", \
($3 == "" ? "--" : $3) }'

However, a script will likely separate statements on separate lines to properly format the code, making it more readable:

The samples/statements.awk script.

In the script above, the print function is called to format and send data to the standard output. This program is saved as samples/statements.awk in the source files and can be invoked on the command-line:

$ ls -l | ./statements.awk

In the event that an action is not specified for a rule, the whole input record is sent to standard output where the following action code is run implicitly:

{ print $0 }

The $0 here is another predefined variable which stores the string of the current input record. If you are wondering why you are seeing output after not defining any actions in your AWK programs, this is what's going on behind the scenes.
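A quick way to convince yourself of this is to run the same pattern three ways on some made-up input. All three commands in this sketch print the same matching record:

```shell
# No action: the matching record is printed implicitly
printf 'apple\nbanana\ncherry\n' | awk '/an/'
# -| banana

# print with no items outputs $0
printf 'apple\nbanana\ncherry\n' | awk '/an/ { print }'
# -| banana

# The fully explicit form
printf 'apple\nbanana\ncherry\n' | awk '/an/ { print $0 }'
# -| banana
```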

Regarding the samples/filemod.awk program, three rules are defined where two of them use special patterns:

  • BEGIN { ... }
    This rule uses the special BEGIN pattern. It causes the rule to execute exactly once when the program starts, before the first input record is read.
  • $NF !~ /^\./ && $1 !~ /^(total)/ { ... }
    This rule processes the input record provided it doesn't represent a hidden file or begin with the string total. Two regular expression comparisons are performed with the !~ operator to accomplish this.
  • END { ... }
    This rule uses the special END pattern. It causes the rule to execute exactly once after all input is read and before AWK exits.

Let’s take a closer look at the BEGIN and END patterns in the next section.

The BEGIN and END patterns are useful for providing startup and cleanup actions for your program. For example, a BEGIN block could print headings if a program renders table data, and an END block could output summary information. The samples/begin.awk program demonstrates the BEGIN and END patterns:

The samples/begin.awk script.

This script renders Filename and Permission column headings inside the rule containing the BEGIN pattern. Notice that the printf function is used for making the first column a minimum of twenty characters wide.

Additionally, the rule containing the END pattern outputs the value of the NR predefined variable, which stores the number of input records that have been processed so far. Feel free to run the script in the following way and observe its output:

$ ls -l | ./begin.awk
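If you don't have the repository to hand, the following stand-alone sketch shows the same idea on made-up numeric input. It is not the begin.awk script itself, just a small illustration of startup and summary rules:

```shell
printf '10\n20\n30\n' | awk '
BEGIN { print "Summing values..." }                     # runs once, before any input
      { total += $1 }                                   # runs for every record
END   { print "Records:", NR; print "Total:", total }   # runs once, after all input
'
# -| Summing values...
# -| Records: 3
# -| Total: 60
```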

Input Records and Fields

AWK processes data in units called records. Each record is separated by the newline character (\n) — meaning a record is one line of input by default. Additionally, AWK stores the current input record in a predefined variable called $0, which can be referenced in your programs:

# Output input records
$ ls -l | awk '{ print "Current record:", $0 }'
# Same behavior with text files
$ ls -l > file-list
$ awk '{ print "Current record:", $0 }' file-list

The character used for separating records is stored in a predefined variable called the record separator denoted by RS in your programs. This variable can be modified in situations where input needs to be separated differently:

# Treat each directory as an input record
$ pwd | awk 'BEGIN { RS="/" } { print "-->", $0 }'

Notice that the record separator character is not actually included in the input record. Another three useful predefined variables are available with regard to record processing:

  • FILENAME
    Contains the name of the current input file.
  • FNR (File Number of Records)
    How many records have been read from the current input file.
  • NR (Number of Records)
    How many records have been read in total.

The samples/records.awk program uses these variables to output the first ten records within the text files passed to the program:

The samples/records.awk script.

The program checks if FNR contains the value 10 after outputting the current record. If it does, a message is displayed and the nextfile statement is executed to move on to the next input file. To invoke the program, simply pass some input files on your system such as /etc/passwd and /etc/group:

$ ./records.awk /etc/passwd /etc/group
-| Outputs first 10 records in /etc/passwd and /etc/group
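The following is a scaled-down sketch of the same technique rather than the records.awk script itself. It assumes two small throwaway files (first.txt and second.txt are arbitrary names) and stops after two records per file instead of ten:

```shell
# Create two small example files
printf 'a1\na2\na3\n' > first.txt
printf 'b1\nb2\nb3\n' > second.txt

# Print FILENAME, FNR, NR and the record, then skip ahead after two records
awk '{ print FILENAME, FNR, NR, $0 }
FNR == 2 { nextfile }' first.txt second.txt
# -| first.txt 1 1 a1
# -| first.txt 2 2 a2
# -| second.txt 1 3 b1
# -| second.txt 2 4 b2

# Clean up the example files
rm first.txt second.txt
```

Notice how FNR restarts at 1 for the second file while NR keeps counting the records that were actually read.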

Each record is automatically split into chunks called fields. Similar to records, fields are separated by the value of a predefined variable called the field separator denoted in your programs with FS.

The default value of FS is a single space character, which has a special meaning: fields are separated by runs of whitespace (space, TAB and newline characters), and any leading or trailing whitespace is ignored.
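A small sketch with deliberately messy, made-up input shows the default splitting at work. The brackets are only there to make the field boundaries visible:

```shell
# Leading spaces, a TAB between the words, and trailing spaces
printf '  one \t two  \n' | awk '{ print NF; print "[" $1 "]", "[" $2 "]" }'
# -| 2
# -| [one] [two]
```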

Field separation can be observed by outputting the value of NF, which is a predefined variable that stores the number of fields each record contains:

$ ls -l | awk '{ print "Number of fields in record:", NF }'
-| Number of fields in record: 2
Number of fields in record: 9
...

Individual fields can also be accessed with special variables in the form of $N where N is the field index. The first field in a record has an index of one, and is therefore accessed with $1. The second field is referenced with $2, and so on. The last field can also be accessed with the $NF variable, which is useful if you don't know the actual index value. Let's take a look at an example:

# Output filenames and modification dates
$ ls -l | awk '{ if (NR>1) print $NF, "\tmodified on", $6, $7 }'

The field separator will sometimes need modifying if data received by the program is formatted differently. For example, fields inside the /etc/passwd and /etc/group configuration files are separated with the colon (:) character. To access these fields correctly within an AWK program, the field separator needs to be set accordingly:

# Output system users and their home directory
$ awk 'BEGIN { FS=":" } { printf "%-30s%s\n", $1, $6 }' /etc/passwd

The printf function is used again here to format columns appropriately. As well as within the BEGIN block, FS can also be set on the command line through the -F option:

$ awk -F: '{ printf "%-30s%s\n", $1, $6 }' /etc/passwd

As with the record separator, it is possible to assign a regular expression to the field separator where fields are separated by occurrences of characters that match the regular expression. This is an advanced feature but mentioned here for completeness.
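For the curious, here is a minimal sketch of a regular expression field separator. The input line and the separator pattern (a comma or semi-colon followed by optional spaces) are invented for this example:

```shell
printf 'red, green;blue\n' | awk -F'[,;] *' '{ print NF; print $2 }'
# -| 3
# -| green
```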

The following two tables summarize the behavior of the RS (Record Separator) and FS (Field Separator) variables depending on their assigned value. Feel free to refer to them if you need a refresher.

Possible values for the RS variable.
Possible values for the FS variable.

Printing Output

One of the strengths of the AWK programming language is its ability to efficiently format data and produce meaningful reports. This functionality is provided by the print() and printf() functions in combination with a few predefined variables.

The print() function is used to output text with simple, standardized formatting. You will probably use this function if you just want to output some variables and strings in a linear way. The syntax is also very simple:

# Items separated with commas
print item1, item2, ...
print(item1, item2, ...)
# Items separated without commas
print item1 item2 item3
print(item1 item2 item3)

The syntax presented above illustrates a couple of important points:

  • You can provide optional parentheses when calling print. Parentheses are only required when an item uses the > relational operator, which would otherwise be interpreted as an output redirection.
  • Items can be separated with or without a comma character.
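The difference that parentheses make around the > character can be sketched as follows. The temporary file created with mktemp is only there so the redirected output has somewhere to go:

```shell
# Without parentheses, > is a redirection: "10" is written to the file named by f
tmp=$(mktemp)
awk -v f="$tmp" 'BEGIN { print 10 > f }'
cat "$tmp"
# -| 10
rm -f "$tmp"

# With parentheses, > is a comparison and its result (1 for true) is printed
awk 'BEGIN { print(10 > 5) }'
# -| 1
```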

While separating items with and without a comma are both syntactically correct, AWK will process your arguments differently depending on which method you choose. Let’s take a look at an example:

# With commas
$ awk 'BEGIN { print "Hello", "World" }'
-| Hello World
# Without commas
$ awk 'BEGIN { print "Hello" "World" }'
-| HelloWorld

From the output of these commands we can deduce that items are separated with a space character if a comma is used, and simply concatenated if a comma is not used.

As a matter of fact, you don’t even need to separate items if you just want to concatenate them:

$ awk 'BEGIN { print "Hello""World" }'

Although this works, I would recommend separating items with a space to make your code more readable!

The fact that a space is output between items when a comma is specified in print is no coincidence. The space character is actually the default value of a predefined variable called the output field separator denoted by OFS in your programs. This variable can be set to any string if you require something other than a space to separate your items:

# Setting the output field separator
$ awk 'BEGIN { OFS=" ---> "; print "Hello", "AWK", "World"}'
-| Hello ---> AWK ---> World

AWK also provides a predefined variable called the output record separator, which is denoted by ORS in your programs. The string stored in this variable is a newline character (\n) by default and is appended to everything a print call outputs. The ORS variable can be set to any string if you require something other than a newline character to separate print output:

# The default ORS is a newline character
$ awk 'BEGIN { print "Hello"; print "World" }'
-| Hello
World
# Setting ORS to a new string
$ awk 'BEGIN {ORS=" [END OF RECORD]\n";print "Hello";print "World"}'
-| Hello [END OF RECORD]
World [END OF RECORD]

The last predefined variable we will look at in this section is called the output format and is denoted by OFMT in your programs. It stores a string representing a format specifier that controls how numeric output is displayed with the print function. The default value is %.6g:

  • %g : Prints a number in either scientific notation or in floating-point notation - whichever uses the fewest characters.
  • .6 : A period followed by an integer representing the precision. Signifies the maximum number of significant digits when used with the %g format specifier.

The OFMT variable can be modified just like ORS and OFS:

# Default OFMT is "%.6g"
$ awk 'BEGIN { print 3.141592653, 6.283185306, 12.566370612 }'
-| 3.14159 6.28319 12.5664
# Setting OFMT to a new format
$ awk 'BEGIN { OFMT="%4.3f"; print 3.141592653, 6.283185306, 12.566370612 }'
-| 3.142 6.283 12.566

We set the output format to %4.3f which specifies that:

  • A minimum of four characters (including the decimal point) are output for every number.
  • Exactly three digits are output after the decimal point.

The recommended practice is to set OFS, ORS and OFMT in the BEGIN block before your input records start processing.

The printf function is a powerful tool that enables you to format output to match your exact requirements. For example, it is straightforward to output columns of certain widths, as well as to control a number’s precision.

The syntax of printf is similar to print, except the first argument is always a format string:

printf format, item1, item2, ...
printf(format, item1, item2, ...)

The format string is a specification of what is to be output by printf. Within it you include format specifiers as well as regular text. The subsequent items after the format string are what will be plugged into the format specifiers.
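As a quick illustration with made-up values, the format string below mixes three format specifiers with literal | characters, and each item after the format string fills one specifier in order:

```shell
awk 'BEGIN { printf "%-10s|%5.2f|%d\n", "pi", 3.14159, 42 }'
# -| pi        | 3.14|42
```

Here %-10s left-justifies the string in a ten-character column, %5.2f prints the number with two decimal places in a column at least five characters wide, and %d prints an integer.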

Let’s take a look at the samples/printf.awk example program in the GitHub repository:

The samples/print.awk script.

This program is responsible for displaying a list of files and their permissions in a readable way. Invoke it on the command-line by piping the output of an ls -l command:

Running the samples/print.awk program.

The printf function is used to format the output in this program.

  • format = "%-20s%s\n"
    A variable called format caches a format string so we can pass it to multiple invocations of printf. The format string itself contains two %s format specifiers. The first one includes a modifier, -20, which left-justifies the string within a minimum width of twenty characters.
  • printf format, "Filename", "Permissions"
    printf format, "--------", "-----------"
    These calls to printf are executed once in the BEGIN block to output column headings. They receive the format variable in addition to two regular strings for its items.
  • printf format, $NF, $1
    This printf is invoked for each input record the program receives. We provide it the format variable so the column data lines up with the headings appropriately. We also specify $NF and $1 for the items which refer to the filename and permission strings in the input record.
  • printf "\n\tFile count: %i\n", count
    We call one more printf in the END block and pass it a count variable, which stores the total number of records the program output. The %i format specifier renders count as an integer.

Notice that printf doesn't append a newline character to the end of its output; it is therefore necessary to provide one explicitly.
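Putting the pieces from the list above together gives something like the following sketch. It is reconstructed from the description in this section, so the script in the repository may differ in its details, and the sample ls -l lines fed to it are made up:

```shell
printf '%s\n' \
    'total 8' \
    '-rw-r--r-- 1 user users 120 Apr 15 2020 notes.txt' \
    '-rwxr-xr-x 1 user users 300 Apr 15 2020 run.sh' |
awk '
BEGIN {
    format = "%-20s%s\n"                      # shared by headings and rows
    printf format, "Filename", "Permissions"
    printf format, "--------", "-----------"
}
NR > 1 {                                      # skip the initial total line
    printf format, $NF, $1
    ++count
}
END { printf "\n\tFile count: %i\n", count }
'
# -| Filename            Permissions
# -| --------            -----------
# -| notes.txt           -rw-r--r--
# -| run.sh              -rwxr-xr-x
# -|
# -| 	File count: 2
```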

In addition to %s and %i, many more format specifiers are available to use with printf. A format specifier always starts with the percent character (%) and ends with a format-control letter - the most useful of which are listed here:

Format specifiers available to the printf function.

Modifiers are also specified in between a format specifier’s % character and format-control letter. They can be used for things like controlling number precision, padding and widths. Modifiers are completely optional, but I’ll provide a list of some portable ones here for completeness:

Modifiers to tweak printf format specifiers.

Built-In Functions

The AWK programming language provides many built-in functions that can be called in your programs. Built-in functions fall into three categories: numeric, string, and I/O. For example, the samples/filemod.awk program uses the following functions:

Built-in functions used in samples/filemod.awk.

Of particular interest here is the system(command) function, which enables us to pass a string representing a shell command that is subsequently executed. The filemod.awk program utilizes this function to invoke a mv command in order to rename files:

command = sprintf("mv %s %s\n", filename, target)
system(command)

The sprintf function first constructs and returns the command string before it is passed to system(). As you can imagine, the ability to dynamically construct arbitrary commands and execute them within a program is a powerful feature.

It is important not to leave a space between a function name and its opening parenthesis, as doing so could confuse the AWK interpreter depending on where the function is being called. I encourage you to check out the AWK programming manual to discover the many functions available to use in your programs.
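Because system() hands its string to the shell, filenames containing spaces need quoting. The sketch below is one hedged way to build such a command, and is not how filemod.awk does it: the filename is invented, the command is printed as a dry run instead of being executed, and plain double quotes would still break on names that contain ", $ or backtick characters:

```shell
awk 'BEGIN {
    filename = "My File.TXT"                  # hypothetical filename with a space
    target   = tolower(filename)
    # Quote both names so the shell treats each one as a single argument
    command  = sprintf("mv -- \"%s\" \"%s\"", filename, target)
    print command                             # dry run: show the command only
    status = system("exit 3")                 # system() returns the exit status
    print "exit status:", status
}'
# -| mv -- "My File.TXT" "my file.txt"
# -| exit status: 3
```

Checking the value returned by system() lets a program react when a command such as mv fails.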

In Conclusion

I hope that the content presented in this article has sparked an interest in the AWK programming language. With the fundamental concepts of how AWK works under your belt, I’m sure it will become a useful utility in your Linux toolbox for analyzing and filtering data on your system.

Keep an eye out for more AWK related articles from me in the future. I also recommend a couple of resources to continue your learning:

MSc. Programmer and fan of open source software.
