Awk command line or script that helps you sort text files (recommended)

Awk is a powerful tool that can perform some tasks that might be accomplished by other common utilities, including sort.

Awk is a ubiquitous Unix command for scanning and processing text containing predictable patterns. However, since it has functional capabilities, it can also be reasonably called a programming language.

Confusingly, there is more than one awk. (Or, if you think there's only one, then the others are clones.) There's awk (the original program written by Aho, Weinberger, and Kernighan), and then there's nawk, mawk, and the GNU version, gawk. GNU awk is a highly portable free software version of the utility with several unique features, so this article is about GNU awk.

Although its official name is gawk, on GNU+Linux systems, it is aliased as awk and used as the default version of the command. On other systems that do not come with GNU awk, you must install it first and call it gawk instead of awk. This article uses the terms awk and gawk interchangeably.

awk is both a command language and a programming language, which makes it a powerful tool for tasks that are otherwise reserved for sort, cut, uniq, and other common utilities. Fortunately, there's plenty of room for redundancy in open source, so if you're faced with the question of whether to use awk, the answer is probably a resounding "whatever."

The beauty of awk's flexibility is that once you've decided to use awk to accomplish a task, you can continue to use awk no matter what comes next. This includes the eternal need to sort data in an order other than the one in which it was delivered to you.

Sample Dataset

Before exploring awk's sorting methods, generate a sample data set to work with. Keep it simple so you don’t get bogged down by edge cases and unexpected complexities. This is the sample set used in this article:

Aptenodytes;forsteri;Miller,JF;1778;Emperor
Pygoscelis;papua;Wagler;1832;Gentoo
Eudyptula;minor;Bonaparte;1867;Little Blue
Spheniscus;demersus;Brisson;1760;African
Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Torvaldis;linux;Ewing,L;1996;Tux

This is a small dataset, but it provides a variety of data types:

  • Genus and species names, related but separate
  • A surname, sometimes followed by a comma and initials
  • An integer representing a year
  • Arbitrary term
  • All fields are separated by semicolons.

Depending on your educational background, you might think of this as a two-dimensional array or table, or just a row-delimited collection of data. How you view it is your problem; awk only recognizes text. It's up to you to tell awk how you want it to be parsed.

Just want to sort

If you only want to sort a text dataset by a specific definable field (like a "cell" in a spreadsheet), you can use the sort command.
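
For comparison, here is a minimal sketch of the sort-only route. It assumes the sample data above is saved as penguins.list (the file name used later in this article) and relies only on the standard -t (delimiter) and -k (key) options of sort.

```shell
# Recreate the article's sample data as penguins.list.
cat > penguins.list <<'EOF'
Aptenodytes;forsteri;Miller,JF;1778;Emperor
Pygoscelis;papua;Wagler;1832;Gentoo
Eudyptula;minor;Bonaparte;1867;Little Blue
Spheniscus;demersus;Brisson;1760;African
Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Torvaldis;linux;Ewing,L;1996;Tux
EOF

# -t sets the field delimiter; -k 2,2 limits the sort key to field 2 only.
sort -t ';' -k 2,2 penguins.list

# Appending n compares field 4 as a number rather than as text.
sort -t ';' -k 4,4n penguins.list
```

If this covers your needs, you may not need awk at all; the rest of the article is for cases where the sorting is one step in a larger awk job.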

Fields and Records

Regardless of the format of the input, you must find patterns in it so you can focus on the parts of the data that are important to you. In this example, the data is delimited by two factors: rows and fields. Each row represents a new record, just like you would see in a spreadsheet or database dump. Within each row, there are different fields (think of them like cells in a spreadsheet) separated by semicolons (;).

Awk processes only one record at a time, so when you write the instructions you send to awk, you can focus on a single record. Work out what you want to do with one line of data, then test it on the next line (either mentally or with awk), and then on a few more. Along the way, you inevitably make some assumptions about the data your awk script will process so that it can return the data in the structure you want.

In this example, it's easy to see that each field is separated by a semicolon. For simplicity, assume that you want to sort the list by the first field in each row.

Before you can do the sorting, you have to be able to get awk to focus only on the first field of each line, so that's the first step. The syntax of the awk command in the terminal is awk, followed by relevant options, and finally the data file to be processed.

$ awk --field-separator=";" '{print $1;}' penguins.list
Aptenodytes
Pygoscelis
Eudyptula
Spheniscus
Megadyptes
Eudyptes
Torvaldis

Because the field separator is a character that has special meaning to the Bash shell, you must enclose the semicolon in quotes or precede it with a backslash. This command serves only to prove that you can focus on a specific field. You can try the same command using another field number to see the contents of another "column" of data:

$ awk --field-separator=";" '{print $3;}' penguins.list
Miller,JF
Wagler
Bonaparte
Brisson
Milne-Edwards
Viellot
Ewing,L

We haven't done any sorting yet, but this is a good foundation.
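
As a small variation on the commands above, the following sketch uses -F, the short POSIX spelling of the field-separator option, which should also work in awk implementations that lack GNU long options. It prints several fields at once. The three input lines are an excerpt of the sample data, piped in with printf so the snippet runs on its own.

```shell
# Three records from the sample data, piped in so no file is needed.
printf '%s\n' \
  'Aptenodytes;forsteri;Miller,JF;1778;Emperor' \
  'Pygoscelis;papua;Wagler;1832;Gentoo' \
  'Eudyptula;minor;Bonaparte;1867;Little Blue' |
awk -F ';' '{ print $5 ": " $1, $2 }'
# Prints:
#   Emperor: Aptenodytes forsteri
#   Gentoo: Pygoscelis papua
#   Little Blue: Eudyptula minor
```

The comma in the print statement inserts awk's output field separator (a space by default), while adjacent strings are simply concatenated.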

Scripting

awk is more than just a command; it is a programming language with indexing, arrays, and functions. This matters because it means you can take a list of fields to sort by, store the list in memory, process it, and then print the resulting data. For a complex series of operations such as this, it is easier to work in a text file, so create a new file called sorter.awk and enter the following text:

#!/usr/bin/gawk -f
BEGIN {
    FS=";";
}

This establishes the file as an awk script that executes the lines it contains.

The BEGIN statement is a special setup block provided by awk for tasks that need to be performed only once. Here it defines the built-in variable FS, which stands for field separator and holds the same value you set with --field-separator on the awk command line. It only needs to be set once, so it belongs in the BEGIN statement.
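
To see that ordering in action, here is a short sketch (not part of sorter.awk) in which BEGIN both sets FS and prints a marker line. The marker text is invented for illustration, and the two input records are taken from the sample data.

```shell
printf '%s\n' \
  'Aptenodytes;forsteri;Miller,JF;1778;Emperor' \
  'Pygoscelis;papua;Wagler;1832;Gentoo' |
awk '
BEGIN { FS = ";"; print "BEGIN ran" }   # runs once, before any input is read
      { print NR ": " $1 }              # runs once per record; NR is the record number
'
# Prints:
#   BEGIN ran
#   1: Aptenodytes
#   2: Pygoscelis
```

Because FS is assigned before the first record is read, even the first line is split on semicolons.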

Arrays in awk

You already know how to collect the value of a specific field by using the $ symbol and the field number, but in this case, you want to store it in an array instead of printing it to the terminal. This is done with awk arrays. The important thing about an awk array is that it contains keys and values. Imagine the content of this article; it would look like this: author:"seth",title:"How to sort with awk",length:1200. Elements such as author, title, and length are keys, and the content that follows is the value.
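
The key and value idea can be sketched directly in awk. The array name article and the variable names key and n are arbitrary choices made for this illustration.

```shell
awk 'BEGIN {
    # Keys go in square brackets; values are assigned with =.
    article["author"] = "seth"
    article["title"]  = "How to sort with awk"
    article["length"] = 1200

    # Look up a value by its key.
    print article["title"]

    # for (key in array) visits every key, in no guaranteed order.
    for (key in article) n++
    print n " keys"
}'
# Prints:
#   How to sort with awk
#   3 keys
```

The unpredictable order of for (key in array) is exactly why a separate sorting step is needed.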

The benefit of doing this in the context of sorting is that you can assign any field as the key and any record as the value, and then use the built-in GNU awk function asorti() (sort by index) to sort by the keys. For now, assume that you only want to sort by the second field.

An awk statement not enclosed by the special keywords BEGIN or END is a loop that is executed for each record. This is the part of the script that scans the data for patterns and processes accordingly. Each time awk turns its attention to a record, the statements within {} are executed (unless preceded by BEGIN or END).

To add keys and values to an array, create a variable that contains the array (in this example script, I'll call it ARRAY, which isn't very original, but it's clear), then assign to it the key in square brackets and the value with an equal sign (=).

{ # store each record in an array, keyed by its second field
  ARRAY[$2] = $0;
}

In this statement, the contents of the second field ($2) are used as the key, and the entire current record ($0) is used as the value.
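
One consequence of keying an array on a field is worth knowing: if two records share the same key, the later one replaces the earlier one. The sketch below keys on the first field and uses one real record plus a second record invented purely for this illustration.

```shell
# The second input line is a made-up record that repeats the genus Eudyptes.
printf '%s\n' \
  'Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper' \
  'Eudyptes;hypothetical;Nobody;1900;Made-up duplicate genus' |
awk -F ';' '
    { ARRAY[$1] = $0 }                            # same key twice: the second record wins
END { for (k in ARRAY) print k " -> " ARRAY[k] }
'
# Prints:
#   Eudyptes -> Eudyptes;hypothetical;Nobody;1900;Made-up duplicate genus
```

The sample data in this article has no repeated values in any field, so the approach is safe here, but it is an assumption to check before reusing the technique on other data.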

asorti() function

In addition to arrays, awk has some basic functions that you can use as quick and easy solutions for common tasks. One of the functions introduced in GNU awk, asorti(), sorts an array by key (index); its companion, asort(), sorts by value.

You can only sort the array after it has been populated, which means that this operation cannot be triggered for each new record, but only at the very end of the script. For this purpose, awk provides the special END keyword. Like BEGIN, an END statement fires only once, but it does so after all records have been scanned.

Add these to your script:

END {
  asorti(ARRAY,SARRAY);
  # get length
  j = length(SARRAY);
  
  for (i = 1; i <= j; i++) {
    printf("%s %s\n", SARRAY[i],ARRAY[SARRAY[i]])
  }
}

The asorti() function takes the indices of ARRAY, sorts them, and stores them as the values of a new array called SARRAY (an arbitrary name I invented for this article that stands for "sorted ARRAY"), which is itself indexed by integers starting at 1.

Next, the variable j (another arbitrary name) is assigned the result of the length() function, which counts the number of items in SARRAY.

Finally, a for loop is used to iterate through each item in SARRAY using the printf() function to print each key and then print the corresponding value for that key in ARRAY.
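
To make the relationship between the two arrays concrete, here is a small standalone sketch. It assumes GNU awk is installed under the name gawk, and it fills ARRAY by hand with three species and common names from the sample data instead of reading a file.

```shell
gawk 'BEGIN {
    ARRAY["papua"]    = "Gentoo"
    ARRAY["forsteri"] = "Emperor"
    ARRAY["minor"]    = "Little Blue"

    # asorti() copies the sorted keys of ARRAY into SARRAY[1..n]
    # and returns n, the number of elements.
    n = asorti(ARRAY, SARRAY)

    for (i = 1; i <= n; i++)
        print i, SARRAY[i], ARRAY[SARRAY[i]]
}'
# Prints:
#   1 forsteri Emperor
#   2 minor Little Blue
#   3 papua Gentoo
```

ARRAY itself is left untouched, which is why the loop can still look up each value with ARRAY[SARRAY[i]].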

Run the script

To run your awk script, make it executable:

$ chmod +x sorter.awk

Then run it against the penguins.list example data:

$ ./sorter.awk penguins.list
antipodes Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
chrysocome Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
demersus Spheniscus;demersus;Brisson;1760;African
forsteri Aptenodytes;forsteri;Miller,JF;1778;Emperor
linux Torvaldis;linux;Ewing,L;1996;Tux
minor Eudyptula;minor;Bonaparte;1867;Little Blue
papua Pygoscelis;papua;Wagler;1832;Gentoo

As you can see, the data is sorted by the second field.

This is a bit limiting. It would be nice to have the flexibility to choose at runtime which field to use as the sort key so that I could use this script on any dataset and get meaningful results.

Add command options

You can pass a value into an awk script from the command line by referring to a variable (here named var) in the script and setting it with the -v option when you run the script. Change the script so that the per-record clause uses var when populating the array:

{ # dump each field into an array
  ARRAY[$var] = $0;
}

Try running the script so that it sorts by the third field, using the -v option to set var:

$ ./sorter.awk -v var=3 penguins.list
Bonaparte Eudyptula;minor;Bonaparte;1867;Little Blue
Brisson Spheniscus;demersus;Brisson;1760;African
Ewing,L Torvaldis;linux;Ewing,L;1996;Tux
Miller,JF Aptenodytes;forsteri;Miller,JF;1778;Emperor
Milne-Edwards Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Viellot Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Wagler Pygoscelis;papua;Wagler;1832;Gentoo
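
The reason $var works is that awk evaluates var first and then uses the result as a field number. A minimal sketch, using one record from the sample data:

```shell
# With var=4, $var is the same as $4.
printf '%s\n' 'Pygoscelis;papua;Wagler;1832;Gentoo' |
awk -F ';' -v var=4 '{ print $var }'
# Prints: 1832

# Without -v, var is unset and evaluates to 0, so $var is $0, the whole record.
printf '%s\n' 'Pygoscelis;papua;Wagler;1832;Gentoo' |
awk -F ';' '{ print $var }'
# Prints: Pygoscelis;papua;Wagler;1832;Gentoo
```

This means that running sorter.awk without -v var=NUM keys each record on the whole line, which effectively sorts the records as plain text.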

Revision

This article demonstrates how to sort data in pure GNU awk. You can improve the script to make it more useful to you: spend some time studying the awk functions in the gawk man page, and customize the script for better output.
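
As one possible direction for such improvements, note that asorti() compares indices as strings by default, so numbers of different lengths do not sort numerically. GNU awk 4.0 and later accept an optional third argument that selects the comparison. The sketch below assumes gawk is installed and uses made-up numeric keys to show the difference.

```shell
gawk 'BEGIN {
    A[9] = "nine"; A[10] = "ten"; A[100] = "hundred"

    # Default: indices are compared as strings, so "10" sorts before "9".
    n = asorti(A, S)
    for (i = 1; i <= n; i++) printf "%s%s", S[i], (i < n ? " " : "\n")

    # "@ind_num_asc": indices are compared as numbers, ascending.
    n = asorti(A, S, "@ind_num_asc")
    for (i = 1; i <= n; i++) printf "%s%s", S[i], (i < n ? " " : "\n")
}'
# Prints:
#   10 100 9
#   9 10 100
```

The year field in the sample data happens to sort correctly either way because every year has four digits.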

Here is the complete script so far:

#!/usr/bin/gawk -f
# GPLv3 appears here
# usage: ./sorter.awk -v var=NUM FILE
BEGIN { FS=";"; }
{ # dump each field into an array
  ARRAY[$var] = $0;
}
END {
  asorti(ARRAY,SARRAY);
  # get length
  j = length(SARRAY);
  
  for (i = 1; i <= j; i++) {
    printf("%s %s\n", SARRAY[i],ARRAY[SARRAY[i]])
  }
}
