Using AWK to Ease Your CSV Manipulation
We often hear that we should use the right tool for the right job. The problem for developers is becoming aware of those different tools. So I want to save you some time and introduce you to an old tool that is good for simple processing of CSV files. The language is so simple that an experienced programmer can pick it up in an afternoon.
Awk is a text processing utility that happens to be a programming language. It was created back in the 70s by Aho, Weinberger, and Kernighan, hence its name. Awk was probably most popular during the 80s, until Perl, strongly inspired by Awk, replaced it.
So is Awk obsolete? As a general-purpose language, it is. But when used for text processing, such as when we work with CSV files, it is a good tool to have around.
Awk's sweet spot is tasks that would be too cumbersome to do in Excel but are not complex enough to justify using a language like Python.
Awk was designed for text manipulation. It automatically cycles through every line in a file, meaning that you don't need to write any boilerplate code for opening the file and iterating through it. It also automatically breaks each line into fields. It has built-in variables for values commonly used in text processing, such as the field separator and the record separator, and it ships with string manipulation and mathematical functions.
Awk assumes that one is going to do text manipulation, so it has three common stages built into the execution of a program: pre-processing, processing, and post-processing. If we need to process each line conditionally by text pattern, then we can use regular expressions as event triggers followed by a block of code. When the current line matches the regular expression, the following block of code will run.
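For example, a minimal sketch of that structure (not one of this article's scripts) looks like this:
BEGIN   { print "runs before the first line" }  # pre-processing
/error/ { print "line " NR " matched: " $0 }    # runs only on lines matching /error/
END     { print "runs after the last line" }    # post-processing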
Awk works best in Linux environments, but one can run it in Windows either as a stand-alone program or through a Cygwin installation. All examples shown in this article were developed on a Windows machine running Cygwin.
Say that you need to transform GPS coordinates, which are stored in Degrees Minutes Decimal format, into Decimal format and associate them with a table of users. I will walk you through this task from the creation of test data, to the transformation of the data, and then to writing a final script that will create a .sql file that we can run to update a database. You can find the code from the examples at https://github.com/Zekiah/awk_for_csv
Generating random test data
While the example that prompted this article was based on real data, I can't share that data, so I need to create fake data. My first task will be to create a column with a fake primary key and then four more columns with random integer data. For this I need to generate integers that we can use as fake primary keys. Here is a script that will create as many columns of random integers as we ask for.
function random(shift) {
    return rand() * (10 ** shift);
}
BEGIN {
    srand(); # seed the generator; otherwise rand() repeats the same sequence on every run
    for (i = 0; i < rows; i++) {
        for (j = 0; j < columns; j++) {
            if (j + 1 >= columns) {
                printf "%d\n", random(4);
            } else {
                printf "%d,", random(4);
            }
        }
    }
}
We run it by entering in the shell:
gawk -v rows=300 -v columns=5 -f create_test_data_simple.awk > test_data.txt
Let me briefly explain some features of the code. First, Awk has a C-style syntax, so most programmers should find it easy to follow what is happening. I am first creating a function called random() that will create a random real number. Since rand() creates a random number between 0 and 1, the function shifts the decimal point a few places if that is what we want to see for our test data.
We then find the BEGIN block. BEGIN is the pre-processing event. This block will run before Awk starts processing a file line by line. Since this script is not going to process an input file, all of our code is going to be located here.
The next part of the code consists of two loops. These create the rows and columns that we need for our test data. I am using printf so that I can loop through the column creation, guarding for the last column: once we reach it, the script prints a newline character instead of a comma. Notice that variables have no types. Awk is a dynamic language and it will convert a value between number and string depending on the context. This makes working with values easier, since there is no need to explicitly cast a value back and forth.
Awk has two printing statements: print and printf. print will automatically add a new line to the output; printf will not. Since I want to print a row through a loop, I need to use printf. printf more or less works the way printf() works in C, so the formatting that one learns in C will mostly work in Awk as well. So %d will print the number as an integer even though random() generates real numbers.
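A quick illustration of the difference (again, not part of the article's scripts):
BEGIN {
    print "one", "two";            # emits "one two" followed by a newline
    printf "%s-%s", "one", "two";  # emits "one-two" with no newline
}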
Notice that there is no place where I am assigning values to the variables rows or columns. I am assigning those values when I run the script from the command line using the -v option. We need the -f option to tell awk to read the 'create_test_data_simple.awk' file as the source of the script.
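The result is one CSV row per line. Since the data is random, your values will differ, but the rows will look something like this:
8472,391,5503,922,3361
104,9918,4470,287,6645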
Adding test coordinate columns to our test file
Now let's add random coordinate numbers to test_data.txt, appending fake GPS-style coordinates to each row.
function flip() {
    return int(rand() * 2);
}
function randomDegree(range, axis) {
    direction["lat"][1] = "N";
    direction["lat"][-1] = "S";
    direction["lon"][1] = "E";
    direction["lon"][-1] = "W";
    degree = int(rand() * range);
    decimal = rand();
    sign = 1;
    if (flip()) {
        sign = -1;
    }
    switch (format) {
    case "dmd":
        # degrees, a space, decimal minutes, and the cardinal letter, e.g. "45 23.4354N"
        result = degree " " (decimal * 60) direction[axis][sign];
        break;
    default:
        result = sign * (degree + decimal); # the sign applies to the whole value
    }
    return result;
}
function randomLat() {
    return randomDegree(90, "lat");
}
function randomLon() {
    return randomDegree(180, "lon");
}
{
    print $0 "," randomLat() "," randomLon();
}
And we run it with the following command:
gawk -v format=dmd -f create_geo_data.awk test_data.txt > geo_test.csv
randomDegree() is the heart of the script. I start by creating an array that will hold the cardinal points for the desired random degree. The next part of the code creates the necessary components of a random decimal coordinate: the degrees, the decimal part, and the sign. Then we use a switch block to check whether the user wants the Degrees Minutes Decimal format. If that is the case, we print the degrees, compute the minutes from the decimal component, and add the cardinal point.
At the very end I have the processing block. This block will execute for every line that is read from the file. I first print the full line by using the $0 variable, then append the results of randomLat() and randomLon(). This way we are building a test CSV file with a primary key, fake data columns, and fake geographical data.
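With format=dmd, a row of geo_test.csv might look like this (the values are random, so yours will differ):
8472,391,5503,922,3361,45 23.4354N,120 11.2801W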
Converting Coordinates from Degrees Minutes Decimal to Decimal
By this point I have replicated a version of the data I originally had. Now we can face the task that made me use awk in the first place: transforming GPS coordinates into decimals. I am also going to append these columns to the current rows. The general script is below:
function decompose(value, result) {
    size = length(value);
    # adjust how many characters to grab according to your data
    result["numeric"] = substr(value, 1, (size - 1));
    result["direction"] = substr(value, size, 1);
}
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"; # the pattern to use for CSV fields
    sign["N"] = "";
    sign["S"] = "-";
    sign["E"] = "";
    sign["W"] = "-";
}
{
    # adjust the lat/lon fields from your worksheet here
    split($lat, lat_raw, " ");
    split($lon, lon_raw, " ");
    decompose(lat_raw[2], lat_dec);
    decompose(lon_raw[2], lon_dec);
    lat_decimal = lat_dec["numeric"] / 60;
    lon_decimal = lon_dec["numeric"] / 60;
    lat_sign = lat_dec["direction"];
    lon_sign = lon_dec["direction"];
    print $0 "," sign[lat_sign] lat_raw[1] + lat_decimal "," sign[lon_sign] lon_raw[1] + lon_decimal;
}
We run it from the shell with the following command:
gawk -v lat=6 -v lon=7 -f to_decimal.awk geo_test.csv > converted.csv
The Degrees Minutes Decimal format is problematic. A value in this format can look like this: "45 23.4354W". Breaking the format up, we have two numeric components and then a character at the end of the second. We first split the lat/lon value into its degrees part and the minutes-plus-direction string. We then pass the minutes-plus-direction string to a function that breaks it up into its numeric and direction components. I needed both components back, so I am passing an array. All arrays in awk are associative arrays; they all hold key-value pairs. When one is used as a traditional indexed array, awk simply supplies numeric keys. Arrays are always passed by reference, so by setting the two values I am returning two values at once.
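As a worked example, "45 23.4354W" splits into the degrees 45 and the string "23.4354W", which decompose() breaks into the minutes 23.4354 and the direction W; the converted value is then -(45 + 23.4354/60) ≈ -45.3906.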
In the BEGIN clause we are using a built-in variable found in gawk called FPAT. This variable allows us to break fields up by pattern, which is important when processing CSV data whose fields contain text with commas. The pattern, taken from gawk's manual, respects CSV's double-quote escaping. After setting the FPAT variable, I set up an array that holds the sign values for the eventual decimal number.
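To see why FPAT matters, here is a small illustration with made-up data: with a plain comma separator, the quoted field below would be split in two, but with the FPAT pattern it stays intact.
echo '1,"Doe, Jane",42' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }'
# prints: "Doe, Jane"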
In the processing block we are doing a lot on every line. Every time we get a line, we first split the coordinate fields using the split() built-in function. split() takes the string to split, an array to fill, and the separator to split on. The lat and lon variables, declared at the command line, hold the numbers of the columns that contain the coordinate values. By adding the $ sign in front of the variables, we can access the automatic field variables that awk creates. Notice how I am declaring the array as I execute the split() function; arrays are created when they are first used.
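For instance, splitting one coordinate string works like this:
split("45 23.4354N", parts, " ");
# parts[1] == "45", parts[2] == "23.4354N"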
Then I do the necessary calculations to get the decimals for lat and lon by using the numeric component from the arrays that collected those values through the decompose() function. I then get the sign. Finally I print the full line followed by the new converted columns.
Quickly writing a SQL update file
Finally, we get ready to write the converted values back to a database table. This script is very simple.
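A minimal sketch of update.awk follows; the table name locations and the column names id, lat, and lon are hypothetical stand-ins, so adjust them, along with the field numbers, to match your schema.
BEGIN {
    FS = ","; # our generated data has no quoted fields, so a plain comma separator is enough
}
{
    # Columns 8 and 9 of converted.csv hold the decimal coordinates; column 1 holds the primary key.
    print "UPDATE locations SET lat = " $8 ", lon = " $9 " WHERE id = " $1 ";";
}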
And we run this code with the following command:
gawk -f update.awk converted.csv > update.sql
This time we are using the BEGIN block to set the built-in variable FS, the Field Separator. When we know that the data is simple, that it contains no quoted strings with embedded commas, we can use FS instead of FPAT.
Then we print the update statement. Once we run this script, we have a SQL script that we can run against our database. Of all of the scripts that I have shared so far, this is the one that I find most useful as a pattern. Technically we don't need it as a script; we could have written it as a single command. But I like the documentation that a script brings with it, especially when one is busy, needs to create a similar script in the future, and would like an example to start from.
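With the hypothetical names above, update.sql contains one statement per row, along these lines:
UPDATE locations SET lat = 45.3906, lon = -120.188 WHERE id = 8472;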
I hope that by now you can see how Awk can be useful for the right tasks. It is an easy language to use, and it makes some text processing tasks easier to perform.
This post was written by:
Hugo
Senior Engineer
Contact us for more information on this post, the techniques discussed here, or Zekiah’s system integration and software development services.