
cut vs. awk


The UNIX philosophy

In UNIX, there is (or used to be) a philosophy which, among other things, stipulates that a piece of software should do "one thing well". There is of course a long-running debate not only over what this really means, but also over whether certain programs - if any at all - still adhere to it, or ever did.

So, we're allegedly in the middle of a dangerous downward spiral of constant feature creep and a growing number of command parameters that will, eventually, lead to a Grand Unified Program that does everything poorly. (Depending on your personal preference and current mood, replace the GUP with systemd, Emacs, Windows or Google Chrome as you see fit.)

I, for one, find the UNIX philosophy of doing one thing well an agreeable concept. I also realize that it is a product of a time when computers still mostly passed around text files, and that those text files were considered preposterously huge if they exceeded a couple of megs in size. Heck, the modern GUI was still a research project at PARC when Doug McIlroy wrote the UNIX philosophy down in 1976.

Thinkin' 'bout philosophy

So, what does it mean to do one thing and do it well? Well, awk has been a staple of any UNIX diet since the late 1970s. I think it's a lovely little tool which I use even for rather mundane tasks. It's certainly changed a bit over the years, but the core concept of the language remains the same. Still, it's a complete programming language and can do a lot more than a simple, single-purpose command.
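
To give a taste of that (a generic sketch, not tied to anything measured below, and with a made-up file name), awk happily keeps state and does arithmetic that would otherwise take a pipeline or a separate script:

# Count how many lines each owner has in "ls -l" style output:
awk '{count[$3]++} END {for (u in count) print u, count[u]}' somefile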

The cut command is slightly newer, but like awk, it's part of the POSIX standard. It can also hardly be considered to have suffered much feature creep: it has a small, strict set of parameters and really does just one thing, which is cutting a smaller piece of text out of a larger one.

The question is of course, does the UNIX philosophy still hold up? Is it always better to have one small program doing one thing well, as opposed to a slightly bigger program doing many things? Let's examine this by performing a simple task with two slightly different twists.

Task number one

Task number one was to print the second column of a three-column file with 10,000 rows, with a single space separating the columns. This task was repeated 100 times.
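
For those playing along at home, the setup might look something like the sketch below. The exact contents of the input file (cva_in1, used in the commands that follow) and the way the 100 repetitions were driven aren't specified here, so the generated data and the timing loop are merely plausible stand-ins:

# Generate a plausible three-column, 10,000-row input file.
# (The actual contents of cva_in1 aren't given in the text.)
seq 1 10000 | awk '{print "a" $1, "b" $1, "c" $1}' > cva_in1

# Time 100 repetitions of one of the pipelines below, for example:
time for i in $(seq 1 100); do cat cva_in1 | cut -d ' ' -f 2 > /dev/null; done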

cut (GNU coreutils, 43224 bytes)

cat cva_in1 | cut -d ' ' -f 2 > /dev/null

real    0m0.297s
user    0m0.288s
sys     0m0.153s

awk (gawk, 658072 bytes)

cat cva_in1 | awk '{print $2}' > /dev/null

real    0m0.818s
user    0m0.754s
sys     0m0.232s

Python 3 (Python 3.6, 4526456 bytes + libraries)

My go-to language for farting around with programming ideas, thus included for reference. The two lines of code below were written to a script through which the input was piped and, just as above, redirected to /dev/null:

import sys
for l in sys.stdin: print(l.split()[1])
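
Invoked, presumably, along these lines (the script's file name is made up here):

cat cva_in1 | python3 cva.py > /dev/null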

real    0m5.423s
user    0m4.753s
sys     0m0.873s

As is clearly evident from this highly scientific performance test, cut is by far the fastest tool for the job. I tried a few different approaches with Python (such as opening the file from the script instead of piping it), but they were all similarly slow.

Task number two

This time, things were spiced up by printing the fifth column of a file with unevenly spaced columns, using the output from "ls -l". The file used for testing contained 2235 lines. This task was also repeated 100 times. Since cut doesn't quite cope with this uneven spacing, we had to put the magic of the UNIX philosophy to work using pipes: this time, tr was involved as well, squeezing runs of spaces down to single ones to make each row understandable to cut.
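
To see why that's needed, here's a made-up "ls -l" style line (the actual contents of cva_in2 aren't reproduced here). cut treats every single space as a field separator, so the padding in aligned output shifts the field numbers around:

# A single, made-up line of "ls -l" style output:
line='drwxr-xr-x  2 user user  4096 Jan  1 00:00 somedir'

# Plain cut counts every space, so the extra padding shifts the fields:
printf '%s\n' "$line" | cut -d ' ' -f 5
# prints "user" - not the size column we were after

# Squeezing repeated spaces with tr -s makes the columns regular again:
printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 5
# prints "4096"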

cut + tr (GNU coreutils, 47288 bytes)

cat cva_in2 | tr -s ' ' | cut -d ' ' -f 5 > /dev/null

real    0m0.212s
user    0m0.371s
sys     0m0.138s

awk

cat cva_in2 | awk '{print $5}' > /dev/null

real    0m0.460s
user    0m0.379s
sys     0m0.225s

Python 3

import sys
for l in sys.stdin: print(l.split()[4])

real    0m4.450s
user    0m3.693s
sys     0m0.934s

Despite the extra pipe, it seems the UNIX philosophy comes out on top this time as well.

A few notes

This is of course not an exhaustive test and proves little, if anything at all, about the UNIX philosophy. I just wanted to perform a simple test with a scenario that I, at least, have come across many times in many different scripts and many different languages.

In a typical daily scenario, I would personally have picked awk for task two and saved myself a bit of typing and figuring. This would have made no significant impact on perceived performance, because I usually don't deal with very large files and in most cases, the time I spend with tools I'm less comfortable with far exceeds the time I then spend waiting for the script. Still: based on nothing but gut feeling, I actually expected awk to be faster or at least comparable for task two. So much for my gut.

For coping with more complex scenarios without resorting to a proper programming language, more pipes will have to be added and more command line utilities invoked if we want to keep using tools compliant with the UNIX philosophy. This may well lead to worse performance, but this test just goes to show (for the umpteenth time) that the tool we instinctively reach for to tackle a task might not be the right one.
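
As a purely hypothetical illustration of where that leads: printing the size column only for the directories in the same "ls -l" output takes a single awk invocation, but three small tools chained together if awk is off the table:

# One tool, doing a little more than one thing:
awk '/^d/ {print $5}' cva_in2

# Three tools, each doing one thing:
grep '^d' cva_in2 | tr -s ' ' | cut -d ' ' -f 5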