Data exploration and manipulation tips

Programs like grep, sed and awk have been around for decades, and they have certainly stood the test of time: some of these tools are significantly faster than some of the fancy new data science programs out there (as long as your data is not weirdly formatted, at least). It should be no surprise that people who manipulate data still use them, because Unix tools are a wonder.

I’ve used programs like sed and awk more and more as my PhD progresses, but there’s always room for improvement. Here are a couple of methods I’ve been experimenting with to make my data exploration and manipulation on the command line more efficient.

Here I’m assuming you know a bit about these Unix tools. If you don’t, don’t worry: there are a lot of tutorials out there that could help you, and I’m preparing an introduction for a future post, reworking a beginners’ course that Pedro Tiago Martins designed a while ago.

Ctrl x + Ctrl e command editing

I used Sublime Text for a long time. While I still think it’s a great program and sometimes still resort to it, I’ve been trying to learn some VIM, in part because I’m a nerd, but also because there’s inherent value to it in terms of productivity and efficiency when you master it.

Here’s a useful tip I learned recently: if you type Ctrl x + Ctrl e while in the command line, the command you are writing at the moment opens in your text editor, where you can edit it. With any editor, that’s a big deal when you have a typo in the middle of a very long pipeline, but with VIM’s capacity to move around text in two keystrokes it becomes an impressive time saver.
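In Bash, this shortcut opens the editor named in $VISUAL, then $EDITOR, and falls back to emacs if neither is set, so it only lands you in VIM if one of those variables says so. A minimal sketch of the lines you could add to your .bashrc, assuming Bash and that vim is installed:

```shell
# Make Ctrl x + Ctrl e (and other programs that ask for an editor) open vim.
# Assumption: vim is installed and on your PATH.
export VISUAL=vim
export EDITOR="$VISUAL"
```

Keep in mind that Bash runs the edited command as soon as you save and quit the editor.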

Using the ‘pipeline’ command

If you are anything like me, you probably write your pipelines in your console, run them only to find the latest error, and edit the faulty commands until the desired output is printed. That’s a lot of going back to previous commands and pressing Enter.

I recently found out about pipeline, a utility that prints a live, less-like output of your pipeline as you write it. My only problem with it is that it is not compatible with the Ctrl x + Ctrl e approach to writing pipes I mentioned above.

Here’s an example of usage as shown in the command’s GitHub page: example

Custom aliases and functions for basic (but wordy) operations

This is not new at all for those of you who have a better command of the shell, but I recently started applying it more thoroughly and I’ve found it very useful.

Over time I’ve found myself writing the same basic operations many times. In some cases it’s things like sort file | uniq (the common operation of sorting a file and printing each distinct line only once). Cases like this are not a big deal to write. However, I like to use a lot of awk, especially in combination with things like sed. And awk can be extremely wordy.
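A small sketch of how the usual sort and uniq variants differ, since they are easy to mix up. The sample file and its contents are made up for the illustration:

```shell
# Throwaway sample data (hypothetical), written to a temporary file.
sample=$(mktemp)
printf 'pear\napple\npear\nfig\n' > "$sample"

# Each distinct line once; sort -u is a shorter equivalent.
distinct=$(sort "$sample" | uniq)
same=$(sort -u "$sample")

# Only the lines that occur exactly once in the file.
singles=$(sort "$sample" | uniq -u)

# Each distinct line, prefixed with the number of times it occurs.
counts=$(sort "$sample" | uniq -c)

printf '%s\n' "$distinct" "---" "$singles" "---" "$counts"
rm -f "$sample"
```

Note that uniq only compares adjacent lines, which is why the sort has to come first.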

Take a basic operation like printing the first column of a file: awk '{print $1}' filename. Those opening curly brackets and apostrophes are my bane: I used to find myself constantly forgetting one of them. Of course, this is a bit of a silly example, but you can make it as complex as you need (and awk does have a way of getting complicated). Still, I often find myself wanting to check a particular column to, for example, see how many elements are duplicated, and I used to run into this small nuisance often.
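For the duplicate check mentioned above, one possible pipeline is sketched here. The sample file and its contents are invented for the illustration:

```shell
# Hypothetical whitespace-separated sample file.
data=$(mktemp)
printf 'a 1\nb 2\na 3\nc 4\nb 5\n' > "$data"

# First column -> sorted -> one copy of each value that appears more than once.
awk '{print $1}' "$data" | sort | uniq -d

# Count how many distinct values are repeated.
n_dup=$(awk '{print $1}' "$data" | sort | uniq -d | wc -l)
echo "repeated values: $n_dup"

rm -f "$data"
```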

The solution is to create a custom function for cases like this, stored in your .local/bin/ folder as a script with a #!/usr/bin/env bash shebang (many shebangs work, but this one apparently ensures cross-compatibility, since it looks bash up in your PATH instead of assuming a fixed location).

Storing it there ensures that you can call the function from anywhere in the system (as long as that folder is in your PATH). Another option, for shorter commands or functions, is to keep them as aliases in your .bashrc file (for example, I use alias killall='jobs -p | xargs kill' to, well, kill all the running jobs).
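Calling a script by name from anywhere depends on two things: the folder has to be listed in your PATH, and the script has to be executable. A minimal sketch, assuming Bash and the ~/.local/bin location; the script name in the last comment is hypothetical:

```shell
# Assumption: personal scripts live in ~/.local/bin, as in the text.
bindir="$HOME/.local/bin"

# Add the folder to PATH for this session if it is not listed yet.
# (Put the export line in your .bashrc to make it permanent.)
case ":$PATH:" in
    *":$bindir:"*) ;;                    # already listed, nothing to do
    *) export PATH="$bindir:$PATH" ;;
esac

# A script also needs execute permission before it can be called by name,
# e.g. for a hypothetical script called myscript:
#   chmod +x "$bindir/myscript"
```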

So, for the previous case, a custom function for the awk single-column problem is as simple as keeping a file there called, for example, awkcol (from awk + column) that looks like this:

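#!/usr/bin/env bash
# Usage: awkcol COLUMN_NUMBER FILE

COLN=$1
# Pass the column number to awk as the variable co; $co is then that field.
awk -v co="$COLN" '{print $co}' "$2"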

See it here in action:
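A minimal sketch of a session, assuming the script above is saved as awkcol and made executable. The temporary directory (a stand-in for .local/bin/) and the sample file are invented for the illustration:

```shell
# Recreate the awkcol script in a temporary directory, so nothing in your
# home folder is touched.
tmpdir=$(mktemp -d)
cat > "$tmpdir/awkcol" <<'EOF'
#!/usr/bin/env bash

COLN=$1
awk -v co="$COLN" '{print $co}' "$2"
EOF
chmod +x "$tmpdir/awkcol"
PATH="$tmpdir:$PATH"

# Hypothetical sample data: three whitespace-separated columns.
printf 'id1 alpha 10\nid2 beta 20\nid3 gamma 30\n' > "$tmpdir/sample.txt"

# Print the second column.
second=$(awkcol 2 "$tmpdir/sample.txt")
printf '%s\n' "$second"
# alpha
# beta
# gamma
```

The column number comes first and the file second, matching the $1 and $2 in the script.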

What I personally do is the following: the terminal keeps a history of everything, so from time to time I check it to see if there’s anything I could be making easier. Things I’ve slowly integrated into my workflow are:

function fullgit() {
    git add .
    git status
    git commit -a -m "$1"
    git push
}
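The habit of reviewing the shell history mentioned above can be partly automated. Here is a rough sketch, assuming Bash and its default history file; the function name topcmds is made up for this example, and it only looks at the first word of each line, so it shows which programs you call most rather than which full pipelines you repeat:

```shell
# topcmds: list the most frequent first words in a shell history file.
# Assumptions: Bash's default history file (~/.bash_history) unless another
# path is given; lines starting with '#' (history timestamps) are skipped.
topcmds() {
    local histfile="${1:-$HOME/.bash_history}"
    local n="${2:-10}"
    awk '!/^#/ && NF {print $1}' "$histfile" | sort | uniq -c | sort -rn | head -n "$n"
}
```

Calling topcmds with no arguments prints the ten most used commands with their counts; topcmds ~/.bash_history 20 would print twenty.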

There’s a caveat regarding reproducibility: of course, all these things are going to mess with your code’s reproducibility if you include these custom functions in something you want to publish, because nobody else will have them defined.

Extra note: I picked this tip up from the book Data Science at the Command Line. I found it a really nice resource and it has inspired a good part of this post. Do check it out.

Other resources you might be interested in (some of which I haven’t explored yet)


I hope some of these were useful to you! Got suggestions, found an error or want to comment on something? Please drop me a mail at munoz.andirko@ub.edu