An introduction to the command-line interface

Too afraid to stare into the black abyss of your computer’s terminal? Do you want to learn to use it but have no clue where to start, or why you should care about tools written decades ago? Don’t worry, we’ve got you covered with all that serious stuff.

This is a beginners’ course that Pedro Tiago Martins and I originally wrote for internal use in our research group. It’s intended to be completed over the course of a couple of afternoons and assumes no previous knowledge of coding or experience with Unix systems. In this course we’ll cover:

  1. Why using the terminal is useful
  2. Basic system file manipulation
  3. Some of the most common programs you can use to manipulate data, including grep, sed and awk

While this course is by no means exhaustive, it should at least give you a foothold in the world of the terminal. The idea is that by the end of this course you should be able to start experimenting with the more complex functions of these programs on your own.

Throughout the document you’ll occasionally find exercises, as this course was intended as a hands-on learning session. While you could just read the whole thing, we recommend keeping a terminal window open to follow the exercises. We also encourage you to integrate these programs into your day-to-day workflow, as that’s the best way to learn to use them efficiently. The best way to learn is to do! Take a moment to consider how you could use the information presented here.

Disclaimer: mind that this guide might contain oversimplifications for the sake of learnability. For example, you will find certain terms being used (sometimes) interchangeably, such as terminal, bash, command line, console, or shell. Technically, these all mean different things, but for our current purposes they can indeed be used interchangeably for the most part. Once you start feeling more comfortable with the tools we list here, we encourage you to look up the exact definitions.

First of all, what is the terminal exactly?

The terminal is a text-based interface to your computer, in the same way that the windows, icons and menus you usually click on are a graphical interface. Some things are better accomplished (or only possible) through the terminal. Some things would be really tedious in a graphical user interface. Working in the terminal can, in general, be much faster than interacting with your computer in other ways.

When I started my PhD I had already migrated to Linux some time before, but I had never used the terminal much. Nowadays it’s an integral part of my work (in fact, I’m writing this in a terminal text editor called vim!), and it has made many tasks easier over the years. Additionally, I’m able to keep a record of everything I do and, for example, reuse old code to repeat otherwise cumbersome tasks. Anyone can take advantage of the terminal, and knowing how to use it is a great skill for anyone who works professionally with data, code or text files.

Here are some examples of terminal programs (but of course there are many more!): text editors such as vim, search and text-processing tools such as grep, sed and awk (all covered later in this course), document converters such as pandoc, and ssh for connecting to other machines.

Two important reasons to learn to use it

Learning to use Unix shell tools and the basics of coding in the terminal has two inherent advantages that are shared with learning to code in general: you can automate repetitive work, and you can reproduce that work later. Imagine, for instance, that a collaborator asks you to:

  • Check these 90 files
  • Delete only those that have “test” in the name.
  • Generate a table with the names of the files, ordered alphabetically.
  • Add the number of lines of each document in the table.

Of course, nothing forces you to do this in the terminal, but imagine the amount of time it would take, and what would happen if, for example, a collaborator tells you that the files you are using are wrong and you have to do the task again. With a script you can do this in a couple of seconds, as many times as needed.
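
To make that concrete, here is a minimal sketch of how such a task could look in the terminal, using commands we’ll meet later in this course (it assumes the 90 files are plain-text .txt files sitting in your working directory):

rm -- *test*.txt              # delete only the files with "test" in the name
wc -l *.txt > table.txt       # write each remaining file's line count next to its name

Because the shell expands *.txt in alphabetical order, table.txt comes out sorted by file name.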

Reproducibility also matters whenever you are stuck with errors. If you share your code along with your problem, people will be able to reproduce the problem itself and tell you where you need to fix it. Very often, when you google how to do something (regardless of the tool), you get a solution that involves the terminal, sometimes without it even being mentioned. You will start not only recognizing what the solution is about in these cases, but also being able to apply it. For example, one time Pedro was helping edit a book in LaTeX, and he asked one of the people involved how to do something. The email he got in reply was the following, no more, no less:

sed -i s/.*\\emph.*// main.adx

One could argue that this is too short an email reply, but being acquainted with the terminal was enough to use the solution. That is a one-liner: a very short program that solves a specific problem. We’ll talk about this specific program (sed) later on.

A note on operating systems

The terminal is an interface readily available in UNIX-like systems. What does that mean? If you use any flavor of Linux or MacOS, you’re set. You can skip right to the next section. If you use Windows, however, this is not the case. You will need some extra stuff to be able to use it, since Windows is not a UNIX-like system. Here are some things you can do in that case:

  • Install the Windows Subsystem for Linux (WSL), which gives you a working Linux environment inside Windows.
  • Install a compatibility environment such as Git Bash or Cygwin, which provides many of the tools covered here.
  • Run Linux in a virtual machine, or install it alongside Windows.

Enough talk

Let’s get to the real work!

Opening the terminal

The terminal is a program like any other. Look for it where you have other programs on your computer, and you’ll find it. In most Linux distributions there is also a default shortcut that opens it: ctrl + alt + t. Otherwise, you can just bring up the search bar and type in “Terminal”.

You will be greeted with a prompt, waiting for commands to be typed in. Depending on your system, you will see something like the name of your computer, your username and your current location in the file structure of your computer. Typically, your initial location is your home folder, which has the same name as your username. It is in this folder that other, familiar folders are located (Desktop, Downloads, etc.). Indeed, you are always “somewhere” in your computer when you’re in the terminal.

Your first basic commands

A command can do something on its own, or have arguments (which can be optional or mandatory).

An example of a command that has no arguments, that is, it’s just one term on its own, is pwd, which stands for “print working directory”. Here, as in most computing tasks, printing means “show on the screen”, and working directory is the directory you are currently in.

Practice: Type in pwd and press Enter. The result of this command will show up in the line below (in this case, pwd prints your current working directory). Another command is ls (short for ‘list’), which lists all the files in the directory you’re in. Type in ls and press Enter, and you will see all the folders and files listed below.
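
To give you an idea of what to expect, the exchange might look something like this (your username and folders will of course differ):

pwd
    /home/yourname
ls
    Desktop  Documents  Downloads  Music  Pictures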

Here you will already notice a difference compared to your graphical interface (the way you usually navigate your computer). There, a folder with too many files can make your computer slow or even unusable while it tries to display and navigate through them all. This does not happen in the terminal.

Now let’s look at commands that require you to give them something to work with.

One such command is cd, for ‘change directory’. As you might have guessed, this command allows you to go to a different directory from your current one. In other words, that’s how you go from one folder to another. For example, try moving to one of your folders, like the Downloads one if you have one.

cd Downloads

This will make you go to your Downloads folder. To move into a specific folder you have to give its name as an argument of the command (that is, cd Downloads); typing cd on its own simply takes you back to your home folder. If you type in pwd now, you will see your directory has indeed changed. Perhaps now it’s something like:

 ~/Downloads

~ refers to your home folder. So this means: you’re inside the Downloads folder, which is in your home folder. ~ is shorthand for whatever your home folder is, which is handy because the actual path might change (for example, you might change your username, which changes your home folder). cd ~/Downloads will therefore work regardless of this.

If you’re in Downloads and you want to go back one level you can type:

cd ..

Which means going back (or up) one level.

mkdir (short for ‘make directory’) creates a directory (or folder). This command also takes in an argument, which is the name of the folder you want to create.

Practice: Let’s try it. Type in:

mkdir something

And now you will have a folder called something in your working directory.

Practice: You might as well get into that folder, and then go back. How would you do that?

By the way, this is a good time to tell you that you can get “auto-complete” if you press the tab key. If you type cd some and then tab, you will see it completes to something, the folder you just created. This works for commands, folders and files.

mv moves a file. It’s similar to cd or mkdir in that you only have to write mv followed by the file you want to move, but you also have to include the destination. For example, imagine you created a file called list.txt and you want to move it to the something folder you created before. That’d be:

mv list.txt something/

cp copies a file. Its usage is similar to the previous example, except that it keeps a copy of the file in the original place (cp is to mv what CTRL + C is to CTRL + X):

cp file folder/file

You might have noticed that a folder is distinguished from a file by the / at the end of its name.

Exercise: To discover a new use of mv, try mv on a file, but instead of giving a folder’s name ending in / as the second argument, write a name that doesn’t exist yet (as in mv list.txt newname). What happened?

Let’s say you go into the something folder and you want to delete your list.txt. To remove it, do:

rm list.txt

Mind that in the terminal there is no confirmation prompt for this action, and files aren’t moved to the trash - they just disappear, meaning you have to be extra careful of what you delete in there.
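
If you’d like a small safety net while you practice, rm also has an interactive option that asks for confirmation before deleting:

rm -i list.txt

With -i, rm prompts you before each removal. You’ll learn how to find options like this one in a moment.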

What if you want to remove a folder, like the one you created before, something/? Try it in your terminal. You should see something like this:

rm: cannot remove 'something': Is a directory

The solution in cases like this is browsing the manual, perhaps the most useful tool for learning any program in the terminal. Typing man and the name of the program you want to learn about opens the manual for that tool, including all its optional arguments.

Exercise: Enter man rm and browse the manual until you find the argument you have to provide to remove whole folders, and whatever is in them – a recursive removal. The structure is rm -letter folder.

So, to recapitulate:

  • pwd prints your working directory.
  • ls lists the files and folders in it.
  • cd moves you to another directory (cd .. moves you up one level).
  • mkdir creates a new directory.
  • mv moves (or renames) a file.
  • cp copies a file.
  • rm removes a file (and, with the right option, a folder and its contents).
  • man shows the manual of any of these programs.

Some easy commands to begin exploring data

Now we are going to look at various ways of opening and manipulating files.

less is the kind of utility you need to peek at very big files that can’t be opened normally. less shows a file one screenful at a time, requiring almost no processing power. It’s a great tool for all those 1 GB+ files you might have.

Exercise: Create or get a text file of any kind and put it in the folder you happen to be in right now (Or navigate your way to the file if you want to practice that!). Let’s imagine that file is called myfile.txt. Do:

less myfile.txt

You will see as many lines as fill up your screen, and you can scroll down or press Enter to move through the file. To exit less, press q (or :q, as in “command + quit”). This will be your common way of exiting terminal-based viewers and editors.

wc shows line, word and character counts. You can restrict it to just one of these with options, such as wc -l, which gives you only the line count. Try it on your myfile.txt file!
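
As a quick illustration, on a hypothetical file with exactly ten lines, counting only the lines looks like this:

wc -l file.txt
    10 file.txt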

head shows the first 10 lines of a file. With the -n option, you can define how many lines you want to see. Let’s say you have a file called file.txt with 10 lines: one, two, three and so on. Doing:

head -n 5 file.txt
    one
    two
    three
    four
    five

The command tail does the same, but with the last lines of a file, saving you some endless scrolling down.
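
For instance, assuming the same file.txt whose ten lines simply count from one to ten, asking for the last two lines would give:

tail -n 2 file.txt
    nine
    ten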

Now imagine what you want is to compare two files for differences. diff does exactly that. To illustrate how it works, imagine you have two text files, file1.txt and file2.txt.

file1.txt contains:

    one
    two
    three
    four

file2.txt contains:

    one
    three
    five
    seven

If you do:

    diff file1.txt file2.txt

you will get the differences between the two. There are different ways of outputting the result of this comparison. Option -u, for example, is a useful one. Doing:

diff -u file1.txt file2.txt

will give you:

     one
    -two
     three
    -four
    +five
    +seven

Where lines with no symbol in front of them are in both files, lines with - are only in file1.txt, and lines with + are only in file2.txt.

Exercise: maybe you’d like to learn other output formatting options. Why don’t you check what the -y option does, for example?

comm also compares two files, but in a different way. It walks through both files line by line (so it matters where in the file each line appears, not only what it says), and gives you three columns: lines unique to file1.txt, lines unique to file2.txt, and lines common to both.

comm file1.txt file2.txt

will give you:

                    one
            three
            five
            seven
    two
    three
    four

Recapitulating:

  • less lets you peek at big files one screenful at a time.
  • wc counts lines, words and characters.
  • head and tail show the first or last lines of a file.
  • diff shows the differences between two files.
  • comm compares two files line by line, in three columns.

Exercise: delete the first line of file2.txt using only tail and arguments. Feel free to google the solution, or use only man for an extra challenge.

There are many other programs that you could be using, but first we want to tell you a couple of things that might ease your learning curve:

General notes about the terminal environment

A few conveniences will make your life much easier. Pressing the up and down arrow keys moves through the commands you have previously typed, and the history command prints all of them at once. The * wildcard stands for “anything”: for example, *.txt matches every file ending in .txt in your working directory, and *test* matches every file with test somewhere in its name.

Exercise: since you have already typed out some commands, you can try browsing your history now in your terminal!

Exercise: remember when we gave you an example of a script early on? Try creating some files with test in the name and others without it, and then deleting only the ones with test in their names from the terminal.
Question: what would sort *.txt do? Try it out, checking the manual page of sort first if you prefer.

Redirecting output

So far, all these commands have produced output that, if everything went well, is printed in your terminal. But what if you want to have a file with the results of a command? That’s what output redirection is for. You can redirect the printed output of any command with >. For example,

comm file1.txt file2.txt > result.txt

will generate a new file called result.txt with the results we showed you above. You also have the option of providing input for a command with < and appending the output of a command to a file with >>.
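
As a small sketch of the other two (notes.txt here is just a hypothetical file):

sort < notes.txt
date >> notes.txt

The first line feeds notes.txt to sort as its input; the second appends the current date and time to the end of notes.txt instead of overwriting it.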

Exercise: Try appending the result of diff to the same result.txt file now, and examine it with less or head/tail

Exercise: Imagine you have to know how many files and folders there are on your Desktop. How would you do it with two of the commands we have shown you already? Generate a file with that list.

Pipes

Most of these utilities read and write plain text, which means you can chain them together, feeding the output of one as the input of the next, using |. This is one of the most useful features of the Unix shell.
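
As a tiny illustrative example:

ls | wc -l

Here the list produced by ls is never printed on screen; it is handed straight to wc -l, which only prints how many entries your working directory contains.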

Exercise: Try to understand what the following pipeline does. Use man if necessary:

 sort *.txt | uniq -c | sort -nr > all.txt

Question: Why did we use sort twice here?

Remember when we spoke about one-liners? This is how many of them are written: combining command-line programs with specific options in innovative ways that solve a very specific kind of problem.

If you know any programming language, think about how you would do the same task and compare it to this code. While it’s not always the case, some of these programs are way quicker and shorter than their equivalents in, for example, Python.

Recapitulating:

  • > redirects the output of a command into a file (overwriting it).
  • >> appends the output to the end of a file instead.
  • < feeds a file to a command as its input.
  • | sends the output of one command straight into the next one.

A step up: grep, sed, awk and using regex

So far we have seen various programs, but you’ll notice some of the most useful ones aren’t here. Here are three extremely powerful tools for text/data manipulation: grep, sed and awk. They can all profit from a formal string-searching language invented in the 50s called regex, so let’s start with that.

Regex

Regex stands for regular expression (as in an expression that describes a regular language in the Chomsky hierarchy sense, which is a piece of information that might help you or confuse you further). Using regex allows you to capture patterns of characters instead of literal strings. Regex are very pervasive and you might come across them in some programming languages, such as Python.

Arguably, writing regex is not the most intuitive task out there, but with some practice you get the hang of it. Take, for example, ^.\d*$, which, of course, would capture all lines starting (^) with any one character (.), followed by any number (*) of digits (\d, hence \d*), followed by the end of the line ($).

Luckily for you, you only have to know a handful of these expressions for regular work, and there are plenty of cheatsheets out there to help you. The strictness of regex, however, is great for all sorts of problems. Let’s imagine you have the following file, called patterns.txt:

    1
    123
    123 123
    a
    abc
    abc 123
    a123

Question: Which of these lines would be captured by the regex we used as an example (^.\d*$)?

Let’s go over it again:

  • ^ marks the start of the line.
  • . matches any single character.
  • \d* matches any number of digits (including none).
  • $ marks the end of the line.

Let’s see what would be captured:

  • 1, 123, a and a123 fit the pattern: one initial character followed by nothing but digits (or nothing at all).
  • 123 123, abc and abc 123 don’t, because after the first character they contain something other than digits.

We can check this quickly on our file by using one of the most powerful Unix tools out there: grep. grep accepts regex input, such as ^.\d*$, and can be used to capture all the lines in our patterns.txt file matching it (we’ll explain the -P option used below in a moment).

grep -P '^.\d*$' patterns.txt

The result should be:

1
123
a
a123

There are at least two syntax versions of Regex: one built for the Perl programming language, which is the one we have been using, and another called POSIX (which is arguably more transparent). You have to be careful with this when writing regular expressions.

To ensure that your version of grep uses Perl-like regex you can always use the -P option. Sometimes solutions will be easier to memorize in a particular Regex syntax, but this isn’t usually a problem, since you’ll probably not memorize the dozens of options available anyway.

If you want to delve into regex, a useful resource you could use is the awesome regexlearn webpage.

Grep

As we have introduced just before, grep is a great terminal utility used for finding strings in a file. Imagine having to perform a search through a series of .csv documents for a specific string, let’s say FOXP2. You could do something like grep FOXP2 *.csv. One of the beauties of grep is that it can also search by regular expressions, among many other options that make it an extremely useful tool.

Let’s check, for example, one of the handiest options: grep -f. It takes a plain-text document as input and runs grep for each of that document’s lines against the files given as the remaining arguments. Imagine the following document with four gene names, called input:

    FOXP2
    FOXP1
    AMIGO
    EDAR

A line like grep -f input *.txt > output will produce a new file, called output, containing every line of the working directory’s .txt files where grep has found any of these terms.

Exercise: check your history for all the times you have used rm (or some other terminal program).
Exercise: how many times have you used each command so far?

Exercise: Try searching for “FOX” in a file with those contents with grep. What happened? Now try with grep -w. What does -w stand for, and why should you always keep it in mind?

Sed

Sed filters and processes text. Sed can, for example, substitute (s/) any matched string with whatever you want across a large number of files without breaking a sweat, and that is its most common usage. Sed has many options, but usually / delimits the fields of the expression. Sed also takes flags, such as /g (global), which control how many replacements sed makes on each line (in this case, all possible ones).
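
As a minimal sketch (colours.txt here is just a hypothetical file):

sed 's/colour/color/g' colours.txt

This prints the contents of colours.txt with every occurrence of “colour” replaced by “color”; the original file is untouched unless you redirect the output or use the -i option we saw in the email example earlier.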

Question: what does sed 's/freqeunt/frequent/g' *.txt do?

Exercise: think about an instance where sed would have saved you time in a task!

Exercise: Sed also accepts regex. How would you transform a comma-separated .csv into a tab-separated .tsv? Tip: to escape a character so that it’s not interpreted as a regular expression, you have to precede it by \ (a backslash), like grep -P "stringfollowedby\?".

You can give sed multiple instructions at once, like this:

sed 's/freqeunt/frequent/g ; s/typpo/typo/g' *.txt > resultfile.txt

Sed is great, but the documentation is known to be a mess, so most people stick to the s/ command, arguably the most useful one. However, if you are extra motivated, we encourage you to learn other uses. In particular, we recommend this comprehensive guide. Ever wondered how, for example, to erase all the instances of a particular word in a text? I haven’t, but you can learn how there.

Awk

Awk is, well, awkfully ugly to write, especially compared to modern programming languages, but it’s also a very powerful tool (and actually its own programming language!). Awk specializes in column-based data. Imagine you had a file like this and you wanted to get every item in the chrom column:

    #Snp chrom position feature
    rs1892 12 1233455 Something
    rs1980802 12 1233470 Somethingelse
    rs1123213 20 2333455 Nothing

You could cobble together an error-prone, hacky solution with grep, or import the file into Python and use pandas, but you’d be surprised how often awk is more convenient than those options.

Essentially, awk works this way: awk '(condition, if any){action}' inputfile. So, if you wanted to get columns two and three out of the example file provided above, you could do it this way:

awk '{print $2, $3}' examplefile
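
On the example file above, this would print something like (the first line comes from the header row):

    chrom position
    12 1233455
    12 1233470
    20 2333455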

Note several things: what you want awk to do is enclosed in ', and the action itself is enclosed in {curly brackets}. Awk is its own beast and we can’t cover it in full (there are books for that), but you should know it is so useful that there’s a flavour of it specifically for genetics data, bioawk. As long as your data is more or less tidy, awk is worth learning. I’ve heard people swear by it and have seen whole programs written in it doing the same things people use Python for nowadays. You don’t have to go to that extreme, but there’s a reason some people do: don’t underestimate awk.

As a full-fledged programming language, awk also has if conditionals, for example. Here’s a recent example I used in my work:

awk '{if ($3==$4) print $0, FILENAME}' *.csv

What this does is check whether columns 3 and 4 have the same content in each file, and then print the whole row when the condition is met (that’s what $0 means), along with the filename (literally, FILENAME). This is of course a bit trivial, but it can get way trickier than this, such as checking whether two columns are the same in two files and creating a new file with some of the contents of the first file when there’s a column match with the values of the second file (convoluted, yes, but a very common problem at which awk excels).

Among other things, awk accepts regex as well. For example:

awk '/pattern/{ print $0 }' file.txt
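
For instance, on the ten-line file.txt from earlier (again assuming its lines simply count from one to ten), this picks out every line containing a t, much as grep would:

awk '/t/{ print $0 }' file.txt
    two
    three
    eight
    ten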

Question: how would you get column 1 in a file like the one we showed at the beginning of this section without using awk, i.e. using only tools that have already been explained here? Tip: you are allowed to check regex cheatsheets.

Exercise: Think about some of the tools you use to manipulate and extract data. How could you integrate awk into your day-to-day work? Why don’t you try extracting a column or two of interest to a new file in your preferred tool and compare it with what you just learned?

A last word: sudo

sudo is the basic command to assume the root or superuser role, meaning it gives you permission to do things that require certain privileges. It exists so that not just anyone can perform actions on a computer that might be harmful to it; only the administrator can. When something requires this role, you will be asked for your password. Type it in and press Enter (note: you will not see the password or asterisks show up on your screen).

You might be shown an error or “permission denied” message when trying certain operations. Try them out with sudo and you should be ok. For example, depending on where you are in your filesystem, you might only be able to create new directories by doing:

sudo mkdir myfolder

and then typing your password.

Installing new programs

The easiest way to install new terminal programs is by using a package manager. Package managers allow you to keep your system up-to-date by upgrading existing software, as well as install new software.

Linux

Different Linux distributions come with different package managers. A common one is apt, which comes with Debian-based distributions, such as Ubuntu, a popular distribution for Linux beginners. There are other package managers out there such as snap that we won’t cover here, but that work similarly.

To keep your system up-to-date, you do:

sudo apt-get update

This updates the lists of available software from the sources your system knows about. It signals to your computer: “here’s how you get the new stuff!”.

You can then do:

sudo apt-get upgrade

And this will actually upgrade the programs you currently have installed.

To install a new package, you just need the name of the program you want to install (for example, pandoc) and do:

sudo apt-get install pandoc

And after installation you will be able to use pandoc commands. If you do exactly as instructed and you get a command not found message, chances are you simply need to install that program.

MacOS

MacOS does not come with a package manager pre-installed, because the large majority of Mac users do not use the terminal (the same cannot be said of Linux users). A popular one is Homebrew. Go to the Homebrew website and follow the installation instructions, which amount to pasting a line of code into the terminal.

The process is similar to what we explained before, with one main difference: you do not need special permissions (so no sudo).

To update your system, do:

brew update

and then:

brew upgrade

To install a program (say, pandoc) do:

brew install pandoc

Connection to remote machines

Sometimes, especially if you work with large amounts of data or perform heavy computations, you need to connect to a remote machine. A common way to do so is by using the ssh protocol (short for secure shell), which is most often used through the terminal. To do so, you need to know the IP address, username and password of the computer you are going to connect to (if you need to connect to a remote machine to do your work, you were probably provided this information). If the remote machine’s username is user and the IP address is 192.168.2.100, all you need to do is:

ssh user@192.168.2.100

You will then be prompted to type in the password, and you will be in. Once in, you will have access to the files and programs that user would have locally, as if you had opened a terminal in that remote machine. The things you’ve learned so far in this course should give you a leg up should you find yourself in this situation.

Some people even use computers that have no screen attached to them, and access them by connecting remotely through ssh. The Raspberry Pi, for example, is a popular example of this. Some people have one hiding away somewhere in their home, doing something cool, undisturbed, connected to the internet but with no screen, keyboard or mouse attached. Whenever they need to check up on it, they just “ssh into it” from another computer in the house or the internet.

The end

Congrats, you have finished our quick course! Here’s your prize: an ASCII train

If you know your shell basics, maybe you’d enjoy my other post on tools and good practices in bash I've been experimenting with lately, and if you want to learn LaTeX, Pedro has an amazing guide that is essential for every beginner (I personally learned LaTeX almost entirely using it!).


We hope this course was useful for you! Did you find an error, an annoying typo or want to provide feedback? Please drop me a mail at sanyi [DOT] personal [AT] google’s main mail service