8 Bash scripts
8.1 Writing shell scripts
Shell scripts are small programs written in the language of the shell; they’re convenient for tying together a series of commands that you might otherwise type by hand into a repeatable and documented set of operations. Bash scripts also allow us to write new commands or programs that can be used in a manner similar to other “built-in” programs we’ve been using.
8.2 Bash scripts are picky
One annoying feature of bash scripting is that it’s particularly picky about white space around commands and arguments, because essentially it tries to execute each line as if it was written at the command line itself (see). When following the examples below I recommend you be careful to follow the spacing I’ve used to avoid hard to diagnose syntax errors. For this reason, bash is not my preferred scripting environment. However it’s so ubiquitous in bioinformatics that it’s worth learning the basics of bash.
8.3 Simple script examples
8.3.1 Simple shell script 01
Let’s start with the simplest possible shell script – one that simply writes a message to stdout.
Put the following code into a file named tut01.sh:
# tut01.sh
echo "Hello, shell world!"Now you can call your shell script as so from the command line (assuming it’s in your current working directory)
bash tut01.shThis one line script simply called the echo command line tool to write text to stdout.
8.3.2 Simple shell script 02
That’s a pretty boring shell script. Let’s create a new version that takes as input an argument telling it what to print after the initial greeting. $1 is how we designate the first argument from the command line.
# tut02.sh
echo "Hello, $1"This expects an argument from the command line, so run this script as follows, substituting Paul as appropriate:
bash tut02.sh Paul8.3.3 Simple shell script 03
Shell scripts can create variables to hold the output of commands. This is illustrated below in which we create a variable input_reversed to hold the value of a computation done on the first argument to the script:
# tut03.sh
# note that rev command reverses lines of input
input_reversed=$(echo $1 | rev)
echo "$1 reversed is $input_reversed"And we again run it as:
bash tut03.sh "Twas brillig and the slithey toves..."Notice that I wrapped the text in quotes. What happens if you don’t include the quotes?
8.3.4 Simple shell script 04
We can get multiple arguments from the command line:
# tut04.sh
echo "The first argument was: $1"
echo "The second argument was: $2"Or we can check the number of arguments:
# tut04a.sh
nargs=$#
if [[ $nargs -ne 2 ]]
then
echo "I need two arguments!"
exit
fi
echo "The first argument was: $1"
echo "The second argument was: $2"8.4 Different ways of using Bash scripts
8.4.1 Reproducibility
Did your analysis require a bunch of different steps? Could you reproduce what you did? Could you teach someone else to do it?
Put each of the steps of your analysis into a bash script and save the file with a
.shending. You or someone else can replicate your analysis by runningbash yourscriptname.sh. Consider saving all such scripts you create for reproducibility purposes to a git repository.
Example:
# yeast_feature_counts.sh
#
# Purpose: count feature types in yeast genome annotation GFF file
# Output: a CSV table sorted by freq of feature
wget -q https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz
gunzip GCF_000146045.2_R64_genomic.gff.gz
awk -F"\t" -v OFS="," \
'NF == 9 {cts[$3] += 1}
END {for (ftr in cts)
print ftr, cts[ftr] }' \
GCF_000146045.2_R64_genomic.gff |
sort -t, -k 2 -nr You can execute this script by calling it as an argument to bash:
bash yeast_feature_counts.sh > yeast_features.csv8.4.2 Reusability
You developed a pipeline to solve one instance of a problem. Is this a type of computation you do repeatedly or are likely to do in the future on different inputs?
Create a bash script “template” by re-writing the script to include variables for specifying input and output file name, key parameters, etc.
# feature_count_template.sh
#
# Purpose: count feature types in any GFF annotation
# output a CSV table sorted by freq of feature
# --- REPLACE THESE VARIABLES --- #
GFF_URL="" # RefSeq URL Location
OUT_DIR="output" # Directory for output, will be created if necessary
# -- END USER DEFINED VARIBLES -- #
# Create output directory if it doesn't exist
if [ ! -e ${OUT_DIR} ]; then
mkdir -p ${OUT_DIR}
fi
# intermediate files written to OUT_DIR
gff_gz=${OUT_DIR}/gff.gz
gff_file=${OUT_DIR}/features.gff
# download and unzip the data if
if [ ! -f ${gff_file} ]; then
wget ${GFF_URL} -O ${gff_gz}
gunzip ${gff_gz} -c > ${gff_file}
fi
# run the analysis sending the results to stdout
awk -F"\t" -v OFS="," \
'NF == 9 {cts[$3] += 1}
END {for (ftr in cts)
print ftr, cts[ftr] }' \
${gff_file} |
sort -t, -k 2 -nr If you, or someone you gave the script to, wanted to use this to run the analysis on another organism you would make a copy:
cp feature_count_template.sh ecoli_feature_count.shmodify the GFF_URL and OUT_DIR lines appropriately:
# ecoli_feature_count.sh
#
# Purpose: count feature types in E coli genome
# Output: a CSV table sorted by freq of feature
GFF_URL="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz"
OUT_DIR="Escherichia_coli"
... <rest of template stays the same...>and then run the script:
bash ecoli_feature_count.sh8.4.3 Encapsulation and Abstraction
You developed a pipeline that carries out a useful computation that you want to integrate into other analyses or pipelines
Use a bash script to turn your pipeline into a command that can be integrated into other pipelines, following the Unix philosophy. From the user’s perspective, you can think of this as “abstracting away” the details of how the analysis works. They no longer have to understand the details of how the program or command works, they simply have to know what data to call it with and the form of the output.
Here we achieve this in a very simple way – by providing a way for our script to process arguments provided at the command line.
#!/usr/bin/env bash # feature_count.sh -- download a GFF file from RefSeq and # count the feature types nargs=$# if [ $nargs -ne 2 ]; then echo "usage: $0 URL OUTDIR" exit 1 fi GFF_URL=$1 # RefSeq URL Location OUT_DIR=$2 # Directory for output, will be created if necessary # Create output directory if it doesn't exist if [ ! -e ${OUT_DIR} ]; then mkdir -p ${OUT_DIR} fi # intermediate files written to OUT_DIR gff_gz=${OUT_DIR}/gff.gz gff_file=${OUT_DIR}/features.gff # download and unzip the data if if [ ! -f ${gff_file} ]; then wget ${GFF_URL} -O ${gff_gz} gunzip ${gff_gz} -c > ${gff_file} fi # run the analysis sending the results to stdout awk -F"\t" -v OFS="," \ 'NF == 9 {cts[$3] += 1} END {for (ftr in cts) print ftr, cts[ftr] }' \ ${gff_file} | sort -t, -k 2 -nr
Make this script executable (chmod +x feature_count.sh). You can call it directly from your working directory or add it to your PATH. Having done this you can use this script by specifying a URL and a directory name to write the output files to. Since the URL is long I assign it to a variable:
celegans_url=https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Caenorhabditis_elegans/reference/GCF_000002985.6_WBcel235/GCF_000002985.6_WBcel235_genomic.gff.gzAnd then call the bash script
feature_count.sh $celegans_url celegans8.4.3.1 Back to reproducibility
By encapsulating our pipeline we’ve made it reusable and generalizable to analyses of arbitrary GFF annotation files. However, we’ve lost an element of reproducibility because we’re no longer documenting the data sources used when we invoke it at the command line. Luckily it’s easy to restore that with – you guessed it – another bash script!
To illustrate this, let’s assume we wanted to compare feature counts across multiple genomes. Let’s put species names and corresponding RefSeq URLs in a CSV file called genomelist.csv:
Saccharomyces_cerevisiae,https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz
Escherichia_coli,https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz
Caenorhabditis_elegans,https://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/Caenorhabditis_elegans/reference/GCF_000002985.6_WBcel235/GCF_000002985.6_WBcel235_genomic.gff.gz
Now we can generate feature counts for each genome by creating a bash script called multi_species_count.sh that includes a call to parallel:
# multi_species_count.sh
parallel --colsep="," 'mkdir -p {1}; ./feature_count.sh {2} {1} > {1}/ftrcounts.csv' :::: genomelist.csvand run this as:
bash multi_species_count.shFor each species name in your input list, there should now be a corresponding directory that contains a ftrcounts.csv file with the respective counts:
$ ls -1 */ftrcounts.csv
Caenorhabditis_elegans/ftrcounts.csv
Escherichia_coli/ftrcounts.csv
Saccharomyces_cerevisiae/ftrcounts.csvIn terms of reproducibility, our full analysis now requires three files:
- The
feature_count.shscript - The list of genomes we’re processing:
genomelist.csv - Our
multi_species_count.shthat processes the information in (2) through the pipeline in (1)
There is a slight increase in complexity versus our single species script, but with the following advantages:
It’s trivial to extend our analysis to include new species/genomes of interest – we simply add additional species names and URLs to
genomelist.csvWe can run this analysis efficiently for many genomes at once by adding the argument
--jobs=nto the call toparallelwherenis the number of threads we want to run simultaneously (assuming that you have access to a multi-CPU machine or a compute cluster)
8.5 Other resources
The above is a short intro to bash scripting. Some other useful resources include:
- https://linuxconfig.org/bash-scripting-tutorial-for-beginners – provides a good overview of bash
- https://www.shellscript.sh/ – an in-depth tutorial on bash scripting