Part 1 of the Museomics Workshop
This tutorial is Part 1 of the Museomics Workshop (CVZoo XIV 2025, University of São Paulo, Brazil). Because most programs used to assemble historical DNA sequences in museomics and to perform downstream phylogenetic analyses do not provide a graphical user interface (GUI), the goal of this tutorial is to introduce students to the Linux command line.
Before starting the tutorial, download the resources here. Unzip the files and go to the output directory (try cd /mnt/c/Users/Aluno/Downloads/CVZoo*/CVZoo/).
Check Part 2 of the Museomics Workshop here.
Linux was created in 1991 by Linus Torvalds (1969, Finland) as a free and open-source kernel, the core program that controls the CPU, manages memory, handles files, and communicates with hardware. However, a kernel alone is not an operating system (i.e. Linux by itself is not comparable to Windows or macOS). The remaining components, including command-line tools, compilers, libraries, and shells, come largely from the GNU Project, which was initiated by Richard Stallman (1953, USA) in 1983. As such, most systems commonly referred to as “Linux” should more properly be called “GNU/Linux”.
Linux systems can be distributed in different packages (distros). Some popular distros are:
| Distribution | Pros | Cons |
|---|---|---|
| Ubuntu | Beginner-friendly; large community and documentation; excellent hardware support; widely used in servers, cloud, and bioinformatics | Heavier system; less conservative than Debian; some design choices are controversial |
| Debian | Extremely stable and reliable; conservative updates; large repositories; foundation of many other distros | Older software versions; hardware support may lag; less beginner-oriented |
| Arch Linux | Minimal and highly customizable; rolling release; excellent documentation (Arch Wiki); great for learning Linux internals | Steep learning curve; manual installation; higher risk of breakage; not beginner-friendly |
| SteamOS | Optimized for gaming; excellent performance on Steam Deck; showcases Linux on consumer devices | Not designed for general-purpose computing; limited customization; unsuitable for scientific workflows |
| Zorin OS | Very beginner-friendly; Windows/macOS-like interface; Ubuntu-based with good software support | Smaller community; limited flexibility; rarely used in research or servers |
| Linux Mint | User-friendly and stable; traditional desktop layout; lightweight; based on Ubuntu or Debian | Slower adoption of new technologies; desktop-focused; uncommon in servers or HPC |
When mobile distributions are taken into account, Android is the most popular Linux-based system. As such, most people are Linux users, even if they are unaware of it.
Windows is the most popular desktop operating system. However, most programs used in bioinformatics are only available for Linux and macOS, because both are Unix-like, i.e. they follow the design of UNIX (a family of operating systems that originated in 1969-1971 at AT&T Bell Labs).
To use Linux commands in Windows, there are four options: a remote Linux server, dual boot, a virtual machine, and the Windows Subsystem for Linux (WSL).
Given that a server may not be available, dual boot requires partitioning the disk, and virtual machines are slow in most cases, the best solution is WSL. Microsoft released the first version of WSL in 2016, but many tools were incompatible with it. WSL 2 followed in 2019 and supports Docker, Conda, and most modern Linux tools. Microsoft created WSL because Linux dominates servers, cloud, high-performance, and scientific computing. Using WSL, beginner scientists can keep using Windows while also using Linux tools (although I recommend that all scientists migrate to Linux).
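To install WSL 2, Windows 10 or 11 is required. Run the first command in Windows PowerShell and the remaining two inside the Linux shell that opens after installation:
wsl --install
sudo apt update
sudo apt upgrade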
The Windows files are usually located at /mnt/c/.
To run Linux from the command line, the user must open a shell, a program that sends the commands from the user to the operating system. In GNU/Linux, the most common shell is called Bash (Bourne Again Shell).
In Bash, a command receives data through the standard input (stdin; e.g. hDNA reads or a cladistic matrix) and writes its results to the standard output (stdout; e.g. assembled contigs or phylogenetic trees). In addition, programs write to the standard error (stderr), which is typically used for warnings and error messages.
In Bash, the commands are case-sensitive (e.g. echo is a valid command, whereas Echo or ECHO are not). Moreover, pressing the “Tab” key attempts to autocomplete commands and file names (helpful for lazy users). The semicolon ; executes commands sequentially and independently, whereas the pipe | connects the output of one command directly to the input of the next.
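The difference between the semicolon, the pipe, and the three standard streams is easier to see in a small experiment. The sketch below works in a scratch directory; the variable `demo` and the file names `seqs.txt`, `sorted.txt`, `listing.txt`, and `errors.log` are invented for illustration.

```shell
# Work in a scratch directory so nothing in your project is touched
demo=$(mktemp -d)
cd "$demo"

# Create a small text file with three lines
printf 'ACGT\nTTGA\nACGT\n' > seqs.txt

# Semicolon: two independent commands, executed one after the other
echo "first"; echo "second"

# Pipe: the stdout of sort becomes the stdin of uniq
sort seqs.txt | uniq -c

# Redirection: < feeds a file to stdin, > saves stdout, 2> saves stderr
sort < seqs.txt > sorted.txt 2> errors.log

# A failing command writes its message to stderr, so listing.txt stays empty;
# the part after || runs only because ls fails
ls no_such_file > listing.txt 2>> errors.log || echo "ls failed, see errors.log"
cat errors.log
```

Keeping stdout and stderr in separate files is convenient in long analyses, because the results stay clean while the warnings remain available for inspection.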
When working in Linux, especially in bioinformatics projects, analyses are organized into directories containing raw data, intermediate files, and results. Efficient navigation of the filesystem is therefore essential.
pwd # show current directory
ls . # list files in the current directory
ls -lh # list files with sizes
ls -a # list all files (including hidden files)
ls -lhS # list files sorted by size
ls -ltr # list files sorted by modification time
cd .. # go up one directory
cd ../.. # go up two directories
cd ~ # go to home directory
cd / # go to the root directory
cd - # return to the previous directory
file # classify a file
du -sh # size of current directory
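As a short illustration of how these navigation commands combine, the sketch below builds a throwaway directory tree and moves through it. The variable `nav` and the temporary location are assumptions made for the example; only the museomics/part1/ex2 layout mirrors the exercises.

```shell
# Build a throwaway directory tree in a temporary location
nav=$(mktemp -d)
mkdir -p "$nav/museomics/part1/ex2"

cd "$nav/museomics/part1/ex2"  # go into the deepest directory
pwd                            # confirm where we are
cd ../..                       # go up two directories (to museomics)
ls                             # lists part1
cd -                           # return to the previous directory (ex2)
du -sh .                       # size of the current directory
```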
Exercise 1
In a single line, which command sequentially (1) lists files with sizes in the current directory, (2) lists all files in the parent directory, (3) goes to the parent directory, and (4) shows the path to the current directory?
A)
ls -lh .; ls -a ..; cd ..; pwd
B)
ls -lh ..; ls -a .; cd ..; pwd ..
C)
ls -lh . ls -a .. cd .. pwd
D)
ls -lh .| ls -a .. | cd .. | pwd
See the answer
A
Exercise 2 Consider that you are in museomics/part1/ex2:
museomics/
└── part1/
    ├── ex2/
    │   ├── sample1.fq
    │   ├── sample2.fq
    │   └── consensus.fas
    └── ex3/
Which single-line command prints the current directory, lists all files in the current directory, identifies the file type of sample1.fq, and shows the total disk usage of the directory?
A)
pwd; ls -a; file sample1.fq; du -sh
B)
pwd; ls -a; file ../sample1.fq; du -sh data
C)
pwd | ls | file sample1.fq | du -sh ..
D)
pwd; ls; file raw/sample1.fastq; du -sh .
See the answer
A
Once you can navigate the filesystem, the next essential skill is creating, copying, moving, renaming, and deleting files and directories. These operations are fundamental in bioinformatics, where workflows typically involve organizing raw data, intermediate files, and results into structured directories.
cp file . # duplicate files to the current directory
cp -r directory . # copy a directory to the current directory
mkdir new_dir # make a new directory called new_dir
mkdir -p new/dir # create parent directories if necessary
mv ../file . # move files
rm file # remove files
rm -r directory # remove directories
rm -ri directory # remove directory asking for confirmation
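These commands can be practiced safely in a scratch directory before touching real data. The sketch below is one possible sequence; the directory and file names (`work`, `project`, `backup.fq`, `project_copy`) are made up for the example.

```shell
# Practice in a scratch directory
work=$(mktemp -d)
cd "$work"

# Create a project layout and a tiny FASTQ file with one read
mkdir -p project/raw project/results
printf '@read1\nACGT\n+\nIIII\n' > sample1.fq

cp sample1.fq project/raw/           # copy the file into project/raw
mv sample1.fq project/raw/backup.fq  # move and rename in a single step
cp -r project project_copy           # copy the whole directory tree
rm -r project_copy                   # remove the copy again
ls -R project                        # list the result recursively
```

Note that mv both moves and renames, and that rm deletes permanently: there is no recycle bin on the command line.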
Exercise 3 You are in museomics/part1/ex3, a directory containing:
consensus.fas sample1.fq sample2.fq notes.txt
Write a single command line that (1) creates a directory called raw and (2) moves all FASTQ (= fq) files into raw.
See the answer
mkdir raw; mv *.fq raw/
Exercise 4 You are working in the directory museomics/part1/ex4, which contains the following files:
consensus.fas sample1_R1.fq sample1_R2.fq sample2_R1.fq sample2_R2.fq sample3_R1.fq sample3_R2.fq control_R1.fq control_R2.fq notes.txt
In a single command line, (1) create the directories analysis/reads and analysis/reads2 if they do not exist, (2) copy the FASTQ files from samples 1 and 3 (both R1 and R2) into analysis/reads, and (3) copy all R2 files into analysis/reads2.
See the answer
mkdir -p analysis/reads analysis/reads2; cp sample[13]_R*.fq analysis/reads/; cp *_R2.fq analysis/reads2/
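Before copying or deleting with wildcards, it can help to preview which files a pattern matches by passing it to echo. The sketch below recreates the file names of this exercise as empty files in a scratch directory (the variable `ex4` is invented) and checks each pattern first.

```shell
# Recreate the exercise files as empty files in a scratch directory
ex4=$(mktemp -d)
cd "$ex4"
touch consensus.fas notes.txt \
      sample1_R1.fq sample1_R2.fq sample2_R1.fq sample2_R2.fq \
      sample3_R1.fq sample3_R2.fq control_R1.fq control_R2.fq

# Preview the matches: [13] means "1 or 3", * means "any characters"
echo sample[13]_R*.fq
echo *_R2.fq

# Once the patterns look right, use them in the real commands
mkdir -p analysis/reads analysis/reads2
cp sample[13]_R*.fq analysis/reads/
cp *_R2.fq analysis/reads2/
ls analysis/reads analysis/reads2
```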
Bioinformatics workflows rely on text files, including FASTA, FASTQ, SAM/BAM (text/binary), VCF, and comma- and tab-delimited tables. Linux provides powerful command-line tools to view these large text files.
cat file # print the file
echo hello # print strings
less file # view files safely
less -N file # view files with line numbers
head file # print the first 10 lines
head -n 20 file # print the first 20 lines
paste file1 file2 # merge two files
paste - - # merge each two lines from stdin
tail file # print the last 10 lines
tail -n 20 file # print the last 20 lines
wc file # count lines, words, and characters
wc -l file # count lines
wc -w file # count words
wc -c file # count bytes
wc -m file # count characters
zcat file # print gzipped files in Linux
gzcat file # print gzipped files in macOS
In addition to text visualization, Linux provides commands to manipulate files.
grep # search for patterns
grep -i # search for patterns ignoring case
grep -v # inverted search (excluding patterns)
grep "^" # match beginning of line
grep "$" # match end of line
cut # extract specific columns from tables
cut -d # specify delimiter (default: TAB)
sed 's/pattern/new_pattern/g' # substitute every match of pattern with new_pattern
sed -i 's/pattern/new_pattern/g' # substitute within the file itself (irreversible)
sed '1~3 y/actg/ACTG/' # substitute actg with ACTG in lines 1 + every 3 lines (only works in GNU/Linux, not macOS)
sort # sort lines alphabetically or numerically
sort -r # sort using reverse order
sort -k # sort by column
sort -t # specify delimiter
sort -u # unique lines
tr acgt ACGT # translate each character in set1 to each character in set2
uniq # deduplicate lines (requires sorted input)
nano # beginner-friendly text editor
vi # advanced text editor
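Most of these tools become useful when chained with pipes. The following sketch applies a few of them to a small tab-delimited table; the file `genes.tsv`, its three columns (sample, gene, length), and all values in it are invented for illustration.

```shell
# Create a small tab-delimited table in a scratch directory
tab=$(mktemp -d)
cd "$tab"
printf 'sample1\tCOI\t658\nsample2\tcytb\t1140\nsample3\tCOI\t650\n' > genes.tsv

# Count how many rows exist per gene (column 2)
cut -f 2 genes.tsv | sort | uniq -c

# Sort the rows numerically by length (column 3)
sort -k 3,3n genes.tsv

# List the samples that are not cytb
grep -v "cytb" genes.tsv | cut -f 1

# Rename the samples on the fly (the file itself is unchanged without -i)
sed 's/sample/specimen/g' genes.tsv
```

Because uniq only collapses adjacent duplicates, the sort step before uniq -c is what makes the counts correct.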
Exercise 5 You are working in the directory museomics/part1/ex5, which contains the FASTA file called genes.fasta. The first three sequences are:
>COI_Homo_sapiens
ATGCTAGCTAGCTAGC
>COI_Mus_musculus
ATGCTAGCTAGCTAGC
>cytb_homo_sapiens
ATGCGGATCGATCGTA
There are dozens of sequences, so counting or manipulating them manually is too laborious. Which commands should we use to (1) list only the FASTA headers, (2) count how many sequences are in the file, and (3) count how many sequences contain the word Homo, ignoring case?
Show answer
(1) grep "^>" genes.fasta
(2) grep "^>" genes.fasta | wc -l
(3) grep -i "homo" genes.fasta | wc -l
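As an alternative sketch, grep can count matching lines by itself with the -c option, which avoids the extra wc -l step. The example below uses a small file named `genes_demo.fasta` (a made-up name) containing only the three sequences shown in the exercise, and it restricts the third search to header lines so that sequence lines can never be counted by accident.

```shell
# Recreate the three example sequences in a scratch directory
fa=$(mktemp -d)
cd "$fa"
printf '>COI_Homo_sapiens\nATGCTAGCTAGCTAGC\n>COI_Mus_musculus\nATGCTAGCTAGCTAGC\n>cytb_homo_sapiens\nATGCGGATCGATCGTA\n' > genes_demo.fasta

grep "^>" genes_demo.fasta                    # (1) list the headers
grep -c "^>" genes_demo.fasta                 # (2) count the headers: prints 3
grep "^>" genes_demo.fasta | grep -ic "homo"  # (3) headers with homo, any case: prints 2
```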
Exercise 6 You are working in the directory museomics/part1/ex6, which contains the FASTQ file called reads.fq.zip. The first four reads are:
@read003
atGCTAGCtAGC
+
IIIIIIIIIIII
@read002_LOWQUAL
atgctagnnnn
+
!!!!!IIIII!!
@read003
atGCTAGCtAGC
+
IIIIIIIIIIII
@read004_LOWQUAL
ATGCGTAGCTAG
+
!!!!!!!IIIII
(1) Remove all reads marked as LOWQUAL (remove the entire read, not only the header), (2) deduplicate sequences, and (3) convert all nucleotides to uppercase.
Show answer
(1) cat reads.fq | paste - - - - | grep -v "LOWQUAL"
(2) cat reads.fq | paste - - - - | sort -u
(3) cat reads.fq | tr actgn ACTGN
Given that the solution for the third task also replaces the letters a, c, t, g, and n in the headers, a better solution (which requires GNU sed, so it works in Linux but not in macOS) is:
(3) cat reads.fq | sed '2~4 y/actgn/ACTGN/'
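The three tasks can also be chained into one pipeline that returns a valid four-line FASTQ file. This is only a sketch: it assumes GNU sed (as noted above), it writes the four example reads to a file named `reads_demo.fq`, and the output name `clean_demo.fq` is likewise invented. The extra `tr` step converts the tabs introduced by paste back into line breaks.

```shell
# Write the four example reads to a file in a scratch directory
fq=$(mktemp -d)
cd "$fq"
printf '%s\n' \
  '@read003' 'atGCTAGCtAGC' '+' 'IIIIIIIIIIII' \
  '@read002_LOWQUAL' 'atgctagnnnn' '+' '!!!!!IIIII!!' \
  '@read003' 'atGCTAGCtAGC' '+' 'IIIIIIIIIIII' \
  '@read004_LOWQUAL' 'ATGCGTAGCTAG' '+' '!!!!!!!IIIII' > reads_demo.fq

# 1. paste: put each read (4 lines) on a single tab-separated line
# 2. grep -v: drop the reads marked as LOWQUAL
# 3. sort -u: keep one copy of identical reads
# 4. tr: turn the tabs back into line breaks (4 lines per read again)
# 5. sed: uppercase only the sequence lines (line 2 of every 4; GNU sed only)
paste - - - - < reads_demo.fq |
  grep -v "LOWQUAL" |
  sort -u |
  tr '\t' '\n' |
  sed '2~4 y/actgn/ACTGN/' > clean_demo.fq

cat clean_demo.fq
```

With these four reads, only one copy of read003 remains, with its sequence in uppercase.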
Installing tools efficiently is key to running analyses reproducibly. This tutorial covers the main ways to install and manage software in Linux. The examples are the programs required in Part 2 of the Museomics Workshop: gargamel, FastQC, Cutadapt, Tally, FastqScreen, BWA, MITObim, samtools, bamtools, TNT, and PhyG.
Some software comes as a ready-to-use binary, meaning you don’t need to compile it. You only need to change the permissions of the binary and move it to the PATH.
Every file and directory in Linux has three types of permissions: read (r), write (w), and execute (x).
Every file has permissions specified for three types of users: the owner, the group, and all other users.
When you type ls -l in Bash, the first string in each line indicates the permissions. For instance:
> ls -l
-rwxr-xr-- 1 alice bioinfo 2345 Jan 18 12:00 analysis.sh
The user alice owns the file analysis.sh of 2345 bytes, last modified on January 18th. The permissions of this file for the owner, group, and others are rwx (reading, writing and executing), r-x (only reading and executing), and r-- (only reading), respectively. That is, users in the same group can only read and execute the file, whereas users that are not part of the group can only read it. If Alice wants to make this file executable to all users, she can use the command chmod +x analysis.sh. However, the program analysis.sh can only be used if the user is in the same directory where it is located (e.g. typing ./analysis.sh) or if the user specifies the path to it. This is not convenient, so moving the file analysis.sh to the PATH (a list of directories that Linux searches when you type a command) will allow Linux to easily execute analysis.sh.
> echo $PATH # list PATH directories
/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin
> sudo mv analysis.sh /usr/local/bin/ # move the program to one of the PATH directories (usually requires sudo)
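If you lack administrator rights, an alternative is to add a directory of your own to the PATH instead of moving the program into a system directory. The sketch below assumes a throwaway directory (`bindir`) and a made-up script called `analysis_demo.sh`; in practice, a permanent directory in your home folder would play the role of `bindir`.

```shell
# A directory that you own (here a temporary one, for demonstration only)
bindir=$(mktemp -d)

# Create a tiny script; it is not executable yet
printf '#!/bin/sh\necho "analysis finished"\n' > "$bindir/analysis_demo.sh"
ls -l "$bindir/analysis_demo.sh"   # the permission string contains no x

# Add the execute permission and check again
chmod +x "$bindir/analysis_demo.sh"
ls -l "$bindir/analysis_demo.sh"   # now the permission string contains x

# Put the directory at the front of the PATH (for this shell session only)
PATH="$bindir:$PATH"

# The script can now be called by name from any directory
analysis_demo.sh
```

The change to PATH is lost when the shell is closed; adding the same assignment to ~/.bashrc makes it permanent.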
TNT is the most popular program in phylogenetic analyses under parsimony and static homology (nucleotides aligned via a similarity function from MSA, assumes homologies a priori, based on phenetics). In contrast, PhyG performs phylogenetic analyses using dynamic homology (nucleotides aligned via tree-alignment, assumes that the best homology schemes are implied from the optimal tree). Both programs can be installed via binaries.
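unzip tnt-linux.zip # extract the TNT binary
chmod +x tnt # make it executable
echo $PATH # check which directories are in the PATH
sudo mv tnt /usr/local/bin/ # writing to /usr/local/bin usually requires sudo
chmod +x phyg-linux-x86 # make the PhyG binary executable
echo $PATH
sudo mv phyg-linux-x86 /usr/local/bin/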
In addition to binaries, most programs in bioinformatics are available in Conda, which is an environment manager widely used in scientific computing to install software and manage dependencies in isolated environments. This is useful to avoid dependency conflicts.
Two popular Conda distributions are Anaconda (a large distribution of ca. 3 GB containing hundreds of packages for data science) and Miniconda (a minimal distribution of ca. 70 MB containing only Conda, Python, and a few libraries). Bioconda is not a Conda distribution, but a channel (repository) hosting most bioinformatics software.
# Install conda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Re-start shell
source ~/.bashrc
conda --version
Below, we will install all programs necessary for Part 2 of the Museomics Workshop.
seqkit will be installed in the conda environment called seqkit.
conda create -n seqkit bioconda::seqkit
conda activate seqkit
seqkit
conda deactivate
TNT and PhyG are not available in Bioconda. In contrast, FastQC v.0.12.1 is available and will be installed in the conda environment called fastqc.
# Install possible dependencies of java for fastqc
sudo apt update
sudo apt install -y libxtst6 libxrender1 libxi6 libxrandr2 libxinerama1 libxcursor1
# Create an environment
conda create -n fastqc bioconda::fastqc=0.12.1
conda activate fastqc
fastqc -h
conda deactivate
Cutadapt v.2.6 will be installed in the environment cutadapt with Python v.3.7.
conda create -n cutadapt python=3.7 bioconda::cutadapt=2.6
conda activate cutadapt
cutadapt -h
conda deactivate
Tally (from Reaper v.16.098) will be installed in the environment tally.
conda create -n tally bioconda::reaper
conda activate tally
tally -h
conda deactivate
FastqScreen v.0.15.3 will be installed in the environment fastqscreen. In my experience, recent versions can perform poorly in Linux, so use this specific version.
conda create -n fastqscreen bioconda::fastq-screen=0.15.3
conda activate fastqscreen
fastq_screen -h
conda deactivate
Note that the environment is called fastqscreen, whereas the command is fastq_screen.
BWA v.0.7.17 will be installed in the environment bwa.
conda create -n bwa bioconda::bwa=0.7.17
conda activate bwa
bwa
conda deactivate
MITObim v.1.9.1 will be installed in the environment mitobim. In addition to MITObim, Mira and MiraConvert will be installed automatically. For macOS and Windows, use Docker (instructions here; conda does not work, in my experience). For Linux (including WSL on Windows), the following commands are currently working:
conda create -n mitobim bioconda::mitobim=1.9.1
conda activate mitobim
MITObim.pl -h
conda deactivate
Samtools and bamtools will be installed in the environment samtools.
conda create -n samtools -c conda-forge -c bioconda samtools=1.23
conda activate samtools
conda install bioconda::bamtools
samtools
bamtools
conda deactivate
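After activating an environment, it can be useful to confirm that the expected commands are actually reachable through the PATH. The sketch below defines a small helper for this; the function name `check_tool` is invented here, whereas `command -v` is a standard shell built-in that reports where a command is found.

```shell
# Report whether a command can be found in the current PATH
check_tool() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# Example: run this after "conda activate samtools"
for tool in samtools bamtools; do
  check_tool "$tool"
done
```

A tool reported as missing usually means that the corresponding environment is not active or that the installation failed.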
| Method | When to use | Pros | Cons |
|---|---|---|---|
| Precompiled binaries | Single tool, simple | Fast, no compilation | Manual updates, dependency issues |
| Conda / Bioconda | Multiple tools, environments | Dependency management, reproducible | Disk space, environment conflicts |
| Compilation | Latest version, HPC optimization | System optimized | Error-prone, requires build tools |
| Docker | Reproducibility, pipelines | Isolated, cross-platform | Requires Docker, may be slow |
| Modules | HPC / cluster | Easy switching, no installation | Only on supported systems |