Timings

  1. Run denovo_map (~45 min)
  2. Exploring the VCF file with VCFtools (~15 min)

All steps are done on the EVE cluster



Objectives

This exercise will show you how to run denovo_map in Stacks from start to finish (assuming that you have demultiplexed data after running process_radtags). For this exercise we will use some already-processed data for Hyphaene coriacea, a palm species occurring in Madagascar. We’ll then check the quality of the output files using VCFtools, and use populations to do some further filtering.

Some notes:


Paper: Rochette and Catchen 2017. That paper is a bit out of date, since Stacks 2 came in 2019 with many new ways of dealing with paired-end data (IMPORTANT: don’t merge/concatenate the 4 output files from process_radtags!), but otherwise it is really useful.


All the scripts from the paper can be found here



1. Run denovo_map

Parameter optimization with denovo_map.pl

After process_radtags you will have 4 files per sample (sample.1.fq, sample.2.fq, sample.rem.1.fq, sample.rem.2.fq), and you can run Stacks step by step or use the denovo_map.pl wrapper, which does the same in just one line. It is highly recommended to do parameter optimization before continuing with the final steps:
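For example, for a hypothetical sample HC6_01, the demultiplexed directory would contain something like this (the sample name and .gz compression are assumptions; they depend on your process_radtags options):

ls demultiplexed_data/HC6/HC6_01*
# HC6_01.1.fq.gz  HC6_01.2.fq.gz  HC6_01.rem.1.fq.gz  HC6_01.rem.2.fq.gz

Keep these four files separate; denovo_map.pl will pick them up by sample name from the popmap.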


Paper: Paris, Stevens, & Catchen, 2017


For the parameter optimization, the denovo_map.pl program is run several times with different parameter settings, as recommended (m=3, M=n). denovo_map.pl uses a test.popmap to know which samples to include in the run. To make a test.popmap, choose a random subset of samples from all populations (e.g., 2-3 individuals per pop) and assign all samples to the same pop, even if they don’t belong to it (just for the optimization).
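A minimal sketch of such a test.popmap.txt (sample names here are hypothetical; the format is <sample name><TAB><population>, with everyone assigned to one dummy population):

HC6_01	opt
HC6_02	opt
HC7_01	opt
HC7_02	opt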

This script is going to run denovo_map.pl 9 times (M=n=1-9), creating one output folder per run (stacks.M$M):


#!/bin/bash
#SBATCH -J denovo_map_test.parameters
#SBATCH --mail-user=YOUREMAIL@gmail.com
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/%u/%x-%j.out
#SBATCH --error=/work/%u/%x-%j.err 
#SBATCH --cpus-per-task=20 
#SBATCH --mem-per-cpu=8G
#SBATCH -t 48:00:00

# Set the requested number of cores to the number of Threads your app should use
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Paths and filenames for this analysis

M_values="1 2 3 4 5 6 7 8 9"
WORK_DIR="/work/$USER/ddRAD-seq_workshop"
popmap="/work/$USER/ddRAD-seq_workshop/data/Exercise_3/popmaps/test.popmap.txt"
OUT_DIR="/work/$USER/ddRAD-seq_workshop/outputs/Exercise_3/test.denovo"

# Create OUT_DIR and its subdirectories

cd "$WORK_DIR" || exit
mkdir -p "$OUT_DIR"
cd "$OUT_DIR" || exit

for M in $M_values
do
    mkdir stacks.M"$M"
done

## Load modules and activate software
module purge
module load Anaconda3
source activate /global/apps/stacks_2.61/
  
# denovo_map.pl will execute the Stacks pipeline by running each of the Stacks components
# individually: ustacks, cstacks, sstacks, tsv2bam, gstacks and populations.
# We are doing this to select the parameters M (ustacks) and n (cstacks), whose optimal values depend
# on the amount of genetic diversity within the species as well as on the quality of the raw data.
# This is done with only a subset of samples from all the populations. This subset is listed in the
# test.popmap file, so Stacks will only run the analyses on the samples specified there. We will
# vary M and n (M=n) from 1 to 9, and set m = 3.

# --samples = path to the directory containing the samples (sample names will be read from the population map)
# --popmap = file path to the population map (<sample name><TAB><population>)
# -o = file path to write the pipeline output files
# -X = additional options for specific pipeline components, e.g. -X "populations: --min-maf 0.05". We will run populations separately afterwards
# -M = number of mismatches allowed between stacks within individuals (for ustacks)
# -n = number of mismatches allowed between stacks between individuals (for cstacks)
# -m = minimum depth of coverage required to create a stack (default 3)
# --paired = after assembling RAD loci, assemble contigs for each locus from paired-end reads
# --rm-pcr-duplicates = remove all but one set of read pairs of the same sample that have the same insert length
# -r = minimum percentage of individuals in a population required to process a locus for that population (for populations; default: 0)
# -T = the number of threads/CPUs to use (default: 1)

# Run denovo_map on the subset of samples set by the test.popmap

for M in $M_values
do
    out_dir="$OUT_DIR/stacks.M$M"
    reads_dir="$WORK_DIR/data/Exercise_3/demultiplexed_data/HC6"
    log_file="$out_dir"/denovo_map.oe
    denovo_map.pl --samples "$reads_dir" --popmap "$popmap" -o "$out_dir" -T "$SLURM_CPUS_PER_TASK" -M "$M" -n "$M" -m 3 --paired &> "$log_file"
done

# Run populations again with '-r 0.80' (loci present in 80% of samples)

for M in $M_values 
do
    stacks_dir=stacks.M"$M"
    out_dir="$stacks_dir"/populations.r80
    mkdir "$out_dir"
    log_file="$out_dir"/populations.oe
    populations -P "$stacks_dir" -O "$out_dir" -t "$SLURM_CPUS_PER_TASK" -r 0.80 &> "$log_file"
done


In each stacks.M$M folder we will have all the output from Stacks plus another folder, populations.r80, with the results from running populations restricted to the polymorphic loci shared across 80% of the samples (the r80 loci). From these runs we need to extract some key numbers to choose the best value of M=n for our data.

#!/bin/bash
#SBATCH -J extract_results
#SBATCH --mail-user=YOUREMAIL@gmail.com
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/%u/%x-%j.out
#SBATCH --error=/work/%u/%x-%j.err 
#SBATCH --cpus-per-task=2 
#SBATCH --mem-per-cpu=3G
#SBATCH -t 1:00:00

# Paths and filenames for this analysis

M_values="1 2 3 4 5 6 7 8 9"
WORK_DIR="/work/$USER/ddRAD-seq_workshop/output/Exercise_3/test.denovo"

## Load modules and activate software

module purge
module load Anaconda3
source activate /global/apps/stacks_2.61/
  
cd "$WORK_DIR" || exit 
mkdir "$WORK_DIR/results"


for M in $M_values
do
    stacks-dist-extract stacks.M"$M"/populations.r80/populations.log.distribs snps_per_loc_postfilters >> results/M"$M"_snp_distribution.tsv
    grep -v "^#" stacks.M"$M"/populations.r80/populations.sumstats.tsv | cut -f 1 | sort -n | uniq | wc -l >> results/M"$M"_r80.polymorphicLOCI.tsv
    awk 'NR == 6 {print $5}' stacks.M"$M"/populations.r80/populations.sumstats_summary.tsv >> results/M"$M"_r80.polymorphicLOCI_summary.tsv
done

# Collect the per-M counts into single files (once, after the loop, so nothing gets duplicated)
cat results/*.polymorphicLOCI.tsv > results/all.polymorphicLOCI.tsv
cat results/*.polymorphicLOCI_summary.tsv > results/all.polymorphicLOCI_summary.tsv


Then, in the results folder you will be able to find:

- The total number of assembled loci: count the number of lines in the populations.haplotypes.tsv file.
- The number of polymorphic r80 loci: in the populations.sumstats_summary.tsv file.
- The SNPs-per-locus distributions after filtering: in populations.log.distribs, extracted with the stacks-dist-extract utility from Stacks (this step is included in the script).

To know which M=n to use, plot the number of polymorphic r80 loci for each parameter setting, and the number of new polymorphic loci found as M increases:


[SHOW FIGURES]


When increasing M=n no longer yields a substantially higher number of polymorphic loci, that is the M=n we choose. In this case it would be M=4.
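As a quick numeric check alongside the plots, you can print the gain in r80 loci per increment of M; a minimal sketch, assuming all.polymorphicLOCI.tsv holds one count per line for M = 1..9:

awk 'NR > 1 { printf "M=%d: +%d new r80 loci\n", NR, $1 - prev } { prev = $1 }' results/all.polymorphicLOCI.tsv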


Run denovo_map.pl

Now that we have optimized the parameters, we can run the denovo_map.pl program with the chosen M. Here you could decide to raise m (the minimum stack depth), although coverage can also be filtered later with VCFtools.

#!/bin/bash
#SBATCH -J denovo_map_full
#SBATCH --mail-user=YOUREMAIL@gmail.com
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/%u/%x-%j.out
#SBATCH --error=/work/%u/%x-%j.err
#SBATCH --cpus-per-task=14 
#SBATCH --mem-per-cpu=8G
#SBATCH -t 48:00:00

# Set the requested number of cores to the number of Threads your app should use
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Paths and filenames for this analysis

M=SETYOURM
WORK_DIR="/work/$USER/ddRAD-seq_workshop"
popmap="/work/$USER/ddRAD-seq_workshop/data/Exercise_3/popmaps/popmap6.txt"

out_dir="/$WORK_DIR/output/Exercise_3/stacks.denovo"
reads_dir="$WORK_DIR/data/Exercise_3/demultiplexed_data/HC6"
log_file="$out_dir"/denovo_map.oe

cd "$WORK_DIR" || exit 
mkdir "$out_dir"

## Load modules and activate software

module purge
module load Anaconda3
source activate /global/apps/stacks_2.61/
  
# denovo_map.pl will execute the Stacks pipeline by running each of the Stacks components individually: ustacks, cstacks, sstacks, tsv2bam, gstacks and populations.
# This time we run it on the full set of samples listed in the popmap, using the M=n value chosen during the parameter optimization and keeping m = 3.

# --samples = path to the directory containing the samples (sample names will be read from the population map)
# --popmap = file path to the population map (<sample name><TAB><population>)
# -o = file path to write the pipeline output files
# -X = additional options for specific pipeline components, e.g. -X "populations: --vcf" as used below; we will run populations separately afterwards for filtering
# -M = number of mismatches allowed between stacks within individuals (for ustacks)
# -n = number of mismatches allowed between stacks between individuals (for cstacks)
# -m = minimum depth of coverage required to create a stack (default 3)
# --paired = after assembling RAD loci, assemble contigs for each locus from paired-end reads
# --rm-pcr-duplicates = remove all but one set of read pairs of the same sample that have the same insert length
# -r = minimum percentage of individuals in a population required to process a locus for that population (for populations; default: 0)
# -T = the number of threads/CPUs to use (default: 1)


# Run denovo_map on the samples listed in the popmap

denovo_map.pl --samples "$reads_dir" --popmap "$popmap" -o "$out_dir" -T "$SLURM_CPUS_PER_TASK" -M "$M" -n "$M" -m 3 --paired -X "populations: --vcf" &> "$log_file"
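The -X "populations: --vcf" option makes the final populations run write populations.snps.vcf into $out_dir; that is the file we will explore and filter with VCFtools below. A quick check once the job finishes (a sketch):

ls "$out_dir"/populations.snps.vcf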



2. Exploring the VCF file with VCFtools

VCFtools: if you want to know more, you can check this tutorial.

With VCFtools you can get:

--freq2 = allele frequencies per site
--depth = mean depth per individual
--site-mean-depth = mean depth per site
--missing-indv = proportion of missing data per individual
--missing-site = proportion of missing data per site
--het = heterozygosity and inbreeding coefficient per individual

#!/bin/bash

#SBATCH -J vcftools
#SBATCH --mail-user=YOUREMAIL@gmail.com
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/%u/%x-%j.out
#SBATCH --error=/work/%u/%x-%j.err
#SBATCH --mem-per-cpu=4G
#SBATCH -t 48:00:00

# Paths and filenames for this analysis

WORK_DIR="/work/$USER/ddRAD-seq_workshop"
out_dir="/work/$USER/ddRAD-seq_workshop/outputs/Exercise_3/stacks.denovo/VCFtools"
mkdir "$out_dir"
vcf_dir="$WORK_DIR/stacks.denovo/VCFtools/ m5-100_miss0.50_2alleles_.vcf"
log_file="$out_dir"/vcftools.oe

## Load modules and activate software

module load foss/2019b VCFtools/0.1.16

# VCFtools - vcftools is a suite of functions for use on genetic variation data in the form of VCF and BCF #files. 
#The tools provided will be used mainly to summarize data, run calculations on data, filter out data, and #convert data into other useful file formats.
# SYNOPSIS:
# vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT PREFIX ] [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]

# Run VCFtools to calculate some basic stats from our vcf file

cd "$out_dir" 
    vcftools --vcf "$vcf_dir" --freq2 --out "./freq2" --max-alleles 2 &> "$log_file"
    vcftools --vcf "$vcf_dir" --depth --out "./ind_depth" &> "$log_file" 
    vcftools --vcf "$vcf_dir" --site-mean-depth --out "./mean_depth_site" &> "$log_file"
    vcftools --vcf "$vcf_dir" --site-quality --out "./site_quality" &> "$log_file"
    vcftools --vcf "$vcf_dir" --missing-indv --out "./missing_ind" &> "$log_file"
    vcftools --vcf "$vcf_dir" --missing-site --out "./missing_ind" &> "$log_file"
    vcftools --vcf "$vcf_dir" --het --out "./het" &> "$log_file"    


The most important file at this point is the .imiss file, which contains the proportion of missing data per individual. It has been shown that removing individuals with a high proportion of missing data can help recover a higher number of loci.

Paper: Cerca et al. 2021

Check the .imiss file, identify individuals with high missing data, and create a new popmap (e.g., popmap_90 if you have removed every individual with more than 90% missing data). When running populations later for the filtering, use the new popmap_90 and further filter the data to get the final dataset.
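A minimal sketch for spotting high-missingness individuals (F_MISS is the fifth column of the .imiss file; keep_90.txt is just an example name):

# Ten individuals with the most missing data
tail -n +2 missing_ind.imiss | sort -k5,5gr | head -n 10

# Keep only individuals with less than 90% missing data as a starting point for popmap_90
awk 'NR > 1 && $5 < 0.90 { print $1 }' missing_ind.imiss > keep_90.txt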



Filtering with VCFtools and populations

For filtering with VCFtools:

#!/bin/bash

#SBATCH -J vcftools_filtering
#SBATCH --mail-user=YOUREMAIL@gmail.com
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --chdir /work/YOURUSER    
#SBATCH --output=/work/%u/%x-%j.out
#SBATCH --error=/work/%u/%x-%j.err  
#SBATCH --mem-per-cpu=4G
#SBATCH -t 48:00:00
# Paths and filenames for this analysis

WORK_DIR="/work/$USER/ddRAD-seq_workshop"

out_dir="/work/$USER/ddRAD-seq_workshop/outputs/Exercise_3/stacks.denovo/VCFtools"
vcf_dir="$WORK_DIR/stacks.denovo/populations.snps.vcf"
log_file="$out_dir"/vcf_filtering_m5-100_miss0.25_2alleles.oe
    
## Load modules and activate software

module load foss/2019b VCFtools/0.1.16

# VCFtools - vcftools is a suite of functions for use on genetic variation data in the form of VCF and BCF files. 
#The tools provided will be used mainly to summarize data, run calculations on data, filter out data, and convert data into other useful file formats.
# SYNOPSIS:
# vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT PREFIX ] [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]

# Run VCFtools to filter our vcf file

cd "$out_dir" 
vcftools --vcf "$vcf_dir" --remove-indels --max-missing 0.50 --min-alleles 2 --max-alleles 2 --min-meanDP 5 --max-meanDP 100 --minDP 5 --maxDP 100 --recode --out "./m5-100_miss0.50_2alleles_" &> "$log_file" 
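Note that --recode appends a .recode.vcf suffix to the --out prefix, so the filtered file that the downstream scripts read is m5-100_miss0.50_2alleles_.recode.vcf. A quick check (a sketch):

ls "$out_dir"/m5-100_miss0.50_2alleles_.recode.vcf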



Filtering with populations

populations - There are four primary filters that should be used, particularly in de novo analyses, to control for false-positive loci. They control (i) the fraction of individuals of a single population a locus must be found in to be processed (-r), (ii) the number of populations a locus must be found in to be processed (-p), (iii) the minimum allele frequency a variable site must possess to be included (--min-maf), and (iv) the maximum level of heterozygosity a variable site can possess to be included (--max-obs-het). The -r and -p filters primarily provide biological control: a population genetic analysis wants widely available loci, whereas a phylogenetic study wants the maximum number of loci across any subset of species. The most widely used parameters for population genetic studies are:

  • -r 0.80
  • --min-maf 0.05
  • --max-obs-het 0.70


#!/bin/bash

#SBATCH -J populations
#SBATCH --mail-user=YOUREMAIL@gmail.com
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/%u/%x-%j.out
#SBATCH --error=/work/%u/%x-%j.err
#SBATCH --cpus-per-task=4 
#SBATCH --mem-per-cpu=6G
#SBATCH -t 48:00:00

# Set the requested number of cores to the number of Threads your app should use
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# Paths and filenames for this analysis
 
WORK_DIR="/work/$USER/ddRAD-seq_workshop"

out_dir="/work/$USER/ddRAD-seq_workshop/outputs/Exercise_3/stacks.denovo/populations/populations.singleSNP.r075.m5.maf005.het07"
mkdir "$out_dir"
vcf_dir="$WORK_DIR/stacks.denovo/VCFtools/filtered.m5-100_miss0.50_2alleles_.vcf"
popmap="/work/$USER/ddRAD-seq_workshop/data/Exercise_3/popmaps/popmap_90.txt"
log_file="$out_dir"/populations.oe

## Load modules and activate software

module purge
module load Anaconda3
source activate /global/apps/stacks_2.61/
  
# populations - it will analyze a population of individual samples computing a number of population genetics statistics 
# as well as exporting a variety of standard output formats. A population map specifying which individuals belong to which 
# population is submitted to the program and the program will then calculate population genetics statistics such as expected/observed 
# heterozygosity, π, and FIS at each nucleotide position. The populations program will compare all populations pairwise to compute FST.  
# The populations program provides strong filtering options to only include loci or variant sites that occur at certain frequencies in  
# each population or in the metapopulation.

# -P = path to the directory containing the Stacks files (the gstacks output); with -V, a VCF file is read instead.
# --popmap = file path to the population map (<sample name><TAB><population>)
# -O = file path to write the pipeline output files
# -p = minimum number of populations a locus must be present in to process that locus.
# -r = minimum percentage of individuals in a population required to process a locus for that population.
# --min-maf = minimum minor allele frequency required to process a nucleotide site at a locus (0 < min_maf < 0.5).
# --write-single-snp = restrict data analysis to only the first SNP per locus.
# --write-random-snp = restrict data analysis to one random SNP per locus.
# --fstats = enable SNP and haplotype-based F statistics.
# -t = the number of threads/CPUs to use (default: 1)

# Run populations with "-r 0.75" (loci present in 75% of samples), min-maf 0.05 (a variable site must possess a minimum allele frequency of 5% to be included) 
# --max-obs-het 0.7 (maximum level of heterozygosity a variable site can possess to be included) and writing only one single SNP (--write-single-snp).

populations -V "$vcf_dir" -O "$out_dir" --popmap "$popmap" \
-t "$SLURM_CPUS_PER_TASK" -r 0.75 --min-maf 0.05 --max-obs-het 0.7 --write-single-snp --fstats --hwe --vcf --plink --phylip --phylip-var --phylip-var-all &> "$log_file"



Notes on structure of files and popmaps

To run ADMIXTURE

To run ADMIXTURE, you need a .bed file, which you create with PLINK. To use the .vcf file with PLINK, you first need to modify it in R, changing the chromosome identifiers from numeric to character, for example: 4 -> contig_4.

rm(list=ls())
############# Script to prepare vcf file for PLINK - Walter Durka and Laura Mendez ############
setwd("YOURPATH")

library(vcfR)
library(adegenet)

vcf <- read.vcfR("filtered.m5-100_miss0.50_2alleles.recode.p.snps.vcf")

new.file<-"outputs/PLINK_VCF/plink.vcf.gz"

vcf@fix[,1]<-paste("contig_",vcf@fix[,1],sep="")

write.vcf(vcf, file = new.file, mask = FALSE, APPEND = FALSE)09_PLINK.sh


Then make a new folder on your cluster, like /work/$USER/ddRAD-seq_workshop/outputs/Exercise_3/stacks.denovo/PLINK, and transfer the new plink.vcf.gz file there. Then you can run PLINK using that file:

#!/bin/bash

#SBATCH -J PLINK
#SBATCH --mail-user=YOUREMAIL@gmail.com
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/%u/%x-%j.out
#SBATCH --error=/work/%u/%x-%j.err
#SBATCH --mem-per-cpu=4G
#SBATCH -t 10:00:00


#Paths and filenames for this analysis
WORK_DIR="/work/$USER/ddRAD-seq_workshop/outputs/Exercise_3/stacks.denovo"
out_dir="$WORK_DIR/PLINK"
vcf_dir="$WORK_DIR"/PLINK/plink.vcf.gz
log_file="$out_dir"/plink.oe

## Load modules and activate software
module load Anaconda3
source activate /data/Popgen/programs/plink-1.90

# PLINK - free open-source toolset to perform a range of basic, large-scale analyses in a computationally efficient manner.
# The focus of PLINK is purely on analysis of genotype/phenotype data
# NOTE that in the vcf file, "chromosomes" (=contig) must not be numeric, e.g. "41", but non-numeric "contig_41"
# Run PLINK to sort chromosomes
plink --vcf "$vcf_dir" --make-bed --allow-extra-chr --out "$out_dir/plink_sorted"
cd "$out_dir" || exit
mkdir results

## Run PLINK to create a .bed file from the vcf file for subsequent analysis with ADMIXTURE. ADMIXTURE cannot handle non-numeric chromosome names, so we overwrite the first column of the .bim file with zeroes
plink --bfile "$out_dir/plink_sorted" --het --hardy --make-bed --out "$out_dir/results/output" --allow-extra-chr --recode 12
awk '{$1=0;print}' "$out_dir/results/output.bim" > "$out_dir/results/output.bim.tmp"
mv "$out_dir/results/output.bim.tmp" "$out_dir/results/output.bim"

You can run ADMIXTURE assuming 1-10 clusters (K), running a total of 5 replicates for each K (change the seed for each run; otherwise the default seed is always the same: 43), and determine the best K by estimating the cross-validation error (see 10_ADMIXTURE_all_5reps.sh). A minimal sketch of that loop is shown below.
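This sketch assumes the same $out_dir/results layout as the PLINK script above; the log names and the use of $RANDOM as seed are assumptions. Each replicate runs in its own folder so the .Q/.P outputs of ADMIXTURE are not overwritten:

cd "$out_dir/results" || exit
for K in $(seq 1 10)
do
    for rep in $(seq 1 5)
    do
        mkdir -p rep"$rep"
        # --cv = report cross-validation error; -s = random seed (vary it per replicate)
        (cd rep"$rep" && admixture --cv -s "$RANDOM" ../output.bed "$K" > log.K"$K".rep"$rep".out)
    done
done

# Collect the cross-validation errors; the best K has the lowest CV error
grep -h "CV error" rep*/log.K*.out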



To run RAxML

To run RAxML with one tip per individual, you first need to run populations one more time with a special popmap in which each individual belongs to its own population (just give each population a different name: Bm1, Bm2, Bm3, ... for each individual), and set the following flags: --phylip --phylip-var --phylip-var-all
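A minimal sketch of such a per-individual popmap (sample names here are hypothetical; the format is <sample name><TAB><population>):

sample_01	Bm1
sample_02	Bm2
sample_03	Bm3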