Tutorial: F-statistics

Timings

Re-reun populations of Stacks to get FST and nucleotide diversity estimates per population (15 min)
Open the file in R to explore the outputs (2 min)

All done in EVE cluster

Objectives

As well as population structure and phylogeny, you probably also want to know population-specific information regarding genetic diversity and genetic differentiation. For this we can simply rerun populations from Stacks using the –fstats option which will give all kinds of interesting summary statistics about your data based on the population map you supply

In this exercise we will use some basic output files from Stacks that I’ve already processed (again for our treefrog Leptopelis flavomaculatus). We will then calculate pairwise FST between populations and provide specific summary statistics of genetic diversity across populations using Stacks

1. Run populations with –fstats flag to calculate genetic diversity and FST

#This is all done in the EVE cluster, will be fairly quick (ca. 10 mins), again build yoru own script or edit the existing one (08_FST_genetic_diversity.sh)


#!/bin/bash                                                                                                                            

#SBATCH --job-name=FST_genetic_diversity
#SBATCH --mail-user=christopher_david.barratt@uni-leipzig.de
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/$USER/ddRAD-seq_workshop/job_logs/%x-%j.log
#SBATCH --cpus-per-task=1 
#SBATCH --mem-per-cpu=40G
#SBATCH --time=12:00:00

#load environment
module purge
module load Anaconda3
source activate /global/apps/stacks_2.61/

cd /work/$USER/ddRAD-seq_workshop/data/Exercises_4-8/FST_genetic_diversity/


# populations - 
# -P = working directory 
# -M = population map
# -t = max number of threads
# -m = coverage threshold
# -p = population limit threshold to include loci
# -r = individual limit threshold (per population) to include loci
# --structure --phylip = output file in structure/phylip format
# --write_random_snp = write only one random snp per locus
# --min_maf 0.05 = specify minimum minor allele frequencies to include loci

# It's 23 separate populations, so p=19 = 19/23 = only take SNPs present in 82.6% of data
populations -P /work/$USER/ddRAD-seq_workshop/data/Exercises_4-8/FST_genetic_diversity/Lflav -O /work/$USER/ddRAD-seq_workshop/data/Exercises_4-8/FST_genetic_diversity/Lflav -M /work/$USER/ddRAD-seq_workshop/data/Exercises_4-8/FST_genetic_diversity/Lflav_popmap_FST -t 1 -p 19 -r 0.5 --genepop --fstats --min-maf 0.05

# move files to outputs
mv /work/$USER/ddRAD-seq_workshop/data/Exercises_4-8/FST_genetic_diversity/Lflav/populations.* /work/$USER/ddRAD-seq_workshop/outputs/Exercise_8/

2. Explore the output files in R

Lots of files are generated by the populations module of Stacks when running –fstats. Most of these are pairwise estimates of genetic diversity between populations as defined in the population map (in our case 23 separate ‘populations’).

FST files are SNP-based measures
phistats files are haplotype-based measures

Let’s run rsync again to pull in the files from what we just ran into our desktop directory:

# transfer from the cluster
rsync -avhP -e 'ssh -p 8022' \
  barratt@localhost:/work/$USER/ddRAD-seq_workshop/outputs/ \
  /Users/cb76kecu/Desktop

Genetic diveristy and inbreeding statistics per population

We’ll first focus on the SNP-based measures using the {r}populations.sumstats_summary.tsv file:

# Read in .dat file (fstat) format needed by Adegenet package for running DAPC
populations.sumstats_summary.tsv <- read.delim("/Users/cb76kecu/Desktop/ddRAD-seq_workshop/outputs/Exercise_8/populations.sumstats_summary.tsv",
    header = FALSE)

Look at this object in R and you can see various estimates for each population based on the data and population mape used to run populations. Keep in mind these would change if you feed Stacks a different population map (e.g. fewer individuals) or different parameters for completeness of data (e.g. -p flag in populations).

Perhaps most interesting of these stats for each populations includes: * ‘Private’ (# of private alleles) * ‘Exp_Het’ (Expected heterozygosity) * ‘Obs_Het’ (Observed heterozygosity) * ‘Pi’ (a measure of nucleotide diversity) * ‘Fis’ (inbreeding co-efficient - higher = a more inbred population)

These data cam be used for plotting genetic diversity maps for example, or as response variable for statistical tests on drivers of genetic diversity.

Pairwise genetic differentiation (FST)

Secondly if you’re more interested in how well populations are differentiated from one another rather than their genetic diversity per se, we can leverage the summary file {r}populations.fst_summary.tsv file:

# Read in .dat file (fstat) format needed by Adegenet package for running DAPC
populations.fst_summary.tsv <- read.delim("/Users/cb76kecu/Desktop/ddRAD-seq_workshop/outputs/Exercise_8/populations.fst_summary.tsv",
    header = FALSE)

Look at this object in R and you see a pairwise site matrix of FST values between each defined population (higher numbers are more genetically differentiated from one another). Again this can be plotted visually if you want to, or used in statistical tests such as finding environmental drivers of population genetic differentiation.