Tutorial: Admixture

Timings

Run admixture (in EVE cluster, ~3 min) and then copy results to your local machine
Load data (< 1 min)
Plot CV scores (< 1 min)
Make barplot of Admixture results (< 1 min)

Step 1 done in EVE cluster, Steps 2-4 done locally on your machines

Objectives

For this exercise and the remainder of the workshop we will use some published amphibian data to look at population structure, phylogeny and genetic diversity estimates. The data are for Leptopelis flavomaculatus, a forest tree frog from Tanzania, and were published here.

For this exercise in particular using Admixture software, we will learn how to go from a Stacks output file to investigate population structure, find statistical support to show the ‘best’ explained population structure (i.e. number of clusters), and then plot a nice barplot to show population assignment for each individual.

1. Run Admixture (on the EVE cluster)

#This is all done in the EVE cluster, you can make your own scripts to do this or use the one provided (04_Admixture.sh). It can be submitted via sbatch in the EVE cluster, but be sure to modify the email address and output paths to your own details.

It’s a good idea to add a random seed generator to admixture calls, e.g. with the -s time addition to your code which will start the seed based on the current time (so that you always start from a different seed)

#!/bin/bash                                                                                                                            

#SBATCH --job-name=Admixture
#SBATCH --mail-user=christopher_david.barratt@uni-leipzig.de
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/$USER/ddRAD-seq_workshop/job_logs/%x-%j.log
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=15G
#SBATCH --time=2:00:00

 
#load environment
module purge
module load Anaconda3
source activate /global/apps/plink_1.90


# Converting ped and map files to bed format for Admixture
plink --file /work/$USER/ddRAD-seq_workshop/data/Exercises_4-8/Admixture/Lflav --make-bed --recode --out /work/$USER/ddRAD-seq_workshop/outputs/Exercise_4/Lflav

#run admixture
cd /work/$USER/ddRAD-seq_workshop/outputs/Exercise_4/
bash
source activate /gpfs0/global/apps/admixture_1.3.0
for K in 1 2 3 4 5 6 7 8 9 10; \
do admixture --cv=10 ./Lflav.bed $K| tee log_Lflav.${K}.out; done

# grep to combine all results
grep -h CV log_Lflav*.out > ./RESULTS_Leptopelis_flavomaculatus_admixture.txt

#sed to clean results for later analysis
sed -e 's/CV error (K=//g' ./RESULTS_Leptopelis_flavomaculatus_admixture.txt | sed 's/)://g' > ./RESULTS_Leptopelis_flavomaculatus_admixture_clean.txt

# replace all single digit numbers to begin with 0s for numerical sorting
sed -e 's/1 /01 /g' ./RESULTS_Leptopelis_flavomaculatus_admixture_clean.txt \
| sed -e 's/2 /02 /g' \
| sed -e 's/3 /03 /g' \
| sed -e 's/4 /04 /g' \
| sed -e 's/5 /05 /g' \
| sed -e 's/6 /06 /g' \
| sed -e 's/7 /07 /g' \
| sed -e 's/8 /08 /g' \
| sed -e 's/9 /09 /g'> ./tmp.txt\

# sort output file (01-10) and remove old temporary one
sort ./tmp.txt > ./RESULTS_Leptopelis_flavomaculatus_admixture_clean.txt
rm -rf ./tmp.txt

When the Admixture job is complete we need to copy the output files to our desktop. We can use Cyberduck/FileZilla for this (I’ll show you), or we can use rsync. To use rsync open up another version of your Terminal/PuTTy and type the following (edited to match your account details and local desktop path):

# transfer from the cluster
rsync -avhP -e 'ssh -p 8022' \
  barratt@localhost:/work/$USER/ddRAD-seq_workshop/outputs/ \
  /Users/cb76kecu/Desktop

2. Load data

Load the cleaned Admixture output file that we will analyse. This is a simple text file with the number of clusters (K) and the corresponding 10-fold cross-validation(CV) score. Be aware here and further on that directory separators are different between mac (/) and Windows (), so check you are using the correct ones if you are having trouble reading files

results <- read.delim("/Users/cb76kecu/Desktop/ddRAD-seq_workshop/outputs/Exercise_4/RESULTS_Leptopelis_flavomaculatus_admixture_clean.txt",
    sep = " ", header = FALSE)

3. Plot CV scores

We will first set the plot margins and then set an output file (png) with specific sizes and resolutions Then we plot the results (first points, then a line to join the points)

par(mar = c(2, 2, 2, 2))
png(filename = "/Users/cb76kecu/Desktop/ddRAD-seq_workshop/outputs/Exercise_4/Admixture_Results_CV_scores.png",
    type = "quartz", height = 8, width = 8, units = "in", res = 300)
plot(results, pch = 19, col = "black", cex = 0.25, xlab = "k clusters", ylab = "CV score",
    main = "Cross validation scores across k (Admixture")
lines(results, lty = 1, lwd = 1.5, col = "steelblue2")
dev.off()

Because this dataset is for relatively highly structured (low-dispersal) amphibians, the ‘elbow’ in the plot is very clear - with a lower CV score at K=4. This is not always the case for all datasets, so other approaches might be useful to use a few different methods to look at population structure (reviewers anyway generally like this approach).

4. Make barplot of Admixture results

Now that we know the optimal value of K we can plot the Admixture coefficients across all of our samples (n=59 in this case). We’ll read the Lflav.4.Q output file from Admixture and order it to make it look nicer. Each .Q file is a prediction of population membership per individual across the given number of clusters you decieded to test when running Admixture (i.e. 1:10). In the case of Lflav.4.Q It is a matrix of 59 rows (one per individual) and 4 columns (i.e. K=4). The probability of membership to each population cluster is shown for each individual. We’ll then plot it, add individual labels for each sample and then save the final plot.

# plot bars
tbl = read.table("/Users/cb76kecu/Desktop/ddRAD-seq_workshop/outputs/Exercise_4/Lflav.4.Q")
ord = tbl[order(tbl$V1, tbl$V3, tbl$V2, tbl$V4), ]
png(filename = "/Users/cb76kecu/Desktop/ddRAD-seq_workshop/outputs/Exercise_4/Admixture_Results_Barplot.png",
    type = "quartz", height = 8, width = 30, units = "in", res = 300)
labels <- read.csv("/Users/cb76kecu/Desktop/ddRAD-seq_workshop/data/Exercises_4-8/Admixture/pop_labels.csv",
    header = FALSE)
labels <- data.frame(labels[, 1])
ord_with_labels <- data.frame(ord[, 1:4], row.names = labels[, 1])
par(mar = c(20, 4, 1, 1))
bp = barplot(t(as.matrix(ord_with_labels)), space = (0.04), col = c("blue", "turquoise",
    "darkblue", "lightblue"), xlab = "", ylab = "Ancestry", border = 1, las = 2,
    cex.names = 0.9)
dev.off()

On your own…

If you want, try plotting k=1 - k=5, feel free to change colour schemes and the order of the tbl object. You can also edit the CV validation method t be –cv=5 for example when running Admixture. If you have your own data you can try and use this script to run Admixture and then plot your own outputs