Tutorial: Downloading data

Timings

Download Myotis escalerai (Escalera’s bat) data (ca. 25 min per library - F+R)
un-bzip the files and gzip them so they are ready for Stacks (ca. 30 min)

All done in EVE cluster

Objectives

Sometimes you may wish to download some already published data. Here we will download some data from the European Nucleotide Archive for individuals of a European bat species, Myotis escalarai. The data are from paper published a couple of years ago here and available to browse in the ENA here. The raw data are 5 libraries of dd-RADseq data, I have provided the first library for you in the /ddRAD-seq_workshop/data/Exercises_1-2/raw/ folder (BAT_01_NoIndex_L008_R1_001.fastq.gz, BAT_01_NoIndex_L008_R2_001.fastq.gz). Because these are paired-end sequencing data, For each library there are 2 files - a forward reads file (R1) and a reverse reads file (R2). They need to be downloaded and then un-bzipped and gzipped, as this is the format that Stacks needs.

Try and download one or two of the following remaining libraries (both R1 and R2 files) to practice yourself:

BAT_02_NoIndex_L005_R1_001.fastq.gz
BAT_02_NoIndex_L005_R2_001.fastq.gz
BAT_03_NoIndex_L004_R1_001.fastq.gz
BAT_03_NoIndex_L004_R2_001.fastq.gz
BAT_04_NoIndex_L005_R1_001.fastq.gz
BAT_04_NoIndex_L005_R2_001.fastq.gz
BAT_05_NoIndex_L006_R1_001.fastq.gz
BAT_05_NoIndex_L006_R2_001.fastq.gz

1. Download ENA data and get them in correct format for Stacks

This is all done in the EVE cluster. As with all of the EVE work we will do, it can be submitted via sbatch using the script template (01_download_data.sh) we provide, but be sure to modify the email address and output paths to match your own details.

In the script headers (here and in all other scripts used in EVE):

Always change $USER in all scripts to your own username (the one you login with)
Always change the christopher_david.barratt@uni-leipzig.de to your own email

#!/bin/bash       
                                                                                                                     
#SBATCH --job-name=download_ENA_data
#SBATCH --mail-user=christopher_david.barratt@uni-leipzig.de
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
#SBATCH --output=/work/$USER/ddRAD-seq_workshop/job_logs/%x-%j.log
#SBATCH --cpus-per-task=1 
#SBATCH --mem-per-cpu=15G
#SBATCH --time=04:00:00

# download ENA Data

cd /work/$USER/ddRAD-seq_workshop/data/Exercises_1-2/
cd ./raw

# Myotis escalerai bat data example download == 134 inds ddRADseq Paired end sequenced data, sbfI and nlaIII enzymes. Barcodes are inline inline adapters
#https://www.ebi.ac.uk/ena/browser/text-search?query=PRJEB29086


# I've commented out all except library 1 here, try downloading different files

wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832479/BAT_01_NoIndex_L008_R1_001.fastq.bz2
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832479/BAT_01_NoIndex_L008_R2_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832480/BAT_02_NoIndex_L005_R1_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832480/BAT_02_NoIndex_L005_R2_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832481/BAT_03_NoIndex_L004_R1_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832481/BAT_03_NoIndex_L004_R2_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832482/BAT_04_NoIndex_L005_R1_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832482/BAT_04_NoIndex_L005_R2_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832483/BAT_05_NoIndex_L006_R1_001.fastq.bz2
#wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR283/ERR2832483/BAT_05_NoIndex_L006_R2_001.fastq.bz2

# these are bzip2ed, so need to be unzipped and then gzipped for Stacks
ml GCCcore/10.2.0
ml bzip2/1.0.8

gunzip ./*.bz2.gz
bzip2 -d ./*.bz2
gzip ./*.fastq