Skip to main content

Generate BAM/CRAM output given one or more pairs of FASTQ files by using Parabricks

Generate BAM/CRAM output given one or more pairs of FASTQ files by using Parabricks

fq2bam (FQ2BAM + BWA-MEM)

Parabricks fq2bam is a software tool that can be used to generate BAM/CRAM output from one or more pairs of FASTQ files. This tool takes advantage of the parallel computing capabilities of GPUs to speed up the analysis process.

To use Parabricks, users can provide input files in FASTQ format and specify the reference genome they wish to use for alignment. The software uses a proprietary algorithm to perform read alignment, variant calling, and quality control. The output is then generated in either BAM or CRAM format, depending on the user's preference.

In this case study, we use following dataset:

Reference Genome: human_g1k_v37.fasta

Sample Data Source: SRA SRR7733443

Number Of Read: 2 x 5M bp

Read length: 150bp 

We followed the steps below to collect the performance results:

  1. Download Samples 1.
    1. Download SRA toolkit from https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software#header-global
    2. tar xfzv sratoolkit.2.10.5-centos_linux64.tar.gz
    3. sratoolkit.2.10.5-centos_linux64/bin/fastq-dump -X 5000000 --split-files SRR9932168
  2. Download Reference 
    1. Download reference from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz 
  3. Reference Indexing
    1. use our parabricks container which already include samtools
    2. singularity run --nv /pfss/containers/clara-parabricks.4.0.0-1.sif bash samtools faidx /human_g1k_v37.fasta 
  4.  Download Known Site 
    1. download 00-common_all.vcf from https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/
    2. singularity run /pfss/containers/gatk.sif gatk IndexFeatureFile -I /00-common_all.vcf 
  5. Run Parabricks
    1. we use a100 here, be careful that parabricks currently not support MIG

    2. srun --pty -p gpu --cpus-per-task=16 --gres=gpu:a100:1 --mem=128G bash
      singularity run --nv $SCRATCH/containers/clara-parabricks.4.0.0-1.sif bash 

      NVIDIA_VISIBLE_DEVICES="0" pbrun fq2bam \
      --num-gpus 1 \
      --ref /human_g1k_v37.fasta \
      --in-fq /SRR9932168_1.fastq /SRR9932168_2.fastq \
      --out-bam /mark_dups_gpu.bam \
      --tmp-dir \
      --knownSites /00-common_all.vcf \
      --out-recal-file ~/output/recal_gpu.txt 
  6.  Result 
    1. Output should be something like this 
    2. main] CMD: /usr/local/parabricks/binaries//bin/bwa mem -Z ./pbOpts.txt /pfss/scratch02/appcara/parabricks/parabricks_sample/Ref/human_g1k_v37.fasta /pfss/scratch02/appcara/parabricks/sratoolkit.3.0.6-centos_linux64/SRR9932168_1.fastq /pfss/scratch02/appcara/parabricks/sratoolkit.3.0.6-centos_linux64/SRR9932168_2.fastq @RG\tID:SRR9932168.1.1\tLB:lib1\tPL:bar\tSM:sample\tPU:SRR9932168.1.1
      [main] Real time: 105.672 sec; CPU: 1343.564 sec
      [PB Info 2023-Aug-01 12:07:41] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:41] ||        Program:                      GPU-BWA mem, Sorting Phase-I        ||
      [PB Info 2023-Aug-01 12:07:41] ||        Version:                                           4.0.0-1        ||
      [PB Info 2023-Aug-01 12:07:41] ||        Start Time:                       Tue Aug  1 12:05:55 2023        ||
      [PB Info 2023-Aug-01 12:07:41] ||        End Time:                         Tue Aug  1 12:07:41 2023        ||
      [PB Info 2023-Aug-01 12:07:41] ||        Total Time:                            1 minute 46 seconds        ||
      [PB Info 2023-Aug-01 12:07:41] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:41] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:41] ||                 Parabricks accelerated Genomics Pipeline                 ||
      [PB Info 2023-Aug-01 12:07:41] ||                              Version 4.0.0-1                             ||
      [PB Info 2023-Aug-01 12:07:41] ||                             Sorting Phase-II                             ||
      [PB Info 2023-Aug-01 12:07:41] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:41] progressMeter - Percentage
      [PB Info 2023-Aug-01 12:07:41] 0.0       0.00 GB
      [PB Info 2023-Aug-01 12:07:51] Sorting and Marking: 10.001 seconds
      [PB Info 2023-Aug-01 12:07:51] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:51] ||        Program:                                  Sorting Phase-II        ||
      [PB Info 2023-Aug-01 12:07:51] ||        Version:                                           4.0.0-1        ||
      [PB Info 2023-Aug-01 12:07:51] ||        Start Time:                       Tue Aug  1 12:07:41 2023        ||
      [PB Info 2023-Aug-01 12:07:51] ||        End Time:                         Tue Aug  1 12:07:51 2023        ||
      [PB Info 2023-Aug-01 12:07:51] ||        Total Time:                                     10 seconds        ||
      [PB Info 2023-Aug-01 12:07:51] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:51] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:51] ||                 Parabricks accelerated Genomics Pipeline                 ||
      [PB Info 2023-Aug-01 12:07:51] ||                              Version 4.0.0-1                             ||
      [PB Info 2023-Aug-01 12:07:51] ||                         Marking Duplicates, BQSR                         ||
      [PB Info 2023-Aug-01 12:07:51] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:07:51] progressMeter -  Percentage
      [PB Info 2023-Aug-01 12:08:01] 0.0       4.38 GB
      [PB Info 2023-Aug-01 12:08:11] 0.0       4.38 GB
      [PB Info 2023-Aug-01 12:08:21] 0.0       4.38 GB
      [PB Info 2023-Aug-01 12:08:31] 0.0       4.38 GB
      [PB Info 2023-Aug-01 12:08:41] 0.0       4.38 GB
      [PB Info 2023-Aug-01 12:08:51] 3.4       4.28 GB
      [PB Info 2023-Aug-01 12:09:01] 100.0     0.00 GB
      [PB Info 2023-Aug-01 12:09:01] BQSR and writing final BAM:  70.031 seconds
      [PB Info 2023-Aug-01 12:09:01] ------------------------------------------------------------------------------
      [PB Info 2023-Aug-01 12:09:01] ||        Program:                          Marking Duplicates, BQSR        ||
      [PB Info 2023-Aug-01 12:09:01] ||        Version:                                           4.0.0-1        ||
      [PB Info 2023-Aug-01 12:09:01] ||        Start Time:                       Tue Aug  1 12:07:51 2023        ||
      [PB Info 2023-Aug-01 12:09:01] ||        End Time:                         Tue Aug  1 12:09:01 2023        ||
      [PB Info 2023-Aug-01 12:09:01] ||        Total Time:                            1 minute 10 seconds        ||
      [PB Info 2023-Aug-01 12:09:01] ------------------------------------------------------------------------------

       

fq2bam performs the following steps.

    fastq2bam_steps.png

    1. ![Untitled](Generate%20BAM%20CRAM%20output%20given%20one%20or%20more%20pairs%20o%20e4ac9eaca6f045bdb85982e7933e7c6d/Untitled.png)
      Compatible CPU-based BWA-MEM, GATK4 Commands

      ```


      #install bwa

      git clone https://github.com/lh3/bwa.git

      cd bwa

      make

      srun --pty -p batch --cpus-per-task=32  --mem=100G bash 
      singularity run /pfss/containers/gatk.sif bash

      ./bwa index <Your Ref directory>/human_g1k_v37.fasta

      ./bwa mem -t 32 -K 10000000 -R '@RG\tID:SRR9932168.1.1 \tLB:lib1\tPL:bar\tSM:sample\tPU:SRR9932168.1.1 ' \
      <Your Ref directory>/Ref/human_g1k_v37.fasta \
      <Your Sample directory>/SRR9932168_1.fastq \
      <Your Sample directory>/SRR9932168_2.fastq | \ 
      gatk SortSam \ 
      --java-options -Xmx30g \ 
      --MAX_RECORDS_IN_RAM 5000000 \ 
      -I /dev/stdin \ 
      -O <Your Output directory>/cpu.bam \ 
      --SORT_ORDER coordinate

      # for max spot id 5000000 spent 2.72 mins for sorting, 2.2 mins for convert to BAM

      [main] CMD: ./bwa mem -t 32 -K 10000000 -R @RG\tID:SRR9932168.1.1 \tLB:lib1\tPL:bar\tSM:sample\tPU:SRR9932168.1.1  /pfss/scratch02/appcara/parabricks/parabricks_sample/Ref/human_g1k_v37.fasta /pfss/scratch02/appcara/parabricks/sratoolkit.3.0.6-centos_linux64/SRR9932168_1.fastq /pfss/scratch02/appcara/parabricks/sratoolkit.3.0.6-centos_linux64/SRR9932168_2.fastq 
      [main] Real time: 151.130 sec; CPU: 3594.553 sec 
      INFO    2023-08-01 05:58:33     SortSam Finished reading inputs, merging and writing to output now. 
      INFO    2023-08-01 05:58:47     SortSam Wrote    10,000,000 records from a sorting collection.  Elapsed time: 00:02:43s.  Time for last 10,000,000:   14s.  Last read position: */* 
      [Tue Aug 01 05:58:47 GMT 2023] picard.sam.SortSam done. Elapsed time: 2.72 minutes. 
      Runtime.totalMemory()=1409286144 
      Tool returned:
      0

      0

      # Mark duplicates.

      gatk MarkDuplicates \ 
      --java-options -Xmx30g \ 
      -I <Your Output directory>/cpu.bam \ 
      -O <Your Output directory>/mark_dups_cpu.bam \ 
      -M metrics.txt

      # spend 1 mins

      # Generate a BQSR report.

      gatk BaseRecalibrator \ 
      --java-options -Xmx30g \ 
      --input <Your Output directory>/mark_dups_cpu.bam \ 
      --output <Your Output directory>/recal_cpu.txt \ 
      --known-sites <Your Known Site directory>/00-common_all.vcf \ 
      --reference <Your Ref directory>/human_g1k_v37.fasta

      # spend 1.68 mins

      ```

      One of the main advantages of Parabricks is its speed. The software is capable of analyzing large datasets in a fraction of the time it would take traditional tools to complete the same task. Additionally, Parabricks is highly scalable, meaning it can be used to analyze datasets of varying sizes without sacrificing performance.

      Overall, Parabricks is a powerful tool for generating BAM/CRAM output from one or more pairs of FASTQ files. Its combination of speed, accuracy, and scalability make it an ideal choice for researchers and scientists who need to process large amounts of genomic data quickly and efficiently.