The Pediatric Brain Tumor Atlas (PBTA) is a collaborative effort to accelerate discoveries for therapeutic intervention for children diagnosed with a brain tumor. The first PBTA dataset was released on September 10, 2018. This release covers more than 30 types of pediatric brain tumors from more than 1,000 subjects. Data types include matched tumor/normal whole genome sequencing (WGS) data, RNAseq, proteomics, longitudinal clinical data, imaging data (MRIs and radiology reports), histology slide images, and pathology reports. The CBTTC promotes real-time data release with no embargo period.
To search the PBTA files and associated metadata, please visit the Kids First DRC Portal. There you can identify genomic and associated files and integrate them into a Cavatica workspace for research on the raw genomic data. PBTA summary data are available on PedcBioPortal for analysis and visualization.
Researchers can request access to the raw genomic data by completing and submitting the CBTTC Data Access Request form.
A digital object identifier (DOI) has been assigned to the PBTA and can be used for reference: https://doi.org/10.24370/SD_BHJXBDQK
Funding for this initiative was provided by over 50 foundation sponsors.
Genomic Harmonization details for the PBTA
The Kids First Data Resource Center (DRC) has developed and applied alignment and joint genotyping workflows that follow the GATK Best Practices recommendations, with the goal of being functionally equivalent to other current large genomic research efforts. Data processing is done on the Cavatica platform within an Amazon Web Services (AWS) environment. The harmonized results are stored in AWS, made searchable via the DRC Portal, and can be analyzed further on Cavatica.
In more detail, the harmonization process starts with an alignment workflow that accepts BAM, CRAM, or FASTQ input, or a mix of these, and converts it into uBAM (unmapped BAM) with Picard RevertSam. The uBAM is then aligned, by read group, to the human genome reference hg38, which includes improved ALT contigs and HLA loci. Next, Picard MarkDuplicates, SortSam, and MergeBamAlignment are applied in a scatter fashion, with jobs parallelized across split chromosome intervals. Base Quality Score Recalibration (BQSR) is then applied using a model built from the known SNPs and indels of HapMap, 1000 Genomes, dbSNP138, and the Mills Gold Standard calls. Lastly, GATK4 HaplotypeCaller generates a single-sample gVCF, and the merged BAM is converted into CRAM; together these are the final alignment outputs.
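The scatter step described above can be illustrated with a minimal Python sketch. It is not taken from the Kids First workflow: the window size, the number of jobs, the greedy balancing strategy, and the toy contig names and lengths are all assumptions chosen for illustration. It only shows the general idea of cutting contigs into intervals and spreading them across parallel jobs.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def split_contig(contig: str, length: int, chunk: int) -> List[Interval]:
    """Cut one contig into consecutive windows of at most `chunk` bases."""
    return [
        (contig, start, min(start + chunk - 1, length))
        for start in range(1, length + 1, chunk)
    ]


def scatter(contigs: List[Tuple[str, int]], chunk: int, n_jobs: int) -> List[List[Interval]]:
    """Distribute windows over n_jobs groups with a greedy balance.

    Each window goes to the group that currently holds the fewest bases.
    """
    windows = [w for name, length in contigs for w in split_contig(name, length, chunk)]
    groups: List[List[Interval]] = [[] for _ in range(n_jobs)]
    loads = [0] * n_jobs
    # Placing the largest windows first tends to give a more even split.
    for w in sorted(windows, key=lambda w: w[2] - w[1] + 1, reverse=True):
        lightest = loads.index(min(loads))
        groups[lightest].append(w)
        loads[lightest] += w[2] - w[1] + 1
    return groups


# Hypothetical toy contigs; these are NOT real hg38 contig names or sizes.
TOY_CONTIGS = [("chrA", 1000), ("chrB", 600), ("chrC", 250)]
groups = scatter(TOY_CONTIGS, chunk=400, n_jobs=3)

for job_id, job in enumerate(groups):
    print(job_id, job)
```

Each inner list would correspond to one parallel job; a real pipeline would typically take its intervals from a curated interval list rather than from fixed-size windows.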
For the joint genotyping workflow, trio-based and cohort-based gVCFs are imported into a GenomicsDB by GATK4 and passed on to GenotypeGVCFs. Variant Quality Score Recalibration (VQSR) is applied to SNPs and indels separately, scattered by calling interval. A final VCF and QC report are produced by GATK4 GatherVcfs and CollectVariantCallingMetrics. All outputs are then registered in the Kids First Data Service for tracking and final checking of results and, after approval, are released to the DRC Portal and Cavatica.
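As a rough illustration of the import, genotype, and gather sequence, the Python sketch below assembles GATK4 command lines for a small trio. It is a simplified assumption, not the Kids First implementation: the file names, workspace names, and interval list are hypothetical, the VQSR and metrics steps are omitted, and the commands are only built and printed, never executed. The published workflows remain the authoritative definition of the options actually used.

```python
from typing import List


def genomicsdb_import_cmd(gvcfs: List[str], workspace: str, interval: str) -> List[str]:
    """Build a GenomicsDBImport command for one calling interval."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]  # one -V per input gVCF
    return cmd


def genotype_gvcfs_cmd(reference: str, workspace: str, interval: str, out_vcf: str) -> List[str]:
    """Build a GenotypeGVCFs command that reads from a GenomicsDB workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-L", interval,
        "-O", out_vcf,
    ]


def gather_vcfs_cmd(shards: List[str], out_vcf: str) -> List[str]:
    """Build a GatherVcfs command that concatenates per-interval VCFs in order."""
    cmd = ["gatk", "GatherVcfs"]
    for shard in shards:
        cmd += ["-I", shard]
    return cmd + ["-O", out_vcf]


def plan_joint_genotyping(gvcfs: List[str], reference: str,
                          intervals: List[str], final_vcf: str) -> List[List[str]]:
    """Return the ordered commands: import and genotype per interval, then gather."""
    commands: List[List[str]] = []
    shards: List[str] = []
    for n, interval in enumerate(intervals):
        workspace = f"genomicsdb_{n}"   # hypothetical workspace name
        shard = f"shard_{n}.vcf.gz"     # hypothetical per-interval output
        commands.append(genomicsdb_import_cmd(gvcfs, workspace, interval))
        commands.append(genotype_gvcfs_cmd(reference, workspace, interval, shard))
        shards.append(shard)
    commands.append(gather_vcfs_cmd(shards, final_vcf))
    return commands


# Hypothetical trio inputs, for illustration only.
plan = plan_joint_genotyping(
    gvcfs=["father.g.vcf.gz", "mother.g.vcf.gz", "proband.g.vcf.gz"],
    reference="hg38.fa",
    intervals=["chr1", "chr2"],
    final_vcf="trio.vcf.gz",
)
for cmd in plan:
    print(" ".join(cmd))
```

Keeping the commands as argument lists rather than shell strings makes them easy to inspect or hand to a job scheduler, and the per-interval loop mirrors the scatter-by-interval structure described above.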
Kids First DRC pipelines are open source and made available to the public via GitHub:
Alignment workflow: https://github.com/kids-first/kf-alignment-workflow
Joint genotyping workflow: https://github.com/kids-first/kf-jointgenotyping-workflow