Short-read sequencing has enabled the de novo assembly of a few individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays, and generate a de novo assembly of 2.93Gb (contig N50: 8.3Mb, scaffold N50: 22.0Mb, including 39.3Mb N-bases), together with 206Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8Mb of HX1-specific sequences, including 4.1Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
|Material|Platform|# Cells|# Reads|Bases|Coverage|Mean length|N50 length|
|---|---|---|---|---|---|---|---|
|DNA|Illumina HiSeq X|-|2.8 billion reads|428.8G|143X|151bp|151bp|
|DNA|PacBio SMRT cell|377 cells|44.2M reads|309.0G|103X|7.0Kb|12.1Kb|
|DNA|BioNano IrysChip|12 cells|1.169M molecules (>150kb)|302.8G|101X|259.0Kb|224.7Kb|
|RNA|PacBio SMRT cell|50 cells (1-2kb, 2-3kb, 3-5kb, 5kb+)|2.721M error-corrected reads|5.8G|-|2.1Kb|2.7Kb|
|RNA|Illumina HiSeq 2500|-|48.9M reads|4.4G|-|90bp|90bp|
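The Coverage and N50 length columns can be reproduced from raw lengths with a few lines of Python. The following is a minimal sketch, not the pipeline used to produce the table: the function names `n50` and `fold_coverage` are illustrative, and the 3.0Gb genome size is an assumption inferred from the table (309.0G bases at 103X), not a figure stated on this page.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 requires at least one length")


def fold_coverage(total_bases, genome_size):
    """Return fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


# Assumed genome size of 3.0 Gb (inferred from the table, not stated in it).
ASSUMED_GENOME_SIZE = 3.0e9

# PacBio DNA row: 309.0G bases.
pacbio_coverage = fold_coverage(309.0e9, ASSUMED_GENOME_SIZE)

# Toy example: lengths sum to 20, and the two longest (6 + 5) pass half.
toy_n50 = n50([2, 3, 4, 5, 6])
```

The same `n50` function applies to read lengths, contig lengths, or scaffold lengths; only the input list changes.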
Raw data can be accessed from SRA.
non-GRCh38 sequences in HX1
non-GRCh38 non-YH2.0 sequences in HX1
CNV and SV calls on BioNano, SMRT long read, Illumina short read, and microarray data
Whole genome alignment (hg38 full analysis + decoy vs hx1f4full_3rdfixedv2) by MUMmer3
Kai Wang: email@example.com