Bioinformatics

In many cases, microdroplet workflows are paired with data analysis using next-generation sequencing (NGS). The high-throughput capabilities of these technologies make them a natural complement to one another. Here we describe several bioinformatic tools in use or under development in the Abate Lab for microdroplet experimental analysis.

Barcode Clustering

In single-cell experiments, short oligonucleotides co-encapsulated with cells serve as unique cellular barcodes. Analysis of NGS data begins by grouping raw reads according to their barcode sequence in order to recapitulate the genomic signatures of single cells. During this deconvolution step, reads whose barcode sequences have been mutated, whether by PCR during library preparation or on the sequencer itself, are not properly grouped. Typically, these are single-point mutations, and, for sufficiently large barcoding spaces, we can correct for these small errors by comparing the sequences of all barcodes in an experiment.
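The grouping step can be sketched as follows. This is a minimal illustration, not the lab's pipeline: it assumes that each read is a plain string with a fixed-length barcode at its start, and the names `BARCODE_LENGTH` and `group_reads_by_barcode` are hypothetical.

```python
from collections import defaultdict

# Hypothetical barcode length; real libraries define their own layout.
BARCODE_LENGTH = 8


def group_reads_by_barcode(reads, barcode_length=BARCODE_LENGTH):
    """Group read inserts by the barcode assumed to sit at the read start."""
    groups = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]
        groups[barcode].append(insert)
    return dict(groups)


# Toy reads: the third read carries a barcode one mismatch away from the
# first two, so exact-match grouping places it in a separate group.
reads = [
    "AAAACCCCGATTACA",
    "AAAACCCCTTGACCA",
    "AAAACCCGGATTACA",
    "GGGGTTTTCATCATC",
]
groups = group_reads_by_barcode(reads)
```

In this toy input, a single mismatch in the barcode is enough to split one cell's reads across two groups, which is the problem the comparison of barcode sequences is meant to address.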

 
Illustration of mutated barcode groups [red] and isolated barcodes [green] (from Shahi et al., 2017)


 

If two barcodes share similar sequences, errors during library amplification and sequencing can create spurious groups comprising reads from the two originally distinct groups; this, in turn, can mix reads from multiple cells in a group that appears genuine. To remove these groups and consider only single-cell data, barcode sequences are processed to extract “well isolated” clusters using an algorithm that compares all barcodes and clusters together groups that are within a Hamming distance of one of each other. This generates two types of clusters: those containing a single, well-isolated barcode sequence that is more than a Hamming distance of one from any other barcode group (green points), and those that contain multiple barcode groups (red clusters). Clusters of multiple mutated barcodes can then be excluded from downstream analysis.
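One way to sketch this clustering is shown below. It is an illustrative implementation under stated assumptions, not the algorithm used in the cited work: all barcodes are assumed to have equal length and to use the alphabet ACGT, and the function names are hypothetical. Rather than comparing every pair of barcodes directly, it enumerates the single-substitution neighbors of each barcode and merges any that were observed, which yields the same clusters for a Hamming distance of one.

```python
from collections import defaultdict


def hamming_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence exactly one substitution away from barcode."""
    for i, base in enumerate(barcode):
        for sub in alphabet:
            if sub != base:
                yield barcode[:i] + sub + barcode[i + 1:]


def cluster_barcodes(barcodes):
    """Cluster observed barcodes linked by chains of Hamming-distance-one steps."""
    observed = set(barcodes)
    parent = {b: b for b in observed}

    def find(b):
        # Union-find lookup with path halving.
        while parent[b] != b:
            parent[b] = parent[parent[b]]
            b = parent[b]
        return b

    for b in observed:
        for neighbor in hamming_neighbors(b):
            if neighbor in observed:
                parent[find(b)] = find(neighbor)

    clusters = defaultdict(set)
    for b in observed:
        clusters[find(b)].add(b)
    return list(clusters.values())


def isolated_barcodes(barcodes):
    """Return barcodes whose cluster contains no other observed barcode."""
    clusters = cluster_barcodes(barcodes)
    return sorted(next(iter(c)) for c in clusters if len(c) == 1)


# Toy input: three barcodes form a chain of single mismatches, one is isolated.
barcodes = ["AAAACCCC", "AAAACCCG", "AAATCCCG", "GGGGTTTT"]
clusters = cluster_barcodes(barcodes)
isolated = isolated_barcodes(barcodes)
```

Here the first three barcodes fall into one multi-barcode cluster (the red case in the illustration), while `GGGGTTTT` is well isolated (the green case) and would be retained.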


De Novo Genome Assembly

A primary challenge in metagenomics is combining short reads into long, contiguous sequences of nucleic acid, a process known as “genome assembly.” The goal of de novo assembly is to build a new reference genome, the complete sequence of an organism’s DNA. Accurate reference genomes are highly valued in metagenomics because they serve as blueprints of cellular identity, which can be used to analyze the microbial profile of samples by short-read alignment. Rare species and highly similar strains are difficult, if not impossible, to resolve using typical shotgun sequencing assembly methods. One solution is to use barcoded read data, whether from a single long molecule or a whole cell, to guide the assembly process in these problematic cases.

 
Using connectivity information from single-cell barcodes, contigs from different bacterial species cluster together following dimensionality reduction


 

SiC-seq produces low-coverage genomic data from single cells, which can be used to perform bacterial genome scaffolding. Contigs from metagenomic experiments are typically grouped using differential binning methods, or using sequence-based metrics such as tetranucleotide frequency and codon usage. While useful, these methods are limited because they are performed strictly in silico. The SiC-seq microfluidic workflow, on the other hand, compartmentalizes and barcodes each cell individually, preserving true single-cell data. These barcoded fragments of nucleic acid provide an additional layer of read connectivity information that can be combined with existing bioinformatic methods to enhance the assembly of rare genomes and highly similar bacterial strains.
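As a rough illustration of how barcode connectivity might link contigs, the sketch below scores pairs of contigs by the fraction of single-cell barcodes they share (Jaccard similarity). This is a simplified stand-in chosen for clarity, not the lab's method, which the figure caption describes in terms of dimensionality reduction. The input format, the threshold `min_jaccard`, and the function names are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def barcode_sets(alignments):
    """Map each contig to the set of barcodes whose reads aligned to it.

    alignments is an iterable of (barcode, contig) pairs (assumed format).
    """
    sets = defaultdict(set)
    for barcode, contig in alignments:
        sets[contig].add(barcode)
    return sets


def contig_links(alignments, min_jaccard=0.5):
    """Return contig pairs whose barcode sets overlap by at least min_jaccard."""
    sets = barcode_sets(alignments)
    links = {}
    for a, b in combinations(sorted(sets), 2):
        shared = len(sets[a] & sets[b])
        union = len(sets[a] | sets[b])
        score = shared / union
        if score >= min_jaccard:
            links[(a, b)] = score
    return links


# Toy alignments: contig_A and contig_B are hit by the same two cells,
# while contig_C shares only one barcode with contig_A.
alignments = [
    ("bc1", "contig_A"), ("bc1", "contig_B"),
    ("bc2", "contig_A"), ("bc2", "contig_B"),
    ("bc3", "contig_C"),
    ("bc4", "contig_C"), ("bc4", "contig_A"),
]
links = contig_links(alignments)
```

Contigs that are repeatedly co-barcoded are candidates for belonging to the same genome; in practice such a score would be combined with coverage and sequence-composition metrics rather than used alone.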


 

Visit the Abate Lab on GitHub!