Recovery algorithm details

Bead Matrix

Within slideseq dataset the left end of the read contains 42 base-pairs. Each of this 42 bp sequence is structured in a pre-defined arrangement known as bead-matrix. The arrangement dictates which base-pairs to consider while forming the bead barcode and the UMI.

In our experiment the bead-matrix is ‘8C18X6C1X8M1X’

'C' - bead barcode
'X' - discard
'M' - UMI barcode

According to the previous structure we have the following scenario, where the bead-barcode consists of 8+6 = 14 base-pairs. There is 18 base-pair spacer between the two different barcodes. This spacer should be fixed and placed in the predefined location (in this case 9th position) but due to noise and quality of the beads the spacer can have noisy bases and is shifting from it’s original position.

For example in the beads we see

image0

ATGCGAATTCTTCAGCGTTCCCGAGACGATCAGCCATTGCTT
AGCGCCTCTTCAGCGTTCCCGAGACGGTTACTTCCACTAGAT -- If there are more than one shift we can not much (not recoverable)
ATTCGGTTCTTCAGCGTTCCCGAGATACCGCTGTGGGTTTTT -- There is possibility of recovery
GACCTAATCTTCAGCGTTCAGTCATAATCCCACAGATGGTAC -- no
GGGGTCCTTCTTCAGCGTTCCCGAGACTTAGTTCAGCGTCTT --
CGCAGACTTCTTCAGCGTTCTTTTTAAAATATCTGTAGAGGC
TCCGTCCCTCTTCAAAGCAGGCCCGAGCCGCCTGGATACCGC
AGTCAATTTCTTCAGCGTTCCGAGAGCTAAACGGCTGCTTTT
CTTCCAATTTTTCAGCGTTCCCGGACCAAGCAAGTTAGTCTT

70K barcode of 14bp long — Exact match to create a 60K whitelist

Johan has prepared a very comprehensive comparison of the base qualities. image1

One thing to notice is that the quality drops a lot therefore we need to correct these errors by algorithmic means.

Current Algorithm

Index Building

image2

The whitelist contains around 70K barcodes. The data from Johan contains around 118M reads. The index is organized with two hash tables. First one containing the first 8bp part of the barcodes and the second hash table that contains the 6bp. We build a connection between them via their co-occurrence in the white list.

Building the whitelist consists of two parts. In the first part we filter out the reads that has a perfect match with the spacer in the expected position (start position 8). We know that the bead barcode consists of the first 8bp and then the 6bp after the spacer. The initial whitelist is built by clipping this 14bp from the reads filtered in the previous step. The expected number of bead barcodes in a puck is ~70K. However In most of the cases because of noise, we end up with more bead barcodes. In order to extract the expected number of bead barcodes, we built a frequency table for these barcodes as shown in Figure 1b, the y-axis shows the frequency of the bead barcodes. The horizontal line shows the frequency cutoff we can use and the resulting number of unique bead barcodes when using such a threshold.

The final whitelist consists of around ~70K bead barcodes.

Out of all reads in our data 38% matches exactly with one of the barcodes in the whitelist (out of these around 3% has a wrong spacer sequence). The aim of the alignment algorithm is to recover the rest 62% reads.

The alignment algorithm uses different heuristics and can be parameterized to tune the number final reads that can be recovered. Alignment algorithm made use of the structure shown in Figure 1c.

The recovery algorithm starts with the alignment of the spacer to the actual read. A typical spacer alignment is shown in Figure 2.

image3

Using this method we recover around 7% of the noisy reads. This leads us to 45% 56% of the beads to recover.

Tuning the allowable rate

There are a couple of different ways to make this algorithm flexible to allow more number of beads at the expense of accepting reads that are less sensitive to the substitution noise.

1. We can increase the size of the whitelist to add more bead barcodes in the index. Increasing the index size accommodate wide range of mapping.
2. Increasing the edit distance threshold.
3. Increasing the length of the reference used to search for the second part.

Instead of stopping at the expected position of the UMI the search can further progress to the end of sequence.

The above mentioned algorithm is implements in scripts/recoverbam.py