The challenge of kmer-based basecalling

As we discussed in a previous post, the Oxford Nanopore system reads single-stranded DNA as 5-mers, which are ratcheted through a nanopore one base at a time. The resulting raw electrical signals are measured and classified as ‘events’. These events are then sent to the Metrichor cloud-based basecaller, which transforms them into individual base-calls.

Using our own dataset I’ve extracted some of the event measurements (thanks to Nick Loman’s burgeoning toolset for saving some work) for different types of kmer. I wanted to illustrate the challenges of trying to basecall in this brave new world. First, let’s look at something which looks straightforward but actually isn’t when you’re dealing with kmer-based models.

Let’s assume we want to sequence a strand of DNA with a sequence TTTTTATTTTT. Let’s also assume that we’re particularly interested in detecting the A in the middle. At each ratchet of the enzyme feeding the DNA into the pore, the pore will be reading:

t=1: TTTTT

t=2: TTTTA

t=3: TTTAT

t=4: TTATT

t=5: TATTT

t=6: ATTTT

t=7: TTTTT

Similarly if we want to sequence TTTTTCTTTTT and we’re particularly interested in the C we’ll have:

t=1: TTTTT

t=2: TTTTC

t=3: TTTCT

t=4: TTCTT

t=5: TCTTT

t=6: CTTTT

t=7: TTTTT
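The two lists above are just a sliding window of width 5 over each sequence. Here is a minimal Python sketch that reproduces them and flags the time steps at which the two sequences differ. The function name `kmers` and the variable names are my own choices for illustration, not part of any ONT or Metrichor tool.

```python
def kmers(seq, k=5):
    """Return the successive k-mers read as seq is ratcheted through one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq_a = "TTTTTATTTTT"
seq_c = "TTTTTCTTTTT"

# Print the k-mer in the pore at each step for both sequences, side by side.
for t, (ka, kc) in enumerate(zip(kmers(seq_a), kmers(seq_c)), start=1):
    flag = "" if ka == kc else "  <- differs"
    print(f"t={t}: {ka}  {kc}{flag}")
```

The single base of interest contributes to five consecutive kmers (t=2 to t=6), so there is no single event that corresponds to "the A" or "the C" on its own.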

 

[Figure "difficult": current against time for the two sequences above]

Note that in reality the time steps are not evenly spaced. There’s stochasticity, and I suspect a kmer-dependence of the translocation rate. Plotting the current against time demonstrates the potential difficulties we might have in distinguishing these two sequences, even in the idealised case of even time steps. This is a very extreme case, since the two sequences are identical except for a single base, but I’ve used it to illustrate the point.

Other cases don’t seem quite as problematic. Here are the comparable traces for a more complex sequence.

[Figure "easier": the comparable traces for a more complex sequence]

 

Looking at these plots, it is remarkable that despite the noise the MinION is still capable of generating useful data. Base-calling is going to be a hot topic of research.

2D basecalling should help, in that the template read will have a different kmer composition to the complement at the same locus. Any increase in the accuracy of 1D basecalls (i.e. what we have above) should improve the accuracy of 2D basecalls by an even greater proportion, since we have two shots at reading the same part of the genome.
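To make the point about kmer composition concrete, here is a small sketch that takes the TTTTTATTTTT example from earlier, works out the complement strand as it would be read 5' to 3' (the reverse complement), and compares the kmers seen on each strand. The helper names are mine, and this only looks at sequence, not at any real signal.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=5):
    """Return the successive k-mers in a sliding window over seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


template = "TTTTTATTTTT"
complement = reverse_complement(template)

print("template:  ", kmers(template))
print("complement:", kmers(complement))
print("shared kmers:", set(kmers(template)) & set(kmers(complement)))
```

For this locus the template strand is reading a lone A in a run of Ts, while the complement is reading a lone T in a run of As, so the two strands share no kmers at all. Whether the second set is any easier to tell apart in practice depends on the kmer current levels, which this sketch says nothing about.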

At the moment the 2D basecalling doesn’t seem particularly sophisticated and will insert Ns wherever there are disagreements between template and complement reads. Please correct me if I’m wrong here.
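For illustration only, this is roughly what "insert an N wherever the two reads disagree" would look like. It is a toy sketch of the behaviour I've described, not Metrichor's actual algorithm, which I haven't seen. It assumes the template call and the complement call have already been put in the same orientation and aligned position by position with no gaps, which is a big simplification. The function name `naive_2d_consensus` is hypothetical.

```python
def naive_2d_consensus(template_call, complement_call):
    """Toy consensus: keep a base where both calls agree, otherwise emit N.

    Assumes both calls are in the same orientation, the same length and
    already aligned position by position (no insertions or deletions).
    """
    if len(template_call) != len(complement_call):
        raise ValueError("calls must be pre-aligned to the same length")
    return "".join(
        t if t == c else "N"
        for t, c in zip(template_call, complement_call)
    )


# Suppose the template call got the middle base right and the
# (re-oriented) complement call got it wrong.
print(naive_2d_consensus("TTTTTATTTTT", "TTTTTCTTTTT"))
```

A scheme like this throws away information at exactly the positions where it matters most, which is why weighting the two calls by some measure of confidence seems like an obvious thing to try.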

 
