Table of Contents
“I'm all in favour of keeping dangerous weapons out of the hand of fools. Let's start with typewriters.” | ||
--Solomon Short |
...
Life's a rollercoaster and there are days - or weeks - where moral is on a pretty hefty ride: ups and downs in fast succession ... and the occasional looping here and there.
Today was a day where I had - the first time ever - ups and downs occuring absolutely simultaneously. Something which is physically impossible, I know, but don't tell any physicist or astronomist about that or else they'll embark you on a lengthy discussion on how isochronicity is a myth by telling you stories on lightning, thunder and two poor sobs at the ends of a 300,000 km long train. But I digress ...
So, my lowest low and highest high today were at 09:17 this morning when I prepared leaving for work (hey, it's vacation time, almost everyone else is out and I can go a bit later than usual, right?). A few minutes earlier I had just told MIRA to run on the very same PacBio test set she had successfully worked on the night before to see how stable assemblies with this kind of data are (quite well so far, thank you for asking).
Reaching out to switch off the monitor and leave, MIRA suddenly came back with a warm and cosy little error message which she's taken the habit lately to have a mischievous pleasure to present. This time, she claimed there had been an illegal base in the FASTQ file.
“Hey, MIRA, wait a minute!” I thought. “Yesterday and tonight you ran on the very same data file with the very same parameters for two times three hours and even gave me back some nice assembly results. And now you claim that the INPUT data has errors?! Come on, you're not serious, are you?”
As a side note: she then just gave me back that look, you know, the one with those big open eyes almost hidden behind by long, dark lashes ... and slightly flushed cheeks accompanied by pointed lips. As if she wanted to say “I *am* innocent and *I* did no nothing wrong, you disbeliever!” (http://24.media.tumblr.com/tumblr_lj3efmmDL01qasfhmo1_400.png). This usually announces a major pouting round of hers, something which I'm not looking forward to, I can tell you.
Two restarts later with the same negative result (MIRA can be quite stubborn at times) I had to give in and decided to sit down again and investigate the problem.
“So ... read number 317301 at base position 246, eh? Let's have a look."”
*clickedyclick*
“Read 317299, 317300 ... 317301 ... there we are.”
*hackedyhack*
“Base position 239, 240 ... now: C G G G T C F
A A
... wait! What? 'F' ... 'F'?!? That's
not even an IUPAC code. What's a frakking 'F' doing in the FASTQ input
file?!” (see http://www.youtube.com/watch?v=r7KcpgQKo2I, conditionally safe
for work).
Indeed, 'F' is not IUPAC. Even more mysterious to me was the fact that just the night before it apparently had not been there. Or had it? I now was pretty unsure where this path would lead me, as if I had unlocked a door with the key of imagination. Beyond it: another dimension - a dimension of sound, a dimension of sight, a dimension of mind. I was moving into a land of both shadow and substance, of things and ideas. I just crossed over into ... the Twilight Zone ("G#-A-G#-E-G#-A-G#-E" at 128 bpm, for more info see http://www.youtube.com/watch?v=zi6wNGwd84g).
Where was I? Ah, yes, the 'F'.
So, how did that 'F' appear in the FASTQ, and where had it been the night before? Out to town, ashamed of not being a nucleotide and getting a hangover without telling anyone up-front? Or did it subreptitiously sneak in from the outside, murdering an innnocent base and taking its place in hope no one would note? I didn't have the slightest clue, but I was determined to find that out.
First thing to check: the log files of the successful runs the previous night. MIRA's very chatty at times and tidying up after her has always been a chore (there, I'm feeling that look again in my back), but now was one of those occasions where not gagging her paid out as poking around the files she left behind proved to be interesting. Read 317301 showed the following at the position in doubt: "C G G G T C ___G___ A A" Without question: a 'G', and no 'F' in sight!
So MIRA had been right and the 'G' in the sequence of the file mysteriously mutated into an 'F' overnight. I must admit that I had grown suspicious of her in the past few weeks as she had seemed to become uncooperative at times. In particular she had been screaming at me a couple of times during rehearsal of combined 454 and Illumina assemblies for the premiere of her new 3.4.0 show. She claimed that some uninvited spider monkeys (see http://dict.leo.org/ende?search=Klammeraffe) had frightened her so much she refused to continue to work and simply scribbled the '@' sign all over her error messages. I had not been able to find out how those critters entered MIRA's data and had even enrolled a few volunteers to rehearse different assemblies with MIRA ... to no avail as she'd performed without flaws there.
While reconsidering all these things, something suddenly made *click*.
The character 'G' has the hexadecimal ASCII table code 0x47 (or in 8-bit binary: 01000111). 'F', as preceding character of 'G' and the table having some logic behind it, has the hex code 0x46, which is 01000110 in 8-bit binary.
The ATINSEQ-bug (@-in-seq) I had been desperately hunting in the past few weeks (and which had held up the release of MIRA 3.4.0) was due to the "@" character sometimes mysteriously appearing in sequences during the assembly of MIRA. The '@' sign in the ASCII table has the hex code 0x40 (binary: 01000000). In the ASCII table, there is one important character for DNA assembly which is very near to the '@' character ..., so near that it is the successor of it: the 'A' character. Hexadecimal 0x41, binary 01000001.
I had always thought that a bug in MIRA somehow corrupted the sequence, but what if ... what if MIRA was actually really innocent?! I had never taken this possibility into account as any other explanation attempt would have seemed too far stretched.
But now I had a similar effect *outside* of MIRA, in the Linux filesystem!
The difference between the characters is in both cases exactly 1 bit which changes, and it's even at the same position (last one in a byte) and changing into the same direction (from '1' to '0').
I was now sure I was on to something: bit decay (see http://en.wikipedia.org/wiki/Bit_rot)
But how could I prove it? Well, elementary my dear Watson: When you have eliminated the impossible, whatever remains, however improbable, must be the truth.
Suspects:
Facts:
Deduction 1:
Further facts:
Deduction 2:
Further facts:
Deduction 3:
From deductions 1, 2 & 3 follows:
Suspects left:
Well, that can be easily tested: shut down the computer, restart it and subsequently look at the file again. No file cache in RAM can survive that procedure. Yes, I know, there are some magic incantations one can chant to force Linux to flush all buffers and clear all caches, but in that situation I was somehow feeling conservative.
Lo and behold, after the above procedure the FASTQ file showed an all regular, good old nucleic acid 'G' in the file again. No 'F' to be seen anywhere.
Deduction 4:
Deduction 5:
As I wrote: life's a rollercoaster.
Have a nice Friday and a good week-end,
Bastien
PS: while celebrating with MIRA tonight, I expressed my fear that some people might find it strange that I anthropomorphise her. They could think I went totally nuts or that I needed an extended vacation (which I do btw). She assured me that no one would dare thinking I were insane ... and if so, she would come over to their place and give them that look.
How utterly reassuring.