The Grand Locus / Life for statistical sciences

Why do bioinformatics?

I never planned to do bioinformatics. It just happened because I liked to spend time in front of my computer and my boss was OK with it. Still, like every sane individual, I sometimes think that I should do something else with my life, and I wonder whether I am doing the right thing. On this topic, I recently came across the famous farewell to bioinformatics by Frederick J. Ross, which is worth reading, and whose most emblematic quote is the now celebrated aphorism

Fuck you, bioinformatics. Eat shit and die.

There is nothing to agree or disagree with in this quote, but Frederick gives further detail about his point of view in the post. In short, he argues that bioinformaticians are bad programmers, and that community-level obfuscation maintains the illusion.

By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%.

There are indeed many issues in the bioinformatics community and I am on Frederick’s side regarding file formats. For instance, I have huge respect for the maintainers of the BAM/SAM format, but here is a quote, straight from the documentation*.

Structure for core alignment information.
typedef struct {
    int32_t tid;
    int32_t pos;
    uint32_t bin:16, qual:8, l_qname:8;
    uint32_t flag:16, n_cigar:16;
    int32_t l_qseq;
    int32_t mtid;
    int32_t mpos;
    int32_t isize;
} bam1_core_t;
Fields
tid
   chromosome ID, defined by bam_header_t
pos
   0-based leftmost coordinate
strand
   strand; 0 for forward and 1 otherwise
bin
   bin calculated by bam_reg2bin()
qual
   mapping quality
l_qname
   length of the query name
flag
   bitwise flag
n_cigar
   number of CIGAR operations
l_qseq
   length of the query sequence (read)

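You do not need to know anything about C to notice that the description does not match the structure: the field list describes strand, which the structure does not declare, and says nothing about mtid, mpos or isize. At some point, the core storage format of BAM changed (just that!) and the old documentation got mixed up with the new one. So much for a planetary standard.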
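For the skeptics, the comparison can also be done mechanically. Here is a minimal sketch in Python, assuming the two lists of names are copied by hand from the quote above; the variable names and the mismatches function are mine and are not part of any BAM/SAM tool.

```python
# Field names declared in the quoted struct, in declaration order.
struct_fields = ["tid", "pos", "bin", "qual", "l_qname", "flag",
                 "n_cigar", "l_qseq", "mtid", "mpos", "isize"]

# Field names described in the quoted "Fields" list, in the same order
# as they appear in the documentation.
documented_fields = ["tid", "pos", "strand", "bin", "qual", "l_qname",
                     "flag", "n_cigar", "l_qseq"]


def mismatches(declared, documented):
    """Return (documented but not declared, declared but not documented)."""
    stale = [name for name in documented if name not in declared]
    missing = [name for name in declared if name not in documented]
    return stale, missing


stale, missing = mismatches(struct_fields, documented_fields)
print("documented but not in the struct:", stale)
print("in the struct but not documented:", missing)
```

The first list holds descriptions left over from an older layout, the second holds fields that nobody bothered to describe.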

But no discussion of bioinformatics nonsense would be complete without a benchmark section. For our last software article, we were asked to run our benchmark against an all-pairs algorithm called slidesort. The original benchmark of slidesort concealed two minor details: that it takes months to return, and that it is not an all-pairs algorithm. The email address of the maintainers being obsolete, we had to put some effort into finding the authors to ask for explanations. The answer was that it was probably a bug. But “bug” is too polite; “software pollution” is more appropriate.

... so why do bioinformatics?

The answer is simple: because it matters. Even though I deeply agree with Frederick, not everything boils down to working with skillful people. The impact of bioinformatics is unacknowledged but visible. How many discoveries started with a BLAST search? How many experiments were possible only because the human genome has been sequenced? Besides, not every problem in bioinformatics is about memory footprint and CPU cycles; in some cases there are lives at stake. Choosing a treatment for cancer patients, deciding upon an abortion based on genotype data, initiating a vaccination campaign... and so much more.

Bioinformatics is biology, and it matters.

__Notes:__ * The text has since been updated.


