Why do bioinformatics?
By Guillaume Filion, filed under software pollution, benchmark, bioinformatics.
I never planned to do bioinformatics. It just happened because I liked to spend time in front of my computer and my boss was OK with it. Still, like every sane individual, I sometimes think that I should do something else with my life, and I wonder whether I am doing the right thing. On this topic, I recently came across the famous farewell to bioinformatics by Frederick J. Ross, which is worth reading, and from which the most emblematic quote is the now celebrated aphorism
Fuck you, bioinformatics. Eat shit and die.
There is nothing to agree or disagree with in this quote, but Frederick gives further detail about his point of view in the post. In short, bioinformaticians are bad programmers, and community-level obfuscation maintains the illusion.
By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%.
There are indeed many issues in the bioinformatics community and I am on Frederick’s side regarding file formats. For instance, I have huge respect for the maintainers of the BAM/SAM format, but here is a quote, straight from the documentation*.
Structure for core alignment information.

    typedef struct {
        int32_t tid;
        int32_t pos;
        uint32_t bin:16, qual:8, l_qname:8;
        uint32_t flag:16, n_cigar:16;
        int32_t l_qseq;
        int32_t mtid;
        int32_t mpos;
        int32_t isize;
    } bam1_core_t;

Fields

    tid       chromosome ID, defined by bam_header_t
    pos       0-based leftmost coordinate
    strand    strand; 0 for forward and 1 otherwise
    bin       bin calculated by bam_reg2bin()
    qual      mapping quality
    l_qname   length of the query name
    flag      bitwise flag
    n_cigar   number of CIGAR operations
    l_qseq    length of the query sequence (read)
You do not need to know anything about C to notice that the description does not match the structure: it documents a strand field that the structure does not contain. At some point, the core storage format of BAM changed (just that!) and the old documentation got mixed up with the new one. So much for a planetary standard.
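For the skeptical reader, the mismatch can also be checked mechanically. The following is only a minimal sketch: the two name lists are copied by hand from the excerpt quoted above, and the helpers `contains` and `missing_from` are hypothetical names made up for this illustration, not part of samtools or any other library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Member names of bam1_core_t, copied by hand from the quoted struct. */
const char *const struct_fields[] = {
    "tid", "pos", "bin", "qual", "l_qname", "flag",
    "n_cigar", "l_qseq", "mtid", "mpos", "isize"
};

/* Names described in the quoted "Fields" list, also copied by hand. */
const char *const documented_fields[] = {
    "tid", "pos", "strand", "bin", "qual", "l_qname",
    "flag", "n_cigar", "l_qseq"
};

/* Number of elements in a true array (not a pointer). */
#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Return 1 if name occurs in list[0..n-1], 0 otherwise. */
int contains(const char *const *list, size_t n, const char *name)
{
    assert(list != NULL && name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(list[i], name) == 0)
            return 1;
    }
    return 0;
}

/* Store in out[] the names of a[] that are absent from b[] and return
 * how many were stored. The caller must provide room for na pointers. */
size_t missing_from(const char *const *a, size_t na,
                    const char *const *b, size_t nb,
                    const char **out)
{
    assert(out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < na; i++) {
        if (!contains(b, nb, a[i]))
            out[k++] = a[i];
    }
    return k;
}
```

Calling `missing_from` with the documented names first reports `strand`, which is described but does not exist in the structure. Calling it the other way round reports `mtid`, `mpos` and `isize`, which exist but are not described, at least in the excerpt quoted here.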
But no discussion of bioinformatics nonsense would be complete without a benchmark section. In our last software article, we were asked to run our benchmark against an all-pairs algorithm called slidesort. The original benchmark of slidesort concealed two minor details: that it takes months to return, and that it is not an all-pairs algorithm. Since the maintainers' email address was obsolete, we had to put some effort into finding the authors to ask for explanations. The answer was that it was probably a bug. But “bug” is too polite; “software pollution” is more appropriate.
... so why do bioinformatics?
The answer is simple: because it matters. Even though I deeply agree with Frederick, not everything boils down to working with skillful people. The impact of bioinformatics is unacknowledged but visible. How many discoveries started with a BLAST search? How many experiments were possible only because the human genome is sequenced? Besides, not every problem in bioinformatics is about memory footprint and CPU cycles; in some cases there are lives at stake. Choosing a treatment for cancer patients, deciding upon an abortion based on genotype data, initiating a vaccination campaign... and so much more.
Bioinformatics is biology, and it matters.