One shade of authorship attribution
By Guillaume Filion, filed under planktonrules, Python, machine learning, R, IMDB, series: IMDB reviews, automatic authorship attribution.
"This article is neither interesting nor well written."
Everybody in academia has a story about reviewer 3. If the words above sound familiar, you will definitely know what I mean, but for the others I should give some context. No decent scientific editor will agree to publish an article without taking advice from experts.
This process, called peer review, is usually anonymous and opaque. According to an urban legend, reviewer 1 is very positive, reviewer 2 couldn't care less, and reviewer 3 is a pain in the ass. Believe it or not, the quote above is real, and it is the entire review. Needless to say, it came from reviewer 3.
For a long time I have wondered whether there is a way to trace the identity of an author through the text of a review. What methods do stylometry experts use to identify passages from the Q source in the Bible, or to determine whether William Shakespeare had a ghostwriter?
The 4-gram method
Surprisingly, the best stylistic fingerprints have little to do with literary style. For instance, lexical richness and complexity of the language are very difficult to exploit efficiently. Unconscious foibles, recurrent mistakes and misuse of punctuation betray an author much better, because they tend to be writer-invariant.
A very simple method to extract this information is to count all the 4-grams (the sequences of 4 characters) of a text. For instance, the 4-grams of "to be or not to be" are 'to_b', 'o_be', '_be_', etc. (where the underscore stands for a space), and the 4-grams 'to_b' and 'o_be' occur twice. The idea of this decomposition is that the most frequent words of a text will produce the most frequent 4-grams, and the most frequent mistakes will appear in several 4-grams.
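To make this concrete, here is a short Python snippet (my own illustration, not code from the post) that collects and counts the 4-grams of a string:

```python
from collections import Counter

def four_grams(text):
    """Return the character 4-grams of a text, spaces and punctuation included."""
    return [text[i:i + 4] for i in range(len(text) - 3)]

# In the running example the underscore stands for a plain space.
counts = Counter(four_grams("to be or not to be"))
print(counts["to b"])  # 2
print(counts["o be"])  # 2
```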
In order to catch features such as punctuation errors, mis-capitalization, space omission or space doubling, it is important not to process the text in any way before collecting the 4-grams, which makes this much easier than standard Natural Language Processing (see The elements of style). Somewhat ironically, stop words such as 'and', 'the', etc., usually filtered out because they carry no semantic content, turn out to be the most informative. Every author uses the most common English words with a slightly different frequency, which constitutes his/her fingerprint.
How good is this?
Remember planktonrules from The geometry of style? I extracted the 4-grams from his reviews and collected the 1,000 most frequent as a feature set (planktonrules uses many double and triple spaces, and uses 'film' much more often than 'movie', unlike most authors). I then used R to train a Support Vector Machine on a random selection of 10,000 reviews from a set of 50,000 and tested the model on the 40,000 remaining reviews. The accuracy of such a brute-force approach is surprisingly high: the error rate is around 2%, with a false negative rate of 6.6% and a false positive rate of 0.3%.
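For readers who prefer code to prose, here is a minimal sketch of the pipeline in Python with scikit-learn. It follows the description above (features are the frequencies of the 1,000 most common 4-grams in planktonrules' reviews, training on 10,000 of the 50,000 reviews, testing on the other 40,000), but the helper names, the frequency normalization and the default SVC settings are my own assumptions rather than the R code used for the post.

```python
from collections import Counter
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def four_grams(text):
    return [text[i:i + 4] for i in range(len(text) - 3)]

def make_features(reviews, labels, n_features=1000):
    """Frequencies of the most common 4-grams in planktonrules' reviews."""
    target = Counter()
    for text, label in zip(reviews, labels):
        if label == 1:                      # 1 = written by planktonrules
            target.update(four_grams(text))
    vocab = [g for g, _ in target.most_common(n_features)]
    index = {g: i for i, g in enumerate(vocab)}
    X = np.zeros((len(reviews), len(vocab)))
    for row, text in enumerate(reviews):
        total = max(len(text) - 3, 1)
        for gram, count in Counter(four_grams(text)).items():
            if gram in index:
                X[row, index[gram]] = count / total
    return X, np.array(labels)

def classify(reviews, labels):
    """Train the SVM on 10,000 reviews, test on the remaining 40,000."""
    X, y = make_features(reviews, labels)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=10000, random_state=0)
    model = SVC().fit(X_train, y_train)
    print(confusion_matrix(y_test, model.predict(X_test)))
    return model
```

The confusion matrix printed at the end directly gives the false positive and false negative counts quoted above.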
I tried several other classifiers (logistic regression, LDA, QDA and CART), but SVM always gave the best results. LDA gave a reasonable fit, but the false negative rate never dropped below 15%, no matter how many 4-grams I included. I also tried other feature sets, such as the most common 4-grams of the corpus (not necessarily used by planktonrules), and the 4-grams whose frequency differs the most between planktonrules and the rest of the writers, but the results were not as good.
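Swapping in the other classifiers only takes one line per model. A sketch of the comparison, using the scikit-learn counterparts of the models named above (again my own code, not the one used for the post), could look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

def compare_classifiers(X_train, y_train, X_test, y_test):
    """Fit each model on the same split and report plain accuracy."""
    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "QDA": QuadraticDiscriminantAnalysis(),
        "CART": DecisionTreeClassifier(),
        "SVM": SVC(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))
```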
Does that mean that I can catch the author of the brilliant quote that introduced this post? Not very likely, of course, because the text is very short, and also because I do not have a reference set for peer reviews. The example above also has it easy, in the sense that it is a binary classification problem (planktonrules against everybody else). Building such classifiers for a large set of authors must be substantially more difficult. But we can bet that Google has already done it. With access to your mail and everything you write in Google Docs/Google Drive, they probably have stylistic fingerprints for a large portion of the Internet community. I guess I should work on a stylistic fingerprint eraser then...