mirror of https://git.sr.ht/~seirdy/seirdy.one synced 2024-12-18 06:42:10 +00:00
seirdy.one/content/posts/stylometric-fingerprinting-redux.gmi
2022-07-15 17:17:26 -07:00

This is an expanded version of a microblog entry:
=> https://seirdy.one/notes/2022/06/24/stylometric-fingerprinting-resistance/ Stylometric fingerprinting resistance.
It contains everything in that entry and more.
## Introduction
Following a recent landmark SCOTUS ruling, many have been trying to publish resources to help people find reproductive healthcare. They often wish to publish anonymously, to avoid being harassed or doxxed by overzealous religious fundamentalists. Some people asked me for help.
There's no shortage of guides on how to stay anonymous online. I recommend using the Tor Browser in a disposable Whonix VM. The Whonix Wiki has a good guide to anonymous publishing:
=> https://www.whonix.org/wiki/Surfing_Posting_Blogging Surfing Posting Blogging
Few guides cover stylometric fingerprinting.
Stylometric fingerprinting analyzes unique writing style (i.e., it uses stylometry) to identify the author of a work. It's one of the most common techniques for de-anonymization, used by adversaries ranging from trolls to law enforcement.
For a TL;DR, skip to the “Conclusion” section.
## How stylometric fingerprinting works
To paint with a broad brush, we can divide stylometric fingerprinting into machine- and human-driven techniques.
* Machine-driven techniques: These techniques involve analysis of reading level metrics, unusual words, machine-identifiable grammatical and spelling errors, and statistical analysis of writing style. A great amount of recent research studies statistical analysis of writing style; it's a rapidly evolving field.
* Human-driven techniques: There are some areas in which manual analysis still beats computers. Someone you know may recognize your writing style.
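To make "statistical analysis of writing style" concrete, here is a minimal sketch of one classic idea: compare how often two texts use common function words. The word list, the `profile` helper, and the choice of cosine similarity are assumptions made for illustration; real attribution systems use far more features and more careful statistics.
```python
import math
import re
from collections import Counter

# A tiny, hypothetical feature set; real systems use hundreds of features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but", "however", "which"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```
Comparing `profile(anonymous_text)` against profiles of known writing gives a crude ranking of candidate authors. Short texts give noisy results, so treat any single score with suspicion.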
## Defense against machine-driven techniques
Common advice is to use offline machine translation to translate works to and from another language. Two options that come to mind:
=> https://www.argosopentech.com/ Argos Translate
=> https://marian-nmt.github.io/ Marian
### Obfuscation and imitation
This paper shows that machine translation alone isn't nearly as strong a method as manual approaches:
=> https://doi.org/10.1145/2382448.2382450 Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity
Manual approaches include obfuscation (hiding your writing style) or imitation (mimicking another author). These approaches have excellent success rates, even among amateur writers. The aforementioned Whonix wiki page lists common stylometric fingerprinting vectors for manual approaches to address.
Limiting unusual vocabulary and sentence structure makes for a good start. Using a comprehensive and highly opinionated style guide should also help. The Economist has a good one that was specifically written to make all authors sound the same:
=> https://cdn.static-economist.com/sites/default/files/pdfs/style_guide_12.pdf The Economist Style Guide, 12th edition (application/pdf)
Nearly every respected newspaper has a style guide. Tech companies typically publish or use a style guide for technical writing. I'm a fan of Red Hat's style guide:
=> https://stylepedia.net/style/5.1/ Red Hat Technical Writing Style Guide, Edition 5.1.
Good style guides are long, and learning one is tough. Read a large amount of conforming content to help yourself internalize a guide's rules.
### Reading levels
The reading difficulty of a piece is a coarse but extremely simple measure. If the anonymous author of a piece writes at a tenth-grade level, their other writings are likely near a tenth-grade level too.
Check your content's reading level using common readability metrics. The Flesch-Kincaid Grade Level is one of the most popular. If you write at a given grade level, publish anonymously at a lower grade level. I recommend using a lower grade level rather than a higher one for two reasons:
1. Artificially inflating the reading level generally produces cringeworthy writing.
2. Writing at a lower level carries less risk of introducing uncommon fingerprintable words.
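The Flesch-Kincaid Grade Level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. Below is a rough sketch of that formula. The syllable counter is a naive vowel-group heuristic of my own choosing, not what any particular readability tool uses, so expect scores to differ slightly from other implementations.
```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Estimate the Flesch-Kincaid Grade Level of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```
Run it on both your usual writing and your anonymous draft, and compare the two numbers rather than trusting either one in isolation.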
### Other tips
For any inexperienced writers: opinionated offline grammar checkers may supplement a manual approach by normalizing any distinguishing “errors” in your language. Two of my favorites:
=> https://github.com/languagetool-org/languagetool LanguageTool
=> https://github.com/redpen-cc/redpen RedPen
That being said, nothing beats a human editor. Find someone you trust to review your work.
Here are some resources for further reading:
=> https://www.orphanalytics.com/en/method "How we authenticate a document" by OrphAnalytics SA.
=> https://serhack.me/articles/unveiling-anonymous-author-stylometry-techniques/ "Unveiling the Anonymous Author: Stylometry Techniques" by SerHack.
## Defense against human-driven techniques
Human-driven techniques were behind the most high-profile successful cases of stylometric fingerprinting by law enforcement. When the Unabomber published his manifesto, his brother recognized some signature phrases in it and went to the FBI. One of the largest Australian CSA cases involved recognizing someone's distinctive greeting of “Hiya”:
=> https://www.nowtolove.com.au/news/real-life/the-online-child-sex-abuse-epidemic-9139 Australian Women's Weekly: The online CSA epidemic
While the research I cited seems to indicate that machine translation is a poor way to thwart machine-driven techniques, I'm not convinced that it's useless. Machine translation may be ineffective at thwarting machine-driven stylometric fingerprinting, but it should be able to highlight or remove uniqueness a human could notice. Let's look at some examples.
### Example: Unabomber manifesto
Here's a sentence from the Unabomber manifesto:
> But it is obvious that modern leftist philosophers are not simply cool-headed logicians systematically analyzing the foundations of knowledge.
David Kaczynski noticed the phrase “cool-headed logicians” as a potential signature identifying his brother. After further analysis, the author's identity was clear.
> After I read the first few pages, my jaw literally dropped. One particular phrase disturbed me. It said modern philosophers were not “cool-headed logicians.” Ted had once said I was not a “cool-headed logician,” and I had never heard anyone else use that phrase…We had Linda's childhood friend Susan Swanson, a private investigator, take two of Ted's letters to a linguistics expert, who analyzed them and found there was an 80 percent chance the same person had written the letters and the manifesto.
=> https://people.com/archive/blood-bond-vol-50-no-4/ David Kaczynski, Blood Bond
Let's run it through Argos Translate's English-Esperanto filter, and then translate it back to English:
> But it is evident that modern left-wing philosophers are not simply dehydrated logicians systematically analyzing the foundations of knowledge.
“Dehydrated logicians” obviously isn't the right term to use here. The fact that the translation algorithm mistook “cool-headed” for “dehydrated” indicates that “cool-headed” was an unusual choice of words to use in this context, and therefore a fingerprinting vector. A more common phrase like “detached” would be safer.
(I didn't pick Esperanto for any particular reason.)
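Spotting what a round trip changed gets tedious with longer texts. Here is a small sketch that diffs the original against the round-tripped output word by word, using Python's standard `difflib`. The `changed_phrases` helper is a hypothetical name, and the translation step itself is left out: the two strings are simply the sentences quoted above.
```python
import difflib
import re


def changed_phrases(original, round_tripped):
    """Return (before, after) pairs for each run of words the round trip altered."""
    a = re.findall(r"[\w'-]+", original)
    b = re.findall(r"[\w'-]+", round_tripped)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = (
    "But it is obvious that modern leftist philosophers are not simply "
    "cool-headed logicians systematically analyzing the foundations of knowledge."
)
round_tripped = (
    "But it is evident that modern left-wing philosophers are not simply "
    "dehydrated logicians systematically analyzing the foundations of knowledge."
)

# Print every word or phrase that did not survive the round trip.
for before, after in changed_phrases(original, round_tripped):
    print(f"{before!r} -> {after!r}")
```
Not every reported change signals a fingerprint; harmless synonym swaps show up too. The changes worth a second look are the ones where the meaning drifts, as with “cool-headed”.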
### Extreme example: Mario
Let's look at another phrase:
> Whoohoo! It's a-me, Mario!
“Mario” is a common name, but I have a feeling you know that this “Mario” is the Nintendo character. Running this line through an English-Esperanto and Esperanto-English filter gives us:
> Who! It's me, Mary!
Our findings:
* “Whoohoo” is uncommon enough to be unrecognized by the translation algorithm. Replace or remove it.
* Saying “a-me” is less common than saying “me”.
* “Mario” is less common than “Mary” in an English-speaking context.
Additionally, inflection (e.g., using an exclamation point) can make your writing voice recognizable. Use it with care.
Let's rephrase Mario's line to be less identifiable:
> Hey, it's me. Mario.
That's a little better.
## Example: my fingerprint
I have my share of fingerprintable stylistic and formatting quirks. A non-exhaustive list:
* Qualifying lists with words like “non-exhaustive” or “incomplete”.
* Putting punctuation outside quotations or hyperlinks whenever acceptable.
* Using many colons, semicolons and conjunctive adverbs; for example, this sentence uses the latter two.
* Obsessively distinguishing between concepts and implementations of those concepts, resulting in a statistically high use of the word “implementation”.
* Introducing some articles with HTML description lists before elaborating on each entry in that list.
* Using soft-hyphens so text can wrap to narrow widths.
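A list like the one above can be turned into a rough self-check. This sketch counts a few such quirks per 1,000 words. The pattern names and regular expressions are hypothetical stand-ins loosely based on my list; you would replace them with your own quirks.
```python
import re

# Hypothetical patterns; swap in the quirks you find in your own writing.
PATTERNS = {
    "list qualifier": r"\b(?:non-exhaustive|incomplete)\b",
    "semicolon": r";",
    "colon": r":",
    "the word 'implementation'": r"\bimplementations?\b",
    "soft hyphen": "\u00ad",
}


def quirk_rates(text):
    """Return occurrences of each pattern per 1,000 words of text."""
    words = len(re.findall(r"\w+", text)) or 1  # avoid dividing by zero
    return {
        name: 1000 * len(re.findall(pattern, text, flags=re.IGNORECASE)) / words
        for name, pattern in PATTERNS.items()
    }
```
Compare the rates for an anonymous draft against your public writing. A quirk that shows up at a similar rate in both is a candidate for removal.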
## Anonymous writing is a skill
Developing a high-quality writing style is hard. Developing and switching between multiple different styles is even harder. It takes time to master multiple style guides, cultivate a different "voice" free of your usual idiosyncrasies, and keep your different voices from influencing one another. "Getting into character" won't happen overnight if you don't have experience.
Translation tools are training-wheels to help you learn to identify idiosyncrasies. As you get better, you might be able to filter out problematic language without their help.
Don't feel pressured to start soon! If you want to publish under a truly anonymous pseudonym, you should first hone your craft. Alternatively, find a good editor (I am not the best person for this job, sorry).
## Conclusion
If you wish to write anonymously, I recommend the following:
1. Analyze your non-anonymous writing, looking for patterns. Write these patterns down. When you review your anonymous publications, refer to your list of identifiable patterns and remove them. Consider asking someone else for help, since your own bias might cause you to overlook some patterns.
2. Translate your work to and from another language with an offline machine translator. Don't actually publish the translator's output; instead, read the output to identify potentially fingerprintable language you missed in the previous step.
3. Correct any grammatical or spelling errors; mistakes you make might be unique.
4. Conform to an opinionated style guide that you don't normally use in your other writing.
5. If, after doing the above, your anonymous work still has a reading level too similar to your non-anonymous writing: reduce the reading level by choosing simpler words.
Stay safe, everyone.