“Captcha” is the official term for those wavy strings of numbers and letters that you have to decipher before setting up an online email account or gaining access to other types of web sites. The acronym, coined by someone at Yahoo a few years back, stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart. Captchas are intended to separate men from machines in order to prevent spammers and other nasty folks from using automated means to crack into sites.
Problem is, as the Washington Post reports today, the machines keep getting smarter. The spammers are thinking up ever more ingenious ways to break the captchas. It used to be assumed that spammers were somehow deploying people to crack the codes, paying tiny sums to third-world laborers to type in the characters. Google, which has recently suffered attacks on its captcha systems for Gmail and Blogger, still believes that people are doing the work. A Google spokesman tells the Post: “We still believe there is human involvement.”
Security experts, though, are increasingly convinced that the most sophisticated captcha attacks are actually being carried out wholly by machines:
The attack that most clearly signals that computers were solving a CAPTCHA came about a month ago, when Websense detected what appeared to be some malicious traffic from one of its “threat-seeker” honey pots. Once it attracted the malicious code, the decoy sought repeatedly to create Hotmail accounts. Over and over, when it was presented with the Hotmail CAPTCHA, it sent the letter puzzle to another computer. That computer would respond within about six seconds, a speed that leads computer analysts to think the CAPTCHA was being cracked by a computer, not a human.
No one seems to be quite sure, though, how exactly the computers are doing it. And the increasing sophistication of the automated attacks puts site owners in a quandary, as the Post reports: “Microsoft and other Web companies say they are interested in creating human verification tests that are harder for computers to crack. But there’s an inherent difficulty. Making the tests harder for the computer makes them harder for humans, too.” You may outsmart the people before you outsmart the machines.
Which raises a bigger question: What happens if the bad guys get the AI first?
I’m dubious that it’s all-machine. This is the standard superhero-comic problem: “Why do crooks who are super-geniuses or have ultra-high technology waste their time robbing banks and brawling?” Anyone who has developed image-reading technology beyond the state of the art is not going to keep it secret and reserve it for spammers. They could make far more money licensing it for OCR.
Don’t believe everything you read in the papers. I suspect some CAPTCHAs are known to be cracked with well-understood attacks, but the companies affected are not going to say that.
Also, six seconds is too much time to rule out human intervention. I can “solve” most CAPTCHAs in a few seconds. I’ve read people are farming these tasks out “mechanical turk” style to people who do them for hours on end.
There is also the solve-this-puzzle-to-access-more-pornography technique as explained by Luis von Ahn in the Google Talk video linked from his home page.
(He and his CMU colleagues were the ones who coined the term; the first system was used at Yahoo.)
I’m just wondering: Has there been any actual documentation of these captcha-solving sweatshops or the systems they use?
I’ve always preferred the xkcd captcha myself.
KittenAuth
Pretty hard to do automatically. Doesn’t solve the free porn issue though….
Cory Doctorow had a similar idea in a short story he wrote called “I, Row-Boat” (a takeoff on Asimov’s “I, Robot”):
“Spam-filters, actually. Once they became self-modifying, spam-filters and spam-bots got into a war to see which could act more human, and since their failures invoked a human judgement about whether their material were convincingly human, it was like a trillion Turing-tests from which they could learn. From there came the first machine-intelligence algorithms, and then my kind.”
http://www.flurb.net/1/doctorow.htm
Nick, there’s a typo in the text: “Googe”.
They are probably training a neural network on data sets obtained from human users: each graphic image paired with the letter sequence a person typed for it. It’s the kind of stuff that graduate students have been doing for years. Unless the captcha’s designers vary the generating algorithm in subtle ways over a long period of time, the net, given a large enough data set, would probably be able to learn the patterns and crack it pretty easily. Even with their limitations, nets can be pretty good at finding subtle patterns that human programmers subconsciously coded in but didn’t realize were there.
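To make the idea concrete, here is a toy sketch of learning from (image, label) pairs. It is not a claim about how any real attack works: the 5x5 glyphs, the noise level, and every function name are made up for illustration, and a single-layer perceptron stands in for whatever network an attacker might actually use.

```python
import random

# Hypothetical 5x5 glyphs standing in for captcha characters (assumption).
GLYPHS = {
    "L": ["10000", "10000", "10000", "10000", "11111"],
    "T": ["11111", "00100", "00100", "00100", "00100"],
    "O": ["11111", "10001", "10001", "10001", "11111"],
}
LABELS = sorted(GLYPHS)
N_PIXELS = 25


def noisy_sample(label, rng, flip_prob=0.05):
    """Return the glyph as a flat 0/1 pixel list with some pixels flipped."""
    pixels = [int(ch) for row in GLYPHS[label] for ch in row]
    return [1 - p if rng.random() < flip_prob else p for p in pixels]


def make_dataset(n, rng):
    """Build n (pixels, label) pairs, playing the role of human-labeled data."""
    labels = [rng.choice(LABELS) for _ in range(n)]
    return [(noisy_sample(label, rng), label) for label in labels]


def predict(weights, pixels):
    """Pick the label whose weight vector (plus bias) scores highest."""
    def score(label):
        w = weights[label]
        # w[-1] is the bias; zip stops after the 25 pixel weights.
        return w[-1] + sum(wi * xi for wi, xi in zip(w, pixels))
    return max(LABELS, key=score)


def train(dataset, epochs=20):
    """Multiclass perceptron: adjust weights only when a guess is wrong."""
    weights = {label: [0.0] * (N_PIXELS + 1) for label in LABELS}
    for _ in range(epochs):
        for pixels, label in dataset:
            guess = predict(weights, pixels)
            if guess != label:
                for i, x in enumerate(pixels):
                    weights[label][i] += x
                    weights[guess][i] -= x
                weights[label][-1] += 1
                weights[guess][-1] -= 1
    return weights


def accuracy(weights, dataset):
    """Fraction of samples in the data set that are labeled correctly."""
    return sum(predict(weights, p) == label for p, label in dataset) / len(dataset)


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = train(make_dataset(300, rng))
test_accuracy = accuracy(weights, make_dataset(200, rng))
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Real captchas add warping, clutter, and segmentation problems that this toy ignores, so it only shows the training loop, not the hard part.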
I sincerely believe that trivia questions could be a useful way to resolve that issue. KittenAuth gives a good idea of the many ways you can vary the test format to defeat most computers.
What a delightful quandary – it’s getting so the science fiction writers can’t keep up with reality!
Isaac Asimov and Philip K. Dick must be chortling in their graves, while William Gibson seems to be just giving up on the future part of science fiction and writing novels set in the present.
Speaking of captchas, here are the top 10 worst ones if you want something to laugh about…
Top 10 Worst Captchas