On Wednesday, Google announced plans to acquire a startup that helps Web sites combat spam and fraud. Google is investing an undisclosed amount to bring reCAPTCHA into its technology fold to address scanning challenges in the Google Books project.
reCAPTCHA is a free anti-bot service that helps digitize books. The company also provides CAPTCHAs to help protect more than 100,000 Web sites. A CAPTCHA is a program that can detect whether its user is a human or a computer.
CAPTCHAs appear as images with distorted text at the bottom of Web registration forms and are used by many Web sites to prevent abuse from automated programs written to generate spam. But Google sees them as a way to teach computers to read.
Teaching Computers To Read
Luis von Ahn, cofounder of reCAPTCHA, and Google product manager Will Cathcart explained the reCAPTCHA twist: The words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books.
"Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text," von Ahn and Cathcart explained. "In this way, reCAPTCHA's unique technology improves the process that converts scanned images into plain text, known as optical character recognition (OCR)."
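The crowd-teaching idea can be illustrated with a short sketch. It is not reCAPTCHA's actual implementation, which the announcement does not describe. It assumes only that several people type the same scanned word and that a transcription is accepted once enough of them agree. The function name `consensus_transcription` and the thresholds `min_votes` and `min_share` are hypothetical, chosen for illustration.

```python
from collections import Counter
from typing import Optional


def consensus_transcription(answers: list[str], min_votes: int = 3,
                            min_share: float = 0.6) -> Optional[str]:
    """Return the word most typists agree on, or None if agreement is too weak.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values used by reCAPTCHA.
    """
    # Normalize case and whitespace so "Morning " and "morning" count as one answer.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if not cleaned:
        return None
    word, votes = Counter(cleaned).most_common(1)[0]
    # Accept only when enough people answered and a clear majority agrees.
    if votes >= min_votes and votes / len(cleaned) >= min_share:
        return word
    return None


# Typed answers for one scanned word that OCR could not read.
print(consensus_transcription(["morning", "Morning", "morning ", "mourning"]))
```

In this sketch, three of the four typists agree, so "morning" is accepted as the text for that scanned word. With only two conflicting answers, the function returns None and the word would stay unresolved until more people have typed it.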
Now here's the Google-reCAPTCHA connection: OCR also powers large-scale text-scanning projects like Google Books and Google News Archive Search. As Google sees it, having the text version of documents is important because plain text can be searched, easily rendered on devices, and displayed to visually impaired users.
Google plans to apply the reCAPTCHA technology not only to increase fraud and spam protection for Google products but also to improve its book and newspaper scanning process. Google will also continue to allow Web-site owners to use reCAPTCHA free of charge to protect their digital assets.
Between the Lines
Google is embroiled in a legal controversy over its Google Books project. Last October, Google settled a class-action copyright suit filed by the Authors Guild and the Association of American Publishers. But Amazon and Microsoft, among others, are speaking out against the deal, which has not yet been approved in federal court.
"Having an archive of the world's knowledge is not something Google feels is outside the scope of its interests," said Brad Shimmin, an analyst at Current Analysis. "This acquisition is Google saying they are going to continue scanning books because they know they are within their rights to do so, and now they are going to do it better with this technology."
With reCAPTCHA, Shimmin sees opportunities for Google to stand out in the small crowd of players scanning out-of-copyright books. Although it may not seem like an earth-shattering acquisition, Shimmin said, it may help Google compete against Microsoft and Yahoo in the long term.
"With some of the moves Microsoft and Yahoo have been making lately, it's not a done deal that Google is going to be the leading search destination for the next 10 years," Shimmin said. "Google recognizes that danger and is constantly looking to not only broaden its portfolio but also deepen its capabilities in a way that differentiates the company from Microsoft and Yahoo. reCAPTCHA helps that cause."