Google Makes PDF Files Searchable
By Mark Long
October 31, 2008 12:34PM

Google is using optical character recognition software to allow users to search Web-based PDF files. Google says the OCR technology will not only be used for indexing and searching PDFs, but may also be used for Google Book Search. Google's use of OCR for PDFs underscores the need for search engines to be able to scan multimedia content.
 



Google has rarely included scanned documents in its search results because it had no way to determine the nature of the content, but that's about to change. The search engine giant says it will use optical character recognition (OCR) software to make it possible for Web surfers to search any Web-hosted document stored in the PDF file format developed by Adobe Systems.

Google is using the technology to convert scanned documents into equivalent text files that can be searched, indexed and returned as responses to Google search queries, noted Evin Levey, a Google product manager.
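As a rough illustration of the pipeline Levey describes, the sketch below assumes the OCR step has already produced plain text for each scanned PDF and shows how that text could then be indexed and matched against queries. The file names, sample text and function names are hypothetical, and this is a minimal inverted index, not a description of how Google's system actually works.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_texts):
    """Map each term to the set of documents whose OCR text contains it."""
    index = defaultdict(set)
    for doc_id, text in ocr_texts.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


# Hypothetical OCR output: the text recovered from two scanned PDFs.
ocr_texts = {
    "annual-report.pdf": "Annual report for fiscal year 2007",
    "field-notes.pdf": "Scanned field notes on river erosion",
}

index = build_index(ocr_texts)
print(search(index, "annual report"))
```

Without the OCR step there would be no text to feed into `build_index`, which is why scanned documents were previously invisible to text-based search.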

"This is a small but important step forward in our mission of making all the world's information accessible and useful," Levey said.

A Boon for Books

The company's brute-force application of OCR technology to the Web is also expected to aid Google Book Search -- the ambitious and controversial book-scanning project that the search engine giant first unveiled at the 2004 Frankfurt Book Fair. Ever since, Google has been scanning the book collections at the world's major libraries at a rate of 3,000 book titles per day.

Though the project initially raised copyright concerns, Google has just concluded an agreement with the Authors Guild and the Association of American Publishers under which Google will be able to expand online access to millions of in-copyright books and other written materials in the United States. The agreement resolves lawsuits that had challenged Google's plan to digitize, search and show snippets of in-copyright books and to share digital copies with libraries without the explicit permission of the copyright owner.

Google's Chief Legal Officer David Drummond says the agreement is truly groundbreaking for three reasons. First, it will give readers online access to millions of in-copyright books for the very first time.

"Second, it will create a new market for authors and publishers to sell their works," Drummond explained. "And third, it will further the efforts of our library partners to preserve and maintain their collections while making books more accessible to students, readers and academic researchers."

Pursuing the Holy Grail

Given the continuing exponential growth of multimedia on the Web, however, the text-based nature of today's search-engine technology is clearly inadequate. That's because current-generation search engines can only locate multimedia material that has been tagged in text -- a cumbersome, time-consuming process that content producers often overlook.

This explains why a number of researchers are in hot pursuit of the Holy Grail of search -- the means whereby search engine providers can directly scan multimedia content and match results to search queries and the ad placement requests of their customers. Adobe Systems has already taken a step along the road to producing the next generation of search technology.

In July, the company revealed that it had optimized its Adobe Flash Player technology to enable search engines to index multimedia content produced in the Flash file format -- content that previously had been undiscoverable.

"We are initially working with Google and Yahoo to significantly improve search of this rich content on the Web," explained David Wadhwani, Adobe's vice president. "And we intend to broaden the availability of this capability to benefit all content publishers, developers and end users."
 

© Copyright 2000-2010 NewsFactor Network. All rights reserved.