You heard it a couple of years ago: The World Wide Web is far deeper and more
vast than previously imagined. That is terrific news. But where are all those
fascinating, hidden pockets of information that are supposedly out there,
and why can't we find them on Yahoo?
"No one can argue with the fact the Web is huge and it continues to
grow at an astronomical rate," Giga Information
Group analyst Laura Ramos told NewsFactor.
"Really, what people are dealing with is how to harvest the valuable information out
there, because there is a lot of good stuff and there is a lot of junk," Ramos said.
Billions More Documents
Experts estimate that the "surface Web" contains 1 billion to 2 billion
documents, while the "deep Web" could contain as many as 550 billion. Put another
way, the surface Web contains about 19 terabytes of information, while the deep Web
contains about 7,500 terabytes.
A terabyte is a measure of data storage. One terabyte is the equivalent of about 1,600
CDs or 1,000 gigabytes.
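To put those estimates side by side, the arithmetic can be sketched in a few lines of Python. The figures are the ones quoted above; the variable names are illustrative, and the results are only as good as the estimates behind them.

```python
# Figures quoted in the article (estimates, not measurements).
SURFACE_TB = 19                    # surface Web size, in terabytes
DEEP_TB = 7_500                    # deep Web size, in terabytes
SURFACE_DOCS_LOW = 1_000_000_000   # low estimate of surface Web documents
SURFACE_DOCS_HIGH = 2_000_000_000  # high estimate of surface Web documents
DEEP_DOCS = 550_000_000_000        # upper estimate of deep Web documents
CDS_PER_TB = 1_600                 # the article's rough CD equivalence

# How many times larger the deep Web is, measured by storage.
size_ratio = DEEP_TB / SURFACE_TB

# How many times larger it is by document count, for both surface estimates.
doc_ratio_low = DEEP_DOCS / SURFACE_DOCS_HIGH
doc_ratio_high = DEEP_DOCS / SURFACE_DOCS_LOW

# The deep Web expressed in CDs, using the article's conversion.
deep_in_cds = DEEP_TB * CDS_PER_TB

print(f"Size ratio: about {size_ratio:.0f} to 1")
print(f"Document ratio: {doc_ratio_low:.0f} to {doc_ratio_high:.0f} to 1")
print(f"Deep Web in CDs: about {deep_in_cds:,}")
```

By these numbers, the deep Web would be several hundred times the size of the surface Web however it is counted.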
There are more than 200,000 deep Web sites, more than half of which are located in
topic-specific databases. About 95 percent of information on the deep Web is
available to the public and is not subject to subscription fees.
Why Is It Hidden?
Many surfers cannot find these sites, however, because few other pages
link to them.
Full-text search engines get their listings in one of two ways: Site developers can
submit addresses to a search engine, asking to be indexed; or a search engine
can use "spiders," which depend on links from existing sites to discover new ones.
While there is a huge amount of information on the deep Web, much of it is valuable
primarily to researchers, scholars and the merely curious, so few, if any, other sites
link to it. Without such links, search engines can find these sites only by chance.
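The link-following behavior described above can be sketched as a breadth-first walk over a toy link graph. This is a minimal illustration, not how any particular search engine works; the page names and the `crawl` function are hypothetical.

```python
from collections import deque


def crawl(link_graph, seeds):
    """Return every page reachable by following links from the seed pages."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # Follow each outgoing link that has not been visited yet.
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site map: each page lists the pages it links to.
link_graph = {
    "portal": ["news", "sports"],
    "news": ["portal", "weather"],
    "sports": [],
    "weather": [],
    "archive": ["portal"],  # links out, but no other page links to it
}

indexed = crawl(link_graph, ["portal"])
missed = set(link_graph) - indexed

# If the developer submits the address, the spider can start there too.
indexed_after_submission = crawl(link_graph, ["portal", "archive"])
```

In this sketch the spider starting at "portal" never reaches "archive," because no page links to it. Submitting the address directly, the other route to a listing, closes that gap.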
Also, more and more information is being stored by governments, universities and
corporations in monster databases. These databases cannot be accessed by
conventional search engines, which identify "static" pages rather than the "dynamic"
pages used by large databases. Information in such databases can be accessed only
by a direct query.
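A small sketch may help show the difference between a "static" page and a "dynamic" one. Everything here is hypothetical: the site layout, the records and the `fetch` function are invented for illustration and do not describe any real database.

```python
# Hypothetical site: one stored page with a search form, plus records
# that live only inside a database.
STATIC_PAGES = {
    "/": "<form action='/search'><input name='q'></form>",
}
DATABASE = {
    "soil survey 1998": "Soil survey results, 1998",
    "census table 12": "Population by county",
}


def fetch(path, query=None):
    """Return a stored page, or build a results page on demand for a query."""
    if path == "/search":
        # The page does not exist until someone supplies a query.
        hits = [text for key, text in DATABASE.items() if query and query in key]
        return "<ul>" + "".join(f"<li>{hit}</li>" for hit in hits) + "</ul>"
    return STATIC_PAGES.get(path, "")


crawler_view = fetch("/")                # what a link-following spider sees
user_view = fetch("/search", "census")   # what a person typing a query sees
```

The spider sees only the empty search form, while the record appears only on a page generated in response to a query.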
Theoretically, search engines create and maintain their own databases in an effort to
index the entire Web. But even the biggest and best search engines can index only
between one-third and one-half of all publicly available documents.