You know how important it is to rank high in search results. But your
site isn't reaching the first three pages of results, and you don't
understand why. It could be that you're confusing the web crawlers that
are trying to index it. How can you find out? Keep reading.
You have a masterful website, with lots of relevant content, but it
isn’t coming up high in the search engine results pages. You know
that if your site isn’t on those early pages, searchers probably
won’t find you. You can’t understand why you’re apparently
invisible to Google and the other major search engines. Your rivals
hold higher spots and their sites aren’t nearly as nice as yours.
Search engines aren’t people. In order to handle the tens of billions
of web pages that make up the World Wide Web, search engine companies
have almost completely automated their processes. A software program
isn’t going to look at your site with the same “eyes”
as a human being. This doesn’t mean that you can’t have
a website that is a joy to behold for your visitors. But it does mean
that you need to be aware of the ways in which search engines “see”
your site differently, and plan around them.
Despite the complexity of the web, and the need to handle all that data
at speed, search engines perform just four operations in order to return
relevant results to their users. Each of these four operations can go
awry in certain ways. It isn’t so much that the search engine itself
has malfunctioned; it may simply have encountered something that it was
not programmed to deal with. Or the way it was programmed to deal with
whatever it encountered led to less-than-desirable results.
Understanding how search engines operate will help you understand what
can go wrong. All search engines perform the following four tasks:
• Web crawling. Search engines send out automated
programs, sometimes called “bots” or “spiders,”
which use the web’s hyperlink structure to “crawl”
its pages. By some estimates, search engine spiders have crawled only
about half of the pages that exist on the web.
• Document indexing. After spiders crawl a page,
its content needs to be put into a format that makes it easy to retrieve
when a user queries the search engine. Thus, pages are stored in a giant,
tightly managed database that makes up the search engine’s index.
These indexes contain billions of documents, and matching results are
delivered to users in mere fractions of a second.
• Query processing. When a user queries a search
engine, which happens hundreds of millions of times each day, the engine
examines its index to find documents that match. Queries that look superficially
the same can yield very different results. For example, searching for
the phrase “field and stream magazine,” without quotes around
it, yields more than four million results in Google. Do the same search
with the quotation marks, which tell Google to match only that exact
phrase, and it returns only 19,600 results. This is just one of many
modifiers a searcher can use to give the search engine a better idea of
what should count as a relevant result.
• Ranking results. Google isn’t going to
show you all 19,600 results on the same page – and even if it
did, it would still need some way to decide which ones should show up first. Thus,
the search engine runs an algorithm on the results to calculate which
ones are most relevant to the query. These are shown first, with all
the others in descending order of relevance.
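To make the four tasks concrete, here is a toy sketch in Python. It is not how Google or any real search engine is built. The four-page "web" in PAGES, the a.example URLs, and the function names crawl, build_index and search are all invented for illustration, and the ranking rule (count how often the query words appear) is a deliberately crude stand-in for a real relevance algorithm.

```python
from collections import defaultdict, deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("field and stream magazine subscriptions",
                   ["a.example/gear", "a.example/news"]),
    "a.example/gear": ("field gear for the stream and the field",
                       ["a.example/"]),
    "a.example/news": ("magazine news from the field", []),
    # No page links here, so a link-following spider never finds it.
    "a.example/orphan": ("field and stream magazine archive", []),
}

def crawl(seed):
    """Task 1: follow hyperlinks breadth-first, starting from a seed URL."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        yield url, text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)

def build_index(crawled):
    """Task 2: build an inverted index, word -> {url: [word positions]}."""
    index = defaultdict(dict)
    for url, text in crawled:
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(pos)
    return index

def search(index, query, phrase=False):
    """Tasks 3 and 4: match pages against the query, then rank them."""
    words = query.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Task 3: keep only pages that contain every query word.
    urls = set(index[words[0]])
    for w in words[1:]:
        urls &= set(index[w])
    if phrase:
        # Quoted search: the words must also appear adjacent and in order.
        urls = {u for u in urls
                if any(all(p + i in index[w][u] for i, w in enumerate(words))
                       for p in index[words[0]][u])}
    # Task 4: most query-word occurrences first, ties broken by URL.
    return sorted(urls, key=lambda u: (-sum(len(index[w][u]) for w in words), u))

index = build_index(crawl("a.example/"))
print(search(index, "field and stream"))               # both matching pages
print(search(index, "field and stream", phrase=True))  # exact phrase only
```

In this sketch the unquoted query returns both a.example/gear and a.example/, while the quoted version returns only a.example/, which mirrors the gap between four million and 19,600 results described above. The orphan page never shows up at all, because no link leads a spider to it.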
Now that you have some idea of the processes involved, it’s time
to take a closer look at each one. This should help you understand how
things go right, and how and why these tasks can go “wrong.”
This article will focus on web crawling, while a later article will
cover the remaining processes.