Most of the stumbling blocks above are ones you may have accidentally
put in the way of spiders. This next set includes some that website
owners might use on purpose to block a search engine spider. I mentioned
one of the most obvious reasons for blocking a spider above (content
that users must pay to see), but there are certainly others. For example,
the content itself might be free but not meant to be easily available
to everyone.
Pages that can be accessed only after filling out a form and hitting
“Submit” might as well be closed doors to spiders. Think
of spiders as unable to push buttons or type. Likewise, pages that
can be reached only through a drop-down menu might not be spidered, and
the same holds true for documents that can be reached only via a search
box.
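To make the "no buttons, no typing" idea concrete, here is a minimal Python sketch of how a very simple spider might read a page. It is only an illustration, not how any particular search engine works: the `LinkCollector` class, the `visible_links` helper, and the sample page are all made up for this example. The collector follows plain `<a href>` links and ignores everything else.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, as a very simple spider might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only plain anchors are followed; forms, selects, and inputs are ignored.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page: one plain link, one drop-down menu, one search form.
SAMPLE_PAGE = """
<a href="/products.html">Products</a>
<select onchange="location = this.value">
  <option value="/archive-2004.html">2004 archive</option>
</select>
<form action="/search" method="post">
  <input type="text" name="q">
  <input type="submit" value="Submit">
</form>
"""


def visible_links(html):
    """Return the links this simplified spider would be able to follow."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

Calling `visible_links(SAMPLE_PAGE)` on this sample finds only the products page. The archive page behind the drop-down menu and anything behind the search form never show up, because nothing in the page links to them directly.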
Documents that are purposefully blocked will usually not be spidered.
You can block them with a robots meta tag in the page itself or with a
robots.txt file at the root of the site. You can find other articles
that discuss the robots.txt file on SEO Chat.
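If you want to check what a robots.txt file actually blocks, Python's standard `urllib.robotparser` module can evaluate the rules for you. The sketch below is a rough illustration under stated assumptions: the rules in `ROBOTS_TXT`, the `ExampleBot` user agent, the example.com URLs, and the `is_allowed` helper are all invented for this example. The rules are parsed from a string, so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /drafts/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


home_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html")
members_ok = is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/members/profile.html")
```

With these made-up rules, the home page is allowed and anything under /members/ is not. The per-page alternative is a robots meta tag in the page's head, such as `<meta name="robots" content="noindex, nofollow">`.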
Pages that require a login block search engine spiders. Remember the
“spiders can’t type” observation above. Just how are
they going to log in to get to the page?
Finally, I’d like to make a special note of pages that redirect
before showing content, so that the spider is given one page while the
visitor ends up on another. Not only will that keep your page from being
indexed, it could get your site banned. Search engines refer to this
tactic as “cloaking” or “bait-and-switch.” You can check
Google’s guidelines for webmasters (http://www.google.com/intl/en/webmasters/guidelines.html)
if you have any questions about what is considered legitimate and what
isn’t.
Now that you know what will make spiders choke, how do you encourage
them to go where you want them to? The key is to provide direct HTML
links to each page you want the spiders to visit. Also, keep the site
shallow, so that every page is only a few clicks from the home page.
Spiders usually start on your home page; if any part of your site cannot
be reached from there by following links, chances are the spider won’t
see it. This is where a site map can be invaluable, because it links to
every page from one place.
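One way to see how shallow a site is, and whether any page is cut off from the home page, is to walk its link structure breadth-first and count clicks. The Python sketch below assumes you already have a map of which page links to which. `SITE_LINKS`, `click_depths`, and `unreachable_pages` are hypothetical names and data invented for this illustration.

```python
from collections import deque

# Hypothetical site structure: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/products.html", "/sitemap.html"],
    "/products.html": ["/products/widgets.html"],
    "/products/widgets.html": [],
    "/sitemap.html": ["/products.html", "/products/widgets.html", "/about.html"],
    "/about.html": [],
    "/orphan.html": [],
}


def click_depths(links, home="/"):
    """Breadth-first walk from the home page; maps each reachable page to its click count."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def unreachable_pages(links, home="/"):
    """Pages that exist in the map but that no chain of links from home reaches."""
    return sorted(set(links) - set(click_depths(links, home)))
```

In this made-up site, /about.html is reachable only because the site map links to it, and /orphan.html is never reached at all, since no page links to it. A page like that is the kind a spider starting from the home page is likely to miss.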