06.23.09
How To Effectively Use Robot.txt Files With Your Site
By Patrick Hare
On many occasions customers come to us with the complaint that they can't be found. They either had rankings on all search engines and suddenly disappeared, or were never seen in the first place. Believing that they are the victims of a search engine ban, they come to us for search engine optimization advice. In many cases, the culprit is found in the robots.txt file, in the form of the classic:

    User-agent: *
    Disallow: /

(Special Note: Using this command will make your site disappear in the search engines!)

The forward slash after the disallow tells the engines to ignore all files. The solution to this problem is to delete the forward slash, leaving the Disallow value empty, which tells search engines that everything is fair game. If you use Google Webmaster Tools, you will be told that the robots file prevents the indexing of your site. Many times a webmaster will upload this file accidentally, or forget to take it down when a dev site goes live. The command effectively tells every honest search engine spider to stop reading your site and go away.

Note that unethical spiders that scrape for phone numbers, email addresses, and content will not even bother to look at your robots.txt file, unless they are programmed to look for the files you don't want found. If you are looking to block spiders run by dishonest people on the internet, the robots.txt file is probably not going to help you, so you should look to server-level exclusions.

Depending on the complexity of your site, the robots.txt file can be modified to support your SEO initiatives. If you have a series of pages in a shopping cart, forum, or section that you want to exclude, you can disallow a specific directory:

    Disallow: /Example

If you have multiple directories, you would just add them to the list:

    User-agent: *
    Disallow: /Example
    Disallow: /secret_plans
    Disallow: /things_we_do_not_want_the_world_to_know

Alternatively, you can use a newer wildcard format that disallows pages with certain phrases or string segments in them.
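The blanket and directory rules described above can be checked offline before a robots.txt file is uploaded. The following is a minimal sketch using Python's standard urllib.robotparser module. The host www.example.com, the crawler name ExampleBot, and the helper can_crawl are placeholders chosen for illustration and do not come from the article.

```python
from urllib.robotparser import RobotFileParser

# Rule sets taken from the article text.
BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"
BLOCK_DIRS = (
    "User-agent: *\n"
    "Disallow: /Example\n"
    "Disallow: /secret_plans\n"
)


def can_crawl(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a rule-abiding crawler may fetch the URL.

    'ExampleBot' is a hypothetical crawler name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    home = "http://www.example.com/index.html"  # placeholder URL
    cart = "http://www.example.com/Example/cart.html"  # placeholder URL
    print("Disallow: /  lets crawler in:", can_crawl(BLOCK_ALL, home))
    print("Disallow:    lets crawler in:", can_crawl(ALLOW_ALL, home))
    print("Directory rule, home page:   ", can_crawl(BLOCK_DIRS, home))
    print("Directory rule, /Example:    ", can_crawl(BLOCK_DIRS, cart))
```

A check like this only models what a well-behaved spider would do. As noted above, scrapers that ignore robots.txt are unaffected by any of these rules.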
If you wanted to disallow all the pages with a session ID in them, you could use a command that says:

    Disallow: /*sessionid

Keep in mind that this will effectively shut out search engines for these pages, so you should ensure that your string is long enough that it does not accidentally blind the engines to pages that you want to get found. The wildcard robots disallow is ideal for people who may have bought sites and then found out that the site was a parked domain with thousands of "junk" pages installed by a previous owner. Even if you no longer have any of those pages on your site, it can take months for Google to notice that they no longer exist. By excluding them in your robots file, the removal of those cached pages can take less time.

In the past, people have disallowed the /images directory, but normally we don't recommend this. Image and universal search features on search engines allow your images to get indexed, and this leads to traffic. One of our clients made a substantial number of sales based on image search, so excluding this directory should be done with some thought.

If you want to exclude certain search engines, or direct them away from certain directories, it is easy to set up separate exclusion protocols in the file. For instance, excluding Yahoo! (which uses the "Slurp" robot) from seeing a directory would be done this way: Continue reading this article.
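Returning to the wildcard rule discussed earlier, the warning about string length can be made concrete with a small test. The sketch below is an assumption about how a wildcard-aware engine reads such a rule (the asterisk stands for any run of characters, and the rule is matched from the start of the path). It is not any search engine's actual code, and the sample paths and the helper names rule_to_regex and is_blocked are invented for illustration. Python's built-in robots.txt parser is not used here because, as far as we know, it treats rule paths as plain prefixes.

```python
import re


def rule_to_regex(rule_path: str) -> "re.Pattern[str]":
    """Translate a Disallow path containing '*' into a compiled regex.

    Assumption for this sketch: '*' matches any sequence of characters
    and everything else is matched literally.
    """
    literal_parts = [re.escape(part) for part in rule_path.split("*")]
    return re.compile(".*".join(literal_parts))


def is_blocked(rule_path: str, url_path: str) -> bool:
    """Return True if url_path falls under the wildcard Disallow rule."""
    # re.match anchors at the start of the path, like a prefix rule.
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    # Hypothetical paths, not taken from the article.
    samples = ["/cart/view?sessionid=123", "/guides/index.html"]
    for rule in ["/*sessionid", "/*id"]:
        for path in samples:
            print(f"{rule!r} blocks {path!r}: {is_blocked(rule, path)}")
```

In this sketch the longer string /*sessionid only catches the session URL, while the short string /*id also catches /guides/index.html, because "guides" happens to contain the letters "id". That is the kind of accidental blinding a longer, more specific string avoids.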
-- DevWebProNL is an iEntry, Inc. publication --
iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509
© 2009 iEntry Inc. All Rights Reserved