Meta Robots Tag and robots.txt
Written by Seo Master   
Saturday, 01 September 2007

There are two ways you can restrict a spider from crawling all or part of your site. The first is to place the META Robots tag within the "head" section of your HTML file (which makes it effective only for the pages where the tag is inserted). The second is to write a special instruction file called "robots.txt" and put it in the root directory of your site.

Robots directives are useful for SEO because a search engine spider is understood to index only a limited number of pages within your domain. Whatever this limit might be, you don't want to waste it on files that are not optimized or not meant to be seen by search engines.

The robots.txt file was also introduced to stop robots from deluging servers with rapid-fire requests or re-indexing the same files repeatedly. If you have duplicate content on your site for any reason, you can keep the duplicates from being indexed, which helps you avoid duplicate content penalties.

Also, webmasters might want to exclude contents of private or secret folders from indexing.

The META Robots tag

The Robots META tag is a tag within the HTML code of a page that tells search engine robots whether they should index that page and follow its links. Use it on any pages you want kept out of the search engine indices (e.g., order forms and guest books).

In the HTML code of a Web page, a sample Robots META tag looks like this:

<meta name="robots" content="index, follow" />

"index" means that a search engine is allowed to index this page, and "follow" means it is allowed to follow the links and discover the new pages this one links to.

You can instruct a search engine not to index a page by changing the content of the tag to “noindex, follow” or “noindex, nofollow” if you don't want it to follow links on the page either.

The Robots META tag must be placed in the "head" section of your HTML code. Some search engines do not support this tag and honor only the Robots Exclusion Protocol, which all major search engines support.
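
To make the tag's effect concrete, here is a minimal Python sketch of how a crawler might read the directives from a page. It uses the standard library's html.parser module; the class name RobotsMetaParser and the function robots_directives are hypothetical helpers invented for this illustration, not part of any search engine's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="..."> robots tags (hypothetical helper)."""

    def __init__(self, bot_name="robots"):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        if attributes.get("name", "").lower() != self.bot_name:
            return
        # The content attribute is a comma-separated list of directives.
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html, bot_name="robots"):
    """Return the set of directives found for the given meta name."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
print(robots_directives(page))
```

Passing bot_name="googlebot" or bot_name="msnbot" would read a robot-specific tag in the same way.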

Googlebot and MSNBot tags

As you may remember, Google's and MSN's spiders are called GoogleBot and MSNBot respectively. When reading your HTML pages, these "bots" look for special META tags called META GoogleBot and META MSNBot. These tags give Webmasters who do not have access to the root domain directory (for placing a "robots.txt" file, discussed later) a way to close parts of their sites to crawling by these two robots.

The syntax is as follows:

<meta name="googlebot" content="noindex" />

(You may use "noindex", "nofollow", "noarchive", "nosnippet", or any combination of these values separated by commas for the "content" attribute. For instance, "nosnippet" will tell Google not to display snippets of your page in its SERP and not to archive a copy of the document.)

The same syntax can be used on your page for MSNBot:

<meta name="msnbot" content="noindex, nofollow">

Please keep in mind that GoogleBot recognizes only the four commands mentioned above, and MSNBot only two of them (noindex and nofollow). Both ignore commands like "index" or "follow".
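
The rule about recognized commands can be sketched in a few lines of Python. The table below simply restates the lists given in this lesson; it is an assumption taken from the text, not an official specification from either search engine, and the function name effective_directives is hypothetical.

```python
# Commands each robot recognizes, as listed in this lesson (not an official list).
RECOGNIZED = {
    "googlebot": {"noindex", "nofollow", "noarchive", "nosnippet"},
    "msnbot": {"noindex", "nofollow"},
}


def effective_directives(bot_name, content):
    """Return only the directives from a content attribute that the bot acts on."""
    recognized = RECOGNIZED[bot_name.lower()]
    tokens = {token.strip().lower() for token in content.split(",")}
    return tokens & recognized


print(effective_directives("msnbot", "noindex, nosnippet, follow"))
```

Here "nosnippet" and "follow" are dropped for MSNBot, leaving only "noindex".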

Robots Exclusion Protocol (robots.txt File)

The Robots Exclusion Protocol, commonly referred to as the robots.txt file, is another way for Web site administrators to tell visiting robots which parts of their site should not be visited and indexed.

When a search robot visits a web site, it first checks for the existence of a file called "robots.txt" in the root directory of the site (www.yoursite.com/robots.txt). If this document is detected, the spider will follow the instructions found within.

A robots.txt file contains information in the following format:
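
As a small illustration of where a robot looks, the following Python sketch derives the robots.txt address from any page URL on the same site. The function name robots_txt_url and the sample page URL are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yoursite.com/shop/order.html?id=1"))
```

Whatever the depth of the page, the result always points at the root of the host.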

User-agent: *
Disallow: /

Each record in the file contains two fields: the first names the robot it addresses, and the second names the directory (or directories) that robot may not browse.

The line with the "Disallow" instruction specifies the URLs that the named robots may not access.

Here "*" means all robots and "/" means all URLs. When specifying URLs, you write everything that follows your root (home) URL, including the leading slash. Disallow values are matched as prefixes, so a lone slash matches every URL on the site. The record therefore reads as "No access for any search engine to any URL".
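
One way to check how such a record is interpreted is Python's standard urllib.robotparser module, which implements the Robots Exclusion Protocol. This is only a sketch: "ExampleBot" and the page URLs are placeholders, and real crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The two-field record from the text, one field per line.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /"])

# "ExampleBot" and the URLs below are placeholders for illustration.
home_allowed = rules.can_fetch("ExampleBot", "http://www.yoursite.com/")
page_allowed = rules.can_fetch("ExampleBot", "http://www.yoursite.com/products/list.html")
print(home_allowed, page_allowed)
```

Both checks come back False, because the lone slash is a prefix of every path on the site.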

In the following example, nothing is restricted from Googlebot so it may browse any files and directories:

# Guarantees access for Googlebot (characters after # and up to newline
# are considered comments).
User-agent: Googlebot
Disallow:

If you ever need to instruct multiple spiders about multiple directories, you can write several records, separated by a blank line:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /cgi-bin/

This keeps all spiders out of your "cgi-bin" directory (where most webmasters keep their server-side scripts), except Googlebot, which retains access to it.
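
The two-record file above can be tested with Python's standard urllib.robotparser module. This is a sketch under stated assumptions: "OtherBot" and the script path are hypothetical, and the parser reflects the standard protocol rather than any particular search engine's implementation.

```python
from urllib.robotparser import RobotFileParser

# Two records separated by a blank line, as in the example above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /cgi-bin/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# The script URL and "OtherBot" are placeholders for illustration.
script_url = "http://www.yoursite.com/cgi-bin/search.cgi"
googlebot_allowed = rules.can_fetch("Googlebot", script_url)
other_allowed = rules.can_fetch("OtherBot", script_url)
other_home_allowed = rules.can_fetch("OtherBot", "http://www.yoursite.com/index.html")
print(googlebot_allowed, other_allowed, other_home_allowed)
```

Googlebot matches its own record and is allowed everywhere, while every other robot falls through to the "*" record and is kept out of "cgi-bin" only.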

Comments on using Robots Exclusion Protocol

  1. The "robots.txt" file must always be named in lowercase, even if your site is hosted on a case-insensitive platform like Windows (e.g. "Robots.txt" or "robots.Txt" is incorrect).
  2. Wildcards are not supported in either field, with one exception: "*" can be used in the User-agent field to denote "all". Googlebot is the only robot that now supports some wildcard file extensions, giving you the possibility to exclude certain file types from indexing. More information can be found on http://www.google.com/webmasters/
  3. Website functionality is not affected if your robots.txt is absent or empty, though this does open all areas and pages of your site to all robots. However, with some servers and some crawlers, a request for an absent robots.txt file can produce a 404 error and redirect the robot to your default 404 error page. The robot may then treat that page as your "robots.txt" file, and its behavior becomes unpredictable. We recommend you always have a "robots.txt" file.
  4. Only one robots.txt file can be maintained per domain and it must be placed in the root directory of your site, i.e. in the same directory where you keep your home page.
  5. Website owners who do not have administrative rights or write access to the root domain URL will probably not be able to use a robots.txt file. In such situations, you may attempt to use the META Robots tag (see the related comments above in this lesson).
  6. Separate lines are required for specifying access to different user agents, and the Disallow field of the robots.txt file should not carry more than one command per line, though there is no limit to the number of lines. Both the User-agent and Disallow fields can be repeated with different commands any number of times. Blank lines are not allowed within a single record; they separate one record from the next.
  7. Use lowercase for all robots.txt file content (except where you need to provide a directory or file name in upper case on a case-sensitive platform, e.g. Unix).
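
Some of these rules, in particular rule 6, are mechanical enough to check automatically. The following Python function is a rough sketch of such a check, not a complete validator; the name lint_robots_txt and the warning texts are invented for this example, and it only knows the User-agent and Disallow fields discussed in this lesson.

```python
def lint_robots_txt(text):
    """Return (line_number, message) warnings for a robots.txt body."""
    warnings = []
    in_record = False  # a User-agent line has opened a record
    has_rule = False   # the current record already has a Disallow line
    for number, raw in enumerate(text.splitlines(), start=1):
        if raw.strip().startswith("#"):
            continue  # comment-only lines do not end a record
        line = raw.split("#", 1)[0].strip()
        if not line:
            # A blank line ends the record; flag it if no rule was seen yet.
            if in_record and not has_rule:
                warnings.append((number, "blank line between User-agent and Disallow"))
            in_record = has_rule = False
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        if field == "user-agent":
            if has_rule:
                has_rule = False  # a new record starts without a blank line
            in_record = True
        elif field == "disallow":
            if not in_record:
                warnings.append((number, "Disallow without a preceding User-agent"))
            if len(value.split()) > 1:
                warnings.append((number, "more than one path on a Disallow line"))
            has_rule = True
        else:
            warnings.append((number, "unrecognized field: " + field))
    return warnings


print(lint_robots_txt("User-agent: *\n\nDisallow: /\n"))
```

A file that puts a blank line between the User-agent and Disallow lines produces two warnings here: one for the blank line and one for the orphaned Disallow that follows it.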

More rules and guidelines on using Robots can be found at http://www.robotstxt.org/wc/norobots.html
