How to use a robots.txt file to prevent WordPress duplicate content issues

WordPress generates archive, tag, comment, and category pages that can create duplicate content issues. With a robots.txt file we can tell search engines not to index our category archives, our tag archives, and our year and month archives.

In this article you will learn how to make one ultimate robots.txt file for WordPress that keeps these duplicate pages out of the index. We will also instruct search engines not to index our wp-admin area or other non-essential folders on our server, and we can ask bad bots not to index any pages on our site, though they tend to do as they wish.

The WordPress robots.txt file can be found at yoursite.com/robots.txt, but we have to customize it for better SEO.

You can create a robots.txt file in any text editor and place it in the root directory/folder of your website, where WordPress is installed, and search engines will find it automatically. If you find that difficult, you can use the KB Robots.txt WordPress plugin to replace an existing robots.txt file or create a new one. The following robots.txt is quite simple but accomplishes a lot in a few lines:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: /tag/
Disallow: */trackback
Disallow: */comments

The first line, “User-agent: *”, means that this robots.txt file applies to any and all spiders and bots. The next twelve lines all begin with Disallow. The Disallow directive simply means “don’t index this location”. The first Disallow directive tells spiders not to index our /cgi-bin folder or its contents. The next five Disallow directives tell spiders to stay out of our WordPress admin area and core folders. The last six Disallow directives cure the duplicate content generated through trackbacks, comments, and category and tag pages.
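Before relying on these rules, it helps to sanity-check them. The sketch below uses Python's standard urllib.robotparser module to parse the live file and report which paths are blocked for ordinary crawlers; yoursite.com and the sample paths are placeholders for your own site. Note that the standard-library parser does not understand the * wildcard in rules like /category/*/*, so verify those with a tool such as Google Webmaster Tools.

from urllib.robotparser import RobotFileParser

# Parse the live robots.txt (yoursite.com is a placeholder for your own domain).
parser = RobotFileParser("http://yoursite.com/robots.txt")
parser.read()

# Sample paths: the first three should come back blocked, the last one allowed.
test_paths = [
    "/wp-admin/options.php",
    "/wp-content/plugins/some-plugin.js",
    "/tag/wordpress/",
    "/hello-world/",
]

for path in test_paths:
    allowed = parser.can_fetch("*", "http://yoursite.com" + path)
    print(path, "allowed" if allowed else "blocked")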

We can also disable indexing of year and month archive pages by adding a few more lines, one for each year of archives:

Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2010/
Disallow: /2011/
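You do not have to type these year lines by hand; a tiny Python sketch like the one below (the 2006 to 2011 range is just this example) prints them so you can paste the output into robots.txt.

# Generate one Disallow line per year of archives; adjust the range
# to match the years your blog has been online.
start_year, end_year = 2006, 2011
for year in range(start_year, end_year + 1):
    print("Disallow: /%d/" % year)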

We can also direct e-mail harvesting programs, link exchange schemes, worthless search engines, and other undesirable website visitors not to index our site:

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

These lines instruct the named bots not to index any pages on your site. You can create new entries if you know the name of the user agent that you wish to disallow. SiteSnagger and WebStripper are both tools that crawl and copy entire websites so that their users can view them offline. These bots are very unpopular with webmasters because they crawl thoroughly, aggressively, and without pausing, increasing the load on web servers and degrading performance for legitimate visitors.
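To confirm that these per-bot rules behave as intended, the same urllib.robotparser check can be run with different user-agent names; yoursite.com and the post URL are again placeholders. Keep in mind that robots.txt is purely advisory: well-behaved crawlers honor it, while bad bots often ignore it.

from urllib.robotparser import RobotFileParser

# Check that the named bad bots are blocked everywhere while an ordinary
# crawler ("*") can still reach a normal post.
parser = RobotFileParser("http://yoursite.com/robots.txt")
parser.read()

post_url = "http://yoursite.com/hello-world/"
for agent in ("SiteSnagger", "WebStripper", "*"):
    allowed = parser.can_fetch(agent, post_url)
    print(agent, "allowed" if allowed else "blocked")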
