
Stopping crawlers from indexing your comments

Richy has started an interesting discussion about marking out blocks of text so that robots do not index them. The idea is that we could put something like <!-- robots:noindex --> tags around blocks of text, say the comments sections on weblogs, and have Google, or any other crawler, skip them. It's a good idea, but it got me thinking, and I came up with three ways of doing something similar in MT right now.
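Just to make the idea concrete, a comments section wrapped in the proposed tags might look something like this. This is purely a sketch: no crawler understands these tags yet, and the closing <!-- robots:index --> tag is my own guess at how a block would be terminated.

<!-- robots:noindex -->
<MTComments>
<p><$MTCommentBody$></p>
<p>Posted by <$MTCommentAuthorLink$></p>
</MTComments>
<!-- robots:index -->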

1. Using the comments popup windows instead of Individual Entry Archive pages

Instead of having the comments and TrackBack sections on the individual archive pages, use the popup windows. You can then stop robots from accessing them in one of a number of ways. The first is to put this block in your robots.txt file:

User-Agent: *
Disallow: /*.cgi$

That rule should tell crawlers not to index any URL ending in .cgi, which covers the popup windows (they are served by mt-comments.cgi). Bear in mind that the * and $ wildcards are extensions to the original robots.txt standard, so not every crawler will honour them. You can go further, however, and use this, perhaps in addition to the above:

User-Agent: *
Disallow: /cgi-bin/mt/

Here, I’ve told the robots to leave my entire MT directory structure alone. You’ll need to change /cgi-bin/mt/ to match the directory where your MT scripts are installed.
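For what it's worth, the two rules can sit in the same record, so a complete robots.txt would end up looking like this (again, with /cgi-bin/mt/ changed to suit your own setup):

User-Agent: *
Disallow: /*.cgi$
Disallow: /cgi-bin/mt/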

2. Telling crawlers not to follow links on individual archive pages

This one is quite trivial. You need to add the following line of markup to the <head> section of your Individual Entry Archive template:

<meta name="robots" content="index, nofollow" />

This tells crawlers to index the page, but not to follow any of the links on it. This means that if any spammed URLs are posted, the crawler should not follow those links and therefore not give them a PageRank boost (at least in theory). The disadvantage is that any links you have included in the main body of the post will not be followed either.
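For context, the tag simply sits in the template's <head> alongside whatever is already there; a minimal sketch, with the title line only as an example:

<head>
<title><$MTBlogName$>: <$MTEntryTitle$></title>
<meta name="robots" content="index, nofollow" />
</head>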

3. Putting all posted URLs in comments through a redirect script

Get yourself a redirect script, in whatever language you choose (PHP, Perl, Python…), and then force all URLs posted with comments to be passed through it. At the simplest level, you just need to change <$MTCommentAuthorLink$> to something like this, where redirect.cgi is whatever your redirect script happens to be called:

<a href="/pathtoscript/redirect.cgi?url=<$MTCommentURL$>"><$MTCommentAuthor$></a>
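If you don't already have a redirect script to hand, something along these lines would do. It's only a rough sketch in Python, and I'm assuming it is saved as redirect.cgi inside /pathtoscript/ so that it matches the robots.txt rule below:

#!/usr/bin/env python3
# Rough sketch of a CGI redirect script, assumed to live at /pathtoscript/redirect.cgi
import os
import sys
from urllib.parse import parse_qs

# Pull the target URL out of the query string, e.g. redirect.cgi?url=http://example.com/
query = parse_qs(os.environ.get("QUERY_STRING", ""))
url = query.get("url", ["/"])[0]

# Only bounce visitors to web URLs, so the script can't be abused for other schemes
if not url.startswith(("http://", "https://")):
    url = "/"

# Send a 302 redirect back through the web server
sys.stdout.write("Status: 302 Found\r\n")
sys.stdout.write("Location: %s\r\n\r\n" % url)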

Now, in your robots.txt file, add this (changing /pathtoscript/ to wherever the redirect script actually lives):

User-Agent: *
Disallow: /pathtoscript/

This is what WebmasterWorld uses on its boards to discourage users from posting URLs simply to give them a PageRank boost (as opposed to taking part in the discussions).

The proposal that Richy puts forward, which is based on an open letter to Google by Colin Roald, is noteworthy because it's simple enough to be included in MT's default templates. Comment spam is becoming such an issue that I think Six Apart may well be justified in including some robot-blocking tags by default, so that more people will have them enabled. The author cites this entry on a somewhat abandoned blog as an example: it's full of comment spam, although I'd like to think that Google's algorithm would be intelligent enough to realise that the page has been 'hijacked' by spammers and therefore isn't worth indexing. It already does that for some free link sites.

In any case, I hope this is food for thought. I still think that using a tool like MT-Blacklist, combined with regular clearing-out of spam, is probably the best option of all, but you may get some mileage from these suggestions.
