Webmasterworld disallows all search engine bots in robots.txt. WTF!

November 22nd, 2005 Tony

When Brett Tabke decided to disallow all search engine bots via robots.txt on Webmasterworld, I was amazed to hear his reasoning: rogue bots check robots.txt, so disallowing them would make them go away. That hasn’t been my experience on any of my sites. In fact, when I write code to scrape the hell out of a site, I never include this line:

$handle = fopen("http://www.somesiteabouttobebangedon.com/robots.txt", "r");

I just let loose a slew of asynchronous threads and start chewing.
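
To make the point concrete, here is a minimal sketch of the check a polite bot performs and a rogue one skips. It is written in Python rather than PHP so it runs without touching the network. The blanket-disallow file, the bot name "PoliteBot", the example.com URL, and the helper allowed_by_robots are all assumptions for illustration, not Webmasterworld's actual robots.txt or anyone's real crawler.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents: a blanket disallow for every user agent.
# The real Webmasterworld file is not quoted in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the text directly instead of downloading it.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical bot name and URL.
    url = "http://www.example.com/forum/thread.html"
    print(allowed_by_robots(ROBOTS_TXT, "PoliteBot", url))
```

Nothing on the server enforces the answer. A bot that never calls anything like allowed_by_robots simply never finds out it was disallowed.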

Good luck to all those poor souls who were using Google for site search on Webmasterworld. Brett claims that there will be a functioning solution in 60 days. However, I hounded him about it over a year ago at Pubcon Las Vegas, and he said then, too, that it would be ready in 60 days. I hounded him about it again in the supporters forum and was promised something soon. While I was in Vegas last week for Pubcon November 2005, I also spoke to the programmer hired to build a Webmasterworld site search. He said he was planning a custom flat-file approach. Yikes. Reinventing the wheel might take more than 60 days, even for Larry Ellison. :)

On the Other Hand

Perhaps Brett’s real motivation for taking this seemingly ridiculous step is to improve the quality of the discussion. The massive flood of newbies to the board has created a level of noise that makes it damn near impossible to have useful conversations like the ones that went on a few years ago. There’s no doubt that the success of NickW’s Threadwatch is due in large part to an exodus of quality posters looking for a quieter place to talk SEO. (Many other factors make TW a great read, though, such as its attitude and style of moderation.) I started a private, invite-only SEO forum a year ago for the same reasons, and I now visit WMW very rarely, favoring Threadwatch and private forums instead. So maybe banning all bots is Tabke’s last-ditch effort to save the original SEO forum from implosion.

Categories: Search Engine Optimization
  1. Russ Jones
    November 22nd, 2005 at 20:01 | #1

    While I think his robots.txt reasoning is foolish, using Google’s cached results seems to be a much more effective way of scraping a site on a keyword-by-keyword basis, especially for article syndication sites. Preventing indexing may be valuable in that regard.

    Truly preventing scraping, though, if search rankings are not essential, could be accomplished by inserting some CSS-hidden text between every other word (or so). That makes the scraped content wholly unintelligible and sends the scraper on to another, less well-protected site.

  2. Tony Spencer
    November 22nd, 2005 at 21:15 | #2

    If you have a site that you want me to stop scraping, placing some CSS-hidden text will not stop me. I simply alter my regular expressions to remove that injected text. (You have to wrap the text in some tag, such as a div, to hide it from the end user, and that wrapper gives my pattern something to match.)
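
As a rough sketch of that counter-move, again in Python, assume the decoy words are wrapped in a div with an inline display:none style. That markup, the sample words, and the helper name strip_decoys are hypothetical. A site that hides the text through a class in an external stylesheet would need a different pattern, and nested divs would defeat this one.

```python
import re

# Assumed markup: decoys live in <div style="...display:none...">...</div>.
HIDDEN_DIV = re.compile(
    r'<div[^>]*style="[^"]*display:\s*none[^"]*"[^>]*>.*?</div>',
    re.IGNORECASE | re.DOTALL,
)
# Any remaining tag, so that only the visible text is left.
TAG = re.compile(r"<[^>]+>")


def strip_decoys(html: str) -> str:
    """Remove hidden decoy divs, drop other tags, and normalize whitespace."""
    without_decoys = HIDDEN_DIV.sub("", html)
    text = TAG.sub("", without_decoys)
    return " ".join(text.split())


if __name__ == "__main__":
    # Hypothetical scraped fragment with two injected decoys.
    page = (
        '<p>Quality <div style="display:none">zxq</div>links '
        '<div style="display: none;">blorp</div>matter</p>'
    )
    print(strip_decoys(page))
```

The defender pays for the decoys in page weight on every request, while the scraper pays once for a new pattern.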

    Not to mention, Tabke is anti anything that adds bulk to the pages on Webmasterworld. The only way to truly stop rogue bots is to track hits from IPs and block too many accesses within an interval (example script). I rewrote this for my own stuff to implement a CAPTCHA for IPs that trip the filter. To go a step further, you should regularly pull down lists of known proxies and completely ban them via .htaccess in Apache.
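
The hit-tracking idea can be sketched as a sliding window per IP. This is a minimal Python illustration, not the linked example script or my rewrite of it. The class name HitTracker, its parameters, and the thresholds are assumptions, and the clock is passed in so the logic is easy to exercise.

```python
from collections import defaultdict, deque


class HitTracker:
    """Flag an IP that makes more than max_hits requests within window seconds."""

    def __init__(self, max_hits: int, window: float) -> None:
        self.max_hits = max_hits
        self.window = window
        # Maps each IP to the timestamps of its recent hits.
        self.hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record a hit at time now; return False once the IP trips the filter."""
        recent = self.hits[ip]
        # Forget hits that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        return len(recent) <= self.max_hits


if __name__ == "__main__":
    # Hypothetical thresholds: at most 3 hits per 10 seconds.
    tracker = HitTracker(max_hits=3, window=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, tracker.allow("203.0.113.5", t))
```

In a real deployment, now would come from something like time.monotonic(), and a False result is where you would serve the CAPTCHA instead of the page. Blocked requests are still recorded, so a bot that keeps hammering stays blocked until it backs off.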