Comments on: Webmasterworld disallows all search engine bots in robots.txt. WTF!
http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/

By: Tony Spencer
Wed, 23 Nov 2005 02:15:25 +0000
http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/comment-page-1/#comment-441

If you have a site that you want me to stop scraping, placing some CSS-hidden text will not stop me. I simply alter my regular expressions to remove the injected text. (You have to wrap the text in some tag, such as a div, in order to hide it from the end user.)

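To make that concrete, here is a rough sketch in Python of the kind of clean-up I mean, assuming the hidden text sits in a div or span with an inline display:none style (the markup details are illustrative, not any particular site's):

    import re

    # Remove any div/span hidden with an inline display:none style,
    # together with everything inside it.
    HIDDEN = re.compile(
        r'<(div|span)[^>]*display\s*:\s*none[^>]*>.*?</\1>',
        re.IGNORECASE | re.DOTALL,
    )

    def strip_hidden(html):
        return HIDDEN.sub('', html)

    # 'real <span style="display:none">junk</span> text'  ->  'real  text'
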
Not to mention, Tabke is anti anything that adds bulk to the pages on WebmasterWorld. The only way to truly stop rogue bots is to track hits from IPs and block too many accesses within an interval (example script: http://www.webmasterworld.com/forum88/7288.htm). I rewrote this for my own stuff to implement a captcha for IPs that trip the filter. To go a step further, you should regularly pull down lists of known proxies and completely ban them via .htaccess in Apache.

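The gist of that kind of throttle, sketched in Python (the window, threshold, and captcha hand-off here are illustrative; this is not the linked script):

    import time
    from collections import defaultdict

    WINDOW = 60      # seconds
    MAX_HITS = 30    # hits allowed per IP inside the window (illustrative)

    hits = defaultdict(list)   # ip -> timestamps of recent hits

    def allow(ip):
        """Return False once an IP exceeds MAX_HITS within WINDOW seconds."""
        now = time.time()
        hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
        hits[ip].append(now)
        return len(hits[ip]) <= MAX_HITS

    # Per request: if not allow(client_ip), serve a captcha instead of the page.
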
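For the proxy bans, a small sketch that turns a downloaded proxy list into .htaccess deny rules (the filenames are hypothetical; the syntax is the old Apache order/deny form):

    # known_proxies.txt: one proxy IP per line (hypothetical filename).
    with open('known_proxies.txt') as f:
        ips = [line.strip() for line in f if line.strip()]

    # Emit Apache 1.3/2.0-style rules to paste into .htaccess:
    # listed IPs are denied, everyone else is allowed.
    with open('htaccess_deny.txt', 'w') as out:
        out.write('order allow,deny\n')
        out.write('allow from all\n')
        for ip in ips:
            out.write('deny from %s\n' % ip)
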
By: Russ Jones
Wed, 23 Nov 2005 01:01:53 +0000
http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/comment-page-1/#comment-440

While I think his robots.txt reasoning is foolish, using Google's cached results seems to be a much more effective way of scraping a site on a keyword-by-keyword basis, especially for article syndication sites. Preventing indexing may be valuable in that regard.

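For instance, a rough Python sketch of pulling a cached copy with the cache: query operator (Google throttles automated queries, so treat this as illustrative only):

    import urllib.request

    def fetch_cached(url):
        # 2005-era form of the cached-result query; may be blocked for bots.
        cache_url = 'http://www.google.com/search?q=cache:' + url
        req = urllib.request.Request(cache_url,
                                     headers={'User-Agent': 'Mozilla/5.0'})
        return urllib.request.urlopen(req).read().decode('utf-8', 'replace')
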
Truly preventing scraping, though, if search rankings are not essential, could be accomplished by inserting some CSS-hidden text between every other word (or so), thus making the scraped content wholly unintelligible and moving the scraper on to another, less well protected site.

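A sketch of that interleaving in Python (the hidden-span class name and the junk words are made up; the class would be set to display:none in the site's stylesheet):

    import random

    JUNK = ['lorem', 'ipsum', 'dolor', 'sit', 'amet']   # filler, illustrative

    def poison(text):
        """Insert a CSS-hidden junk word after every other visible word."""
        out = []
        for i, word in enumerate(text.split()):
            out.append(word)
            if i % 2 == 1:
                out.append('<span class="hdn">%s</span>' % random.choice(JUNK))
        return ' '.join(out)

    # .hdn { display: none } lives in the stylesheet, so readers never see it,
    # but a naive scraper keeps the junk and gets gibberish.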