<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Webmasterworld disallows all search engine bots in robots.txt.  WTF!</title>
	<atom:link href="http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/</link>
	<description>It&#039;s Just Links</description>
	<lastBuildDate>Wed, 14 Sep 2011 13:47:04 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Tony Spencer</title>
		<link>http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/comment-page-1/#comment-441</link>
		<dc:creator>Tony Spencer</dc:creator>
		<pubDate>Wed, 23 Nov 2005 02:15:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/#comment-441</guid>
		<description>If you have a site that you want me to stop scraping, placing some CSS-hidden text will not stop me.  I simply alter my regular expressions to remove that injected text.  (You have to wrap the text in some tag, such as a div, in order to hide it from the end user.)

Not to mention, Tabke is against anything that adds bulk to the pages on Webmasterworld.  The only way to truly stop rogue bots is to track hits from IPs and block too many accesses within an interval (&lt;a href=&quot;http://www.webmasterworld.com/forum88/7288.htm&quot; rel=&quot;nofollow&quot;&gt;example script&lt;/a&gt;).  I rewrote this for my own stuff to implement a CAPTCHA for IPs that trip the filter.  To go a step further, you should regularly pull down lists of known proxies and completely ban them via .htaccess in Apache.
</description>
		<content:encoded><![CDATA[<p>If you have a site that you want me to stop scraping, placing some CSS-hidden text will not stop me.  I simply alter my regular expressions to remove that injected text.  (You have to wrap the text in some tag, such as a div, in order to hide it from the end user.)</p>
<p>Not to mention, Tabke is against anything that adds bulk to the pages on Webmasterworld.  The only way to truly stop rogue bots is to track hits from IPs and block too many accesses within an interval (<a href="http://www.webmasterworld.com/forum88/7288.htm">example script</a>).  I rewrote this for my own stuff to implement a CAPTCHA for IPs that trip the filter.  To go a step further, you should regularly pull down lists of known proxies and completely ban them via .htaccess in Apache.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Russ Jones</title>
		<link>http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/comment-page-1/#comment-440</link>
		<dc:creator>Russ Jones</dc:creator>
		<pubDate>Wed, 23 Nov 2005 01:01:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2005/11/22/webmasterworld-disallows-all-search-engine-bots-in-robotstxt-wtf/#comment-440</guid>
		<description>While I think his robots.txt reasoning is foolish, using Google&#039;s cached results seems to be a much more effective way of scraping a site on a keyword-by-keyword basis, especially for article syndication sites. Preventing indexing may be valuable in that regard.

Truly preventing scraping, though, if search rankings are not essential, could be accomplished by inserting some CSS-hidden text between every other word (or so), thus making the scraped content wholly unintelligible and moving the scraper on to another, less well-protected site.</description>
		<content:encoded><![CDATA[<p>While I think his robots.txt reasoning is foolish, using Google&#8217;s cached results seems to be a much more effective way of scraping a site on a keyword-by-keyword basis, especially for article syndication sites. Preventing indexing may be valuable in that regard.</p>
<p>Truly preventing scraping, though, if search rankings are not essential, could be accomplished by inserting some CSS-hidden text between every other word (or so), thus making the scraped content wholly unintelligible and moving the scraper on to another, less well-protected site.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

