<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Big SEO&#8217;s with Crawlers: Lets See Your Stats</title>
	<atom:link href="http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/</link>
	<description>It&#039;s Just Links</description>
	<lastBuildDate>Wed, 14 Sep 2011 13:47:04 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Paulo Ricardo</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-177405</link>
		<dc:creator>Paulo Ricardo</dc:creator>
		<pubDate>Fri, 31 Jul 2009 12:10:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-177405</guid>
		<description>How can I get this code, whether free or paid? If you charge for it, I&#039;m willing to pay.</description>
		<content:encoded><![CDATA[<p>How can I get this code, whether free or paid? If you charge for it, I&#8217;m willing to pay.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: tony</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-66988</link>
		<dc:creator>tony</dc:creator>
		<pubDate>Tue, 25 Mar 2008 16:02:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-66988</guid>
		<description>@wheel
Nutch is indeed impressive, but damn is it difficult to get up and running.  I spent some time exploring it to see if I could mould it to fit my needs and decided it would require too much hacking and would violate my requirement of a simple setup.</description>
		<content:encoded><![CDATA[<p>@wheel<br />
Nutch is indeed impressive, but damn is it difficult to get up and running.  I spent some time exploring it to see if I could mould it to fit my needs and decided it would require too much hacking and would violate my requirement of a simple setup.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: wheel</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-53012</link>
		<dc:creator>wheel</dc:creator>
		<pubDate>Thu, 31 Jan 2008 12:32:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-53012</guid>
		<description>Hey Tony,

I&#039;ve had Nutch throttled back and it still filled a 30-40 megabit line.  That&#039;s off of one server.

I forget exactly how much it downloaded, but it was in the range of the number of URLs you&#039;re talking about.  And again, that was throttled back.  Not sure what it would have done wide open.</description>
		<content:encoded><![CDATA[<p>Hey Tony,</p>
<p>I&#8217;ve had Nutch throttled back and it still filled a 30-40 megabit line.  That&#8217;s off of one server.</p>
<p>I forget exactly how much it downloaded, but it was in the range of the number of URLs you&#8217;re talking about.  And again, that was throttled back.  Not sure what it would have done wide open.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: tony</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-50618</link>
		<dc:creator>tony</dc:creator>
		<pubDate>Sat, 19 Jan 2008 21:12:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-50618</guid>
		<description>@Ramon
It&#039;s only downloading HTML for SEO analysis.</description>
		<content:encoded><![CDATA[<p>@Ramon<br />
It&#8217;s only downloading HTML for SEO analysis.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ramon</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-50611</link>
		<dc:creator>Ramon</dc:creator>
		<pubDate>Sat, 19 Jan 2008 19:45:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-50611</guid>
		<description>&quot;this code is purely retrieving content from the web&quot; -- so is this essentially a site downloader, i.e. functionally similar or identical to what HTTrack does (http://www.httrack.com/)?</description>
		<content:encoded><![CDATA[<p>&#8220;this code is purely retrieving content from the web&#8221; &#8212; so is this essentially a site downloader, i.e. functionally similar or identical to what HTTrack does (<a href="http://www.httrack.com/">http://www.httrack.com/</a>)?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Vivevtvivas</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-49073</link>
		<dc:creator>Vivevtvivas</dc:creator>
		<pubDate>Sun, 13 Jan 2008 01:16:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-49073</guid>
		<description>This is the coolest thing I&#039;ve heard of in a while.  I was wondering what it would take to write a crawler after a project that I was interested in pursuing took me in that direction.  I never did anything with it but this is fascinating reading.

p.s.  Love your theme...</description>
		<content:encoded><![CDATA[<p>This is the coolest thing I&#8217;ve heard of in a while.  I was wondering what it would take to write a crawler after a project that I was interested in pursuing took me in that direction.  I never did anything with it but this is fascinating reading.</p>
<p>p.s.  Love your theme&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: tony</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-47377</link>
		<dc:creator>tony</dc:creator>
		<pubDate>Fri, 04 Jan 2008 20:40:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-47377</guid>
		<description>@Jordan
The pages are stored as flat files, all in the same directory, and are named by ID.  Usually performance gets pretty lousy when you put too many files in one dir, but after reading this article on filesystem performance I tried formatting my storage drive in ReiserFS3 and it was all good. Before that I used the hashing subdirectory method that the blogger mentions, which works well, but if you don&#039;t have to do all that checking within the crawler you can go that much faster.
http://ygingras.net/b/2007/12/too-many-files%3A-reiser-fs-vs-hashed-paths

@Michael
I attempted the same thing in Ruby but could never achieve stellar results from it. I&#039;m considering releasing the code. As for the checking, this code is purely retrieving content from the web.  Other processes will come into play after retrieval for some of the things you are speaking of, as well as dumping the content into a Lucene index.</description>
		<content:encoded><![CDATA[<p>@Jordan<br />
The pages are stored as flat files, all in the same directory, and are named by ID.  Usually performance gets pretty lousy when you put too many files in one dir, but after reading this article on filesystem performance I tried formatting my storage drive in ReiserFS3 and it was all good. Before that I used the hashing subdirectory method that the blogger mentions, which works well, but if you don&#8217;t have to do all that checking within the crawler you can go that much faster.<br />
<a href="http://ygingras.net/b/2007/12/too-many-files%3A-reiser-fs-vs-hashed-paths">http://ygingras.net/b/2007/12/too-many-files%3A-reiser-fs-vs-hashed-paths</a></p>
<p>@Michael<br />
I attempted the same thing in Ruby but could never achieve stellar results from it. I&#8217;m considering releasing the code. As for the checking, this code is purely retrieving content from the web.  Other processes will come into play after retrieval for some of the things you are speaking of, as well as dumping the content into a Lucene index.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Thompson</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-47363</link>
		<dc:creator>Michael Thompson</dc:creator>
		<pubDate>Fri, 04 Jan 2008 18:18:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-47363</guid>
		<description>So we get to hear you brag but not see the code? ;)

At my last job one of the developers did a similar thing in Ruby, while the rest of us toiled in Perl to get something similar. I&#039;d say that you&#039;ve got quite a winner on your hands.

When you say you&#039;re testing URLs, what exactly are you testing? Are you checking for keywords or regex matches, or are you verifying that a found URL exists and then pulling URLs from the (new) URL?</description>
		<content:encoded><![CDATA[<p>So we get to hear you brag but not see the code? <img src='http://www.tonyspencer.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>At my last job one of the developers did a similar thing in Ruby, while the rest of us toiled in Perl to get something similar. I&#8217;d say that you&#8217;ve got quite a winner on your hands.</p>
<p>When you say you&#8217;re testing URLs, what exactly are you testing? Are you checking for keywords or regex matches, or are you verifying that a found URL exists and then pulling URLs from the (new) URL?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jordan Glasner</title>
		<link>http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/comment-page-1/#comment-47343</link>
		<dc:creator>Jordan Glasner</dc:creator>
		<pubDate>Fri, 04 Jan 2008 16:06:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/#comment-47343</guid>
		<description>Your per-day number is pretty ridiculous. Do you mind going into how you&#039;re storing the pages?</description>
		<content:encoded><![CDATA[<p>Your per-day number is pretty ridiculous. Do you mind going into how you&#8217;re storing the pages?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

