Big SEOs with Crawlers: Let’s See Your Stats

January 3rd, 2008 tony

OK, I’m just ecstatic with my new crawler. I think nobody but Google has a better one, and I’m ready for a good old-fashioned show-and-tell. Multi-threaded programming is a bear to deal with, and I’ve written several crawlers in different languages. For years I’ve been plagued by several complex problems:

* Complex code that is difficult to maintain and difficult to set up on a server
* Memory leaks
* Limited configurability

So the latest design is just 192 lines of Python in a single file, has a single configuration file, and takes about 5 minutes to set up on a standard Linux machine. I ran it last night and was delighted with the results:

Test Run
Tested 139,740 urls
Completed in 2 hrs, 13 mins
3.6 GB of html
Average filesize: 25.05 KB

18.2 urls/second
1.572 million urls/day
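
The per-day figure is just the per-second rate extrapolated over a full 24 hours. This is a quick sketch of that arithmetic; it assumes the crawler could sustain 18.2 urls/second around the clock, which the two-hour test run doesn't prove on its own.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

urls_per_second = 18.2  # rate reported in the test run above
urls_per_day = urls_per_second * SECONDS_PER_DAY

# Roughly 1.572 million, matching the figure quoted above
print(f"{urls_per_day:,.0f} urls/day")
```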

Hardware and Environment
3-year-old Dell PowerEdge SC240
Pentium 4
3.5 GB of RAM
Average CPU load: 0.16
Average physical RAM used: 950 MB
OS: Ubuntu 7.10 (Gutsy Gibbon)
Filesystem: ReiserFS 3

Network connection:
Residential cable modem, 5 Mbps down (100% of which is consumed while it’s running, so it would likely be faster on a fatter pipe)

Even better, this code is infinitely extensible. We’ll spread it across as many machines as necessary to download the entire internet.
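
To give a feel for the general shape of a threaded crawler like this, here is a minimal sketch in modern Python 3. It is not the 192-line script described above, just an illustration under some assumptions: the names `fetch`, `crawl`, `num_threads` and `out_dir` are made up for the example, URLs arrive as `(page_id, url)` pairs, and each page is written to a flat file named by its ID.

```python
import queue
import threading
import urllib.request
from pathlib import Path


def fetch(url, timeout=10):
    """Download one URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl(urls, out_dir, num_threads=8, fetcher=fetch):
    """Fetch every (page_id, url) pair using a pool of worker threads.

    Each page is written to out_dir/<page_id>.html. Returns a dict
    mapping page_id to "ok" or an error string.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Thread-safe queue holding all pending work
    work = queue.Queue()
    for item in urls:
        work.put(item)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                page_id, url = work.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread exit
            try:
                body = fetcher(url)
                (out_dir / f"{page_id}.html").write_bytes(body)
                status = "ok"
            except Exception as exc:  # one bad URL must not kill the worker
                status = f"error: {exc}"
            with lock:
                results[page_id] = status

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `crawl([(1, "http://example.com/")], "pages")` would write `pages/1.html`. Since the work is almost entirely waiting on the network, plain threads are usually enough to keep a connection busy; the thread count here is a guess and would normally come from the configuration file.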

Big SEOs with Crawlers… what are your stats?

Categories: Code, Crawlers, Python
  1. Jordan Glasner
    January 4th, 2008 at 11:06 | #1

    Your per-day number is pretty ridiculous. Do you mind going into how you’re storing the pages?

  2. January 4th, 2008 at 13:18 | #2

    So we get to hear you brag but not see the code? ;)

    At my last job one of the developers did a similar thing in Ruby, while the rest of us toiled in Perl to get something similar. I’d say that you’ve got quite a winner on your hands.

    When you say you’re testing URLs, what exactly are you testing? Are you checking for keywords or regex matches, or are you verifying that a found URL exists and then pulling URLs from the (new) URL?

  3. tony
    January 4th, 2008 at 15:40 | #3

    The pages are stored as flat files, all in the same directory, and are named by ID. Performance usually gets pretty lousy when you put too many files in one dir, but after reading this article on filesystem performance I tried formatting my storage drive as ReiserFS 3 and it was all good. Before that I used the hashed-subdirectory method that blogger mentions, which works well, but if you don’t have to do all that checking within the crawler you can go that much faster.

    I attempted the same thing in Ruby but could never achieve stellar results from it. I’m considering releasing the code. As for the checking, this code is purely retrieving content from the web. Other processes will come into play after retrieval for some of the things you are speaking of, as well as dumping the content into a Lucene index.
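
For anyone unfamiliar with the hashed-subdirectory method mentioned above, here is a rough sketch of the idea: hash the page ID and use the first few hex characters as nested directory names, so no single directory ends up with too many files. The function names, the choice of MD5, and the two-level layout are all assumptions for illustration, not details from the crawler or the article referred to.

```python
import hashlib
from pathlib import Path


def hashed_path(root, page_id, levels=2, width=2):
    """Return root/ab/cd/<page_id>.html, where ab and cd are taken
    from the start of the MD5 hex digest of the page ID."""
    digest = hashlib.md5(str(page_id).encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, f"{page_id}.html")


def store(root, page_id, body):
    """Write the page body to its hashed location, creating directories as needed."""
    path = hashed_path(root, page_id)
    path.parent.mkdir(parents=True, exist_ok=True)  # the extra check per page
    path.write_bytes(body)
    return path
```

With two levels of two hex characters there are 65,536 possible leaf directories, so a run the size of the one above would average only a couple of files per directory. The cost is the extra hashing and directory check on every write, which is the overhead that storing everything in one directory avoids.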

  4. Vivevtvivas
    January 12th, 2008 at 20:16 | #4

    This is the coolest thing I’ve heard of in a while. I was wondering what it would take to write a crawler after a project that I was interested in pursuing took me in that direction. I never did anything with it but this is fascinating reading.

    p.s. Love your theme…

  5. Ramon
    January 19th, 2008 at 14:45 | #5

    “this code is purely retrieving content from the web” — so is this essentially a site downloader, i.e. functionally similar / identical to what HTTrack does (http://www.httrack.com/)?

  6. tony
    January 19th, 2008 at 16:12 | #6

    It’s only downloading HTML for SEO analysis.

  7. wheel
    January 31st, 2008 at 07:32 | #7

    Hey Tony,

    I’ve had Nutch throttled back and it still filled a 30-40 megabit line. That’s off of one server.

    I forget exactly how much it downloaded, but it was in the range of the number of URLs you’re talking about. And again, that was throttled back. Not sure what it would have done wide open.

  8. tony
    March 25th, 2008 at 11:02 | #8

    Nutch is indeed impressive, but damn is it difficult to get up and running. I spent some time exploring it to see if I could mould it to fit my needs and decided it would require too much hacking and would violate one of my requirements: a simple setup.

  9. Paulo Ricardo
    July 31st, 2009 at 07:10 | #9

    How can I get this code? Whether free or paid. If you charge for this, I’m willing to pay for it.