tony spencer » Code
It's Just Links
Sat, 19 Oct 2013 14:31:31 +0000

Ruby Could Replace my Python Crawler Pretty Soon
Mon, 28 Jul 2008 19:30:30 +0000, by tony

One of my developers just sent me some truly incredible stats about Ruby 1.9 and its threading performance.

20 threads * 100,000 iterations
Ruby 1.9 = 1.54 s.
Ruby Enterprise = 3.01 s.
JRuby 1.1.2 = 5.82 s.
Jython 2.2.1 = 11.86 s.
Python 2.5.2 = 12.32 s.
Ruby 1.8.7 = 22.68 s.
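
For a feel of what a "20 threads * 100,000 iterations" test might look like, here is a minimal sketch. The original benchmark's workload is not shown in this post, so the counter loop and the `run_threads` helper below are assumptions for illustration only, and the timings will not be comparable to the numbers above.

```ruby
require "benchmark"

THREADS = 20
ITERATIONS = 100_000

# Assumed workload: each thread increments its own local counter.
# The real benchmark may have done something quite different.
def run_threads(thread_count, iterations)
  threads = Array.new(thread_count) do
    Thread.new do
      count = 0
      iterations.times { count += 1 }
      count # the block's last value becomes the thread's value
    end
  end
  # Thread#value waits for the thread to finish and returns its result
  threads.sum(&:value)
end

total = nil
elapsed = Benchmark.realtime { total = run_threads(THREADS, ITERATIONS) }
puts format("%d iterations across %d threads in %.2f s", total, THREADS, elapsed)
```

Each thread keeps its own counter, so no locking is needed and the total is simply the sum of the per-thread results.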

Since our test of Ruby as a crawler really wasn’t all that much slower than Python, it will be really interesting to see what happens with Ruby 1.9.

The blog post about the test (it’s in Polish)

Python is Ugly but Damn She’s Beautiful
Thu, 27 Mar 2008 03:56:57 +0000, by tony

Remember the Python crawler NotSleepy built to suck up all your internets and find your affiliate IDs? Well, we kept massaging the code and finally slapped that thing down on a fat pipe. WOW. The stats are rocking now. How about double time!

Latest Stats:
35.6 URLs per second
3.073 Million URLs per day!

What’s most promising is that the new fat pipe is still the bottleneck, which means that if anybody really wants to party, all we need to do is lay down some greenbacks and an OC-12 will show us mass terabyte pleasure.

Why Basecamp Sucks…
Wed, 12 Mar 2008 19:02:07 +0000, by tony

Come on guys. It’s not that freaking hard. Here, let me help you get started:

script/generate migration add_status_comments_and_assigned_user_to_To_Dos
Easy Solution for Conflicting Rails Migration Version Numbering
Sun, 02 Mar 2008 16:19:13 +0000, by tony

That’s a crappy blog post title but the best I could come up with! You know the scenario: you are about to commit your latest Rails code to Subversion and you perform an update first. Rats. Someone has committed a new migration with the same number as yours. So you mess around with SQL, manually reversing changes and renaming files. It’s a pain, and I think this single problem with Rails causes much stress because as Rails developers we are used to everything working so smoothly. Well, Steve Purcell has solved this problem in a beautiful way.

Install his plugin ‘renumber_migrations’:
script/plugin install

Next time you run into this migrations mess:
rake db:migrate:renumber

Problem solved!

One note: for some reason I couldn’t get it to work until I removed line 18:
raise "This task currently supports only subversion projects"

Don’t know why he added that line, but once it was removed it worked perfectly. Thank you very much Steve! Now if someone will write a nice script to set up a bunch of common ignore properties (log/, schema.rb, tmp/) in SVN when first importing a new Rails project…. :)

Crazy Python Crawler
Mon, 07 Jan 2008 18:23:07 +0000, by tony

Someone emailed me doubting my crawler could operate at the speeds I posted last week, so here is a video I took this morning. I should have waited a few minutes after launching it before starting the video, as it really starts cranking once all the threads get rocking; you can see that near the end of the video. Also notice my streaming internet radio going in and out thanks to no available bandwidth left on my 5Mbps line.

You can also hear a ticking sound. That is my new 1TB drive. It makes these weird ticking noises even when it’s not in use. Really sounds like the arm hitting something it’s not supposed to hit. Hope it’s not defective.

Video link

Big SEO’s with Crawlers: Lets See Your Stats
Fri, 04 Jan 2008 04:07:39 +0000, by tony

OK I’m just ecstatic with my new crawler, I think nobody but Google has one better than me, and I’m ready for a good old-fashioned show-and-tell. Multi-threaded programming is a bear to deal with and I’ve written several crawlers in different languages. For years I’ve been plagued with several complex problems:

* Complex code that is difficult to maintain and difficult to set up on a server
* Memory leakage
* Configurability

So the latest design is just 192 lines of Python in a single file, has a single configuration file, and takes about 5 minutes to set up on a standard Linux machine. I ran it last night and was delighted with the results:

Test Run
Tested 139,740 urls
Completed in 2 hrs, 13 mins
3.6 GB of html
Average filesize: 25.05 KB

18.2 urls/second
1.572 million urls/day
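
The per-day figure is just the per-second rate projected over 24 hours. The sketch below redoes that arithmetic and, as a rough extra, estimates the payload bandwidth that rate implies at the average file size above. The helper names are made up for illustration, and the bandwidth estimate assumes 1 KB = 1,000 bytes and ignores HTTP headers, retries and TCP overhead, so treat it as a lower bound rather than a measurement.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60 # 86,400

# Project a sustained per-second crawl rate over a full day.
def urls_per_day(urls_per_second)
  (urls_per_second * SECONDS_PER_DAY).round
end

# Payload-only bandwidth in megabits per second.
# Assumes 1 KB = 1,000 bytes; protocol overhead is not counted.
def payload_mbps(urls_per_second, avg_kilobytes)
  urls_per_second * avg_kilobytes * 8 / 1000.0
end

puts urls_per_day(18.2) # prints 1572480, i.e. about 1.572 million
puts payload_mbps(18.2, 25.05).round(2)
```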

Hardware and Environment
3-year-old Dell PowerEdge SC240
Pentium 4
3.5 GB of RAM
Average CPU load: 0.16
Average physical RAM used: 950 MB
OS: Ubuntu 7.10 (Gutsy Gibbon)
Filesystem: ReiserFS 3

Network connection:
Residential cable modem, 5Mbps down (of which 100% is consumed when it’s running, so it’s likely to be faster on a fatter pipe)

Even better, this code is infinitely extensible. We’ll spread it across as many machines as necessary to download the entire internet.
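
The crawler itself is Python and isn't reproduced here. As a rough illustration of the multi-threaded design on a single machine, here is a minimal worker-pool sketch in Ruby. Everything in it is an assumption for illustration: the `crawl` helper, its parameters, and the stand-in fetcher are hypothetical and not taken from the real 192-line script.

```ruby
# Hypothetical helper: `fetcher` is any callable that takes a URL
# and returns the page body. Workers pull URLs from a shared queue.
def crawl(urls, worker_count:, fetcher:)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once closed and drained, pop returns nil

  results = {}
  lock = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      while (url = queue.pop)
        body = begin
          fetcher.call(url)
        rescue StandardError => e
          e # record the failure instead of killing the worker
        end
        # Guard the shared hash so concurrent writes stay safe
        lock.synchronize { results[url] = body }
      end
    end
  end

  workers.each(&:join)
  results
end

# Stand-in fetcher for demonstration only; a real one would make an HTTP request.
fake_fetcher = ->(url) { "<html>#{url}</html>" }
pages = crawl(%w[http://example.com/a http://example.com/b],
              worker_count: 4, fetcher: fake_fetcher)
puts pages.size # prints 2
```

Passing the fetcher in as a callable keeps the queue-and-threads logic separate from the network code, which makes the pool easy to try out without hitting real sites.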

Big SEO’s with Crawlers… what are your stats?

Ad Blockers can Ruin Your Legitimate Web App that Isn’t Even Serving Ads
Wed, 21 Nov 2007 19:17:34 +0000, by tony

Since rebranding some of our old classifieds sites and relaunching the system as a newly built Ruby on Rails app, we’ve received a handful of emails complaining about strange behavior that always involved links not appearing for the user.

How do you read the rest of the postings or see any pictures that were uploaded?!?! There are no links on the classifieds to keep reading them. Please help since I am new to the website.

At first I discounted this as user error. “These fools don’t know how to use the internets!” DELETE.

After getting several more of these I became concerned and managed to get a few users to send screenshots and HTML source. We were all stumped. The page was fully loaded except the links to the classified ads were missing. There were no errors in the logs.

Finally I posted the mystery to my fantastic local Ruby group and Chris Garrett (not the SEO one) had a fantastic suggestion:

I just came across something else in my hunting. It could be an
ad-blocking plugin. See if the users have some common plugin in their
browser that hides ads. Also, see if there is some pattern to the
links that are disappearing – e.g. some keyword or URL pattern.

Did a bit of Googling and sure enough Norton Internet Security takes a very heavy-handed approach to blocking ads on sites:

Ad Blocking maintains a list of more than 200 HTML strings that are associated with advertisements…….

For example, Ad Blocking prevents Web pages whose URL includes from being displayed because the URL includes the HTML string “AD.”


And our URLs are structured with the word “ad” in the URL:


Created a page to test the theory and asked the most recent user to check it. He validated the test and confirmed that he did indeed have Norton Internet Security installed and running on his machine.

So be careful when naming your URL routes and avoid the use of the phrase “ad” or “banner”! Much, MUCH thanks to Chris Garrett for thinking outside of the box and to Curt Rabon of Blue Lizard Technologies for pointing out the problem and allowing me to use him as a guinea pig.
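
One way to catch this before users do is to scan your route paths for risky words. The sketch below is only an illustration: the word list is an assumption (Norton's real list of 200+ strings isn't reproduced here), and `risky_segments` is a made-up helper, not part of Rails.

```ruby
# Assumed trigger words for illustration; not Norton's actual list.
RISKY_SEGMENTS = %w[ad ads banner banners].freeze

# Returns the path segments that exactly match a risky word (case-insensitive).
def risky_segments(path)
  path.split("/")
      .reject(&:empty?)
      .select { |segment| RISKY_SEGMENTS.include?(segment.downcase) }
end

puts risky_segments("/ad/123/red-bicycle").inspect # prints ["ad"]
puts risky_segments("/listings/123").inspect       # prints []
```

Matching whole segments keeps false alarms down (a path like /download is not flagged). A blocker that matches substrings would catch more, so a stricter check may be worth it if you want to be safe.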

Look out, Googlebot. We’re going to be serving a LOT of 301 redirects shortly.

New columns not immediately available in migrations
Wed, 04 Jul 2007 16:43:22 +0000, by tony

Sometimes you add a column to a table in a migration and then want to populate the new column with some data in the same migration. You run the migration and, while the column has been created in the database, your data does not populate. The problem is that ActiveRecord has cached the table’s column information from before the change, so the new column is not accessible through the model. You just need to tell it to reload:

add_column :users, :favorite_beer, :string
# Reload the cached column list so the model sees favorite_beer
User.reset_column_information
tony = User.find_by_name "Tony Spencer"
tony.favorite_beer = "Terrapin Rye Pale Ale"
tony.save!  # persist the new value
Lighthouse Bug Tracking Review
Thu, 21 Jun 2007 18:34:14 +0000, by tony

We’ve been using Basecamp for some time now to manage multiple projects and I have really enjoyed it except for the lack of integrated issue/bug tracking. I’ve tried hacking to-do lists and categorizing messages but I just can’t make Basecamp work for our issue tracking even though I don’t need fancy features. I just want to rapidly log/assign issues to team members, change status, and reassign back to me when the issue is completed.

For years I’ve been using Mantis and it works, but it’s quirky and rather slow to work with as the interface isn’t designed very well. There is also some stupid bug that makes it impossible for me to sort issues by different columns. I’ve just signed up for Lighthouse and here are a few pros and cons I’ve noticed immediately:

  • As a technical manager I like to be able to enter bugs/issues quickly without using the mouse. Basecamp to-do lists are very nice this way as I can quickly type, tab, and hit the space bar to enter an item and assign it to someone. Lighthouse’s create ticket feature forces me to pick up the mouse and click in several places, which slows things down. It would also be very nice if tickets were created with AJAX, as to-do items in Basecamp are, so I can very quickly fill up people’s queues. (Hey, my guys work fast so I have to enter bugs fast!)
  • It’s not very apparent which project I’m currently managing. Only the small drop down on the right lets me know. I wish Lighthouse would make the current project name more prominent like in Basecamp. Also it would be quicker to bounce around between projects if they were a list of links rather than a select list.
  • There is no issue tracking in Basecamp which is why I am giving this nice looking app a try. However, I would continue to use Basecamp for other aspects of the project. It would be great if they could drop in my URL to a project in Basecamp when I create the project in Lighthouse so it could provide me that link in the right nav so I could jump back there.
  • I like the ability to add an avatar to users in Lighthouse. Helps to make it easier to see who did what and gives it a personal touch.
  • The “feature updates” box is taking up too much of the real estate on every page and never goes away.
  • The top header is a little too big and is wasting space above the fold hindering me from seeing more without scrolling.
  • I like the ability to pay with a PayPal subscription, which got me up and running very quickly.
  • The ability to create a simple “Page” is nice. Currently we have a writeboard in one project in Basecamp where we keep all the info about our server setup, such as gems to install, cron jobs, where files exist, and how to deploy. The problem with that is I can’t share it with everyone without adding everyone to that project, and it really isn’t specific to that one project. Pages solves that in Lighthouse. I will now also add pages like coding best practices and Subversion how-tos.

I know I published a lot of negatives here but on the whole I’m liking this hosted app and would love to get away from stinking Mantis and managing my own bug tracking system. I’ll post more updates as we use it more.

Update to Lighthouse Issue Tracking

It looks like they removed the banner that was wasting space which is nice. However, one BIG problem I discovered:

I cannot use a “pre” tag to drop in HTML and not have it rendered by the browser which makes it very hard for me to show a designer or developer some html I want them to use.

Also I can now tab to the field where you select a user to assign a ticket to but I still cannot change that field without picking up the mouse and clicking on it.

Damn I wish there were a simple interface for entering bugs that looked something like this :)


PHP vs. Ruby on Rails
Sat, 09 Jun 2007 15:48:14 +0000, by tony

Again, not funny if you are stuck in PHP land trying to sync up with your team’s latest database changes:
