The power of Open Source and the Amazon cloud

I have been a big fan of the Amazon S3/EC2 services for some time now. Every start-up I work with, or where friends of mine work, uses these services to offload the expensive and time-consuming work of building its own data centre. Those are all nice, easy-to-understand scenarios: offload the hosting and storage of products for people to download, which makes distributing your builds and other artifacts really simple.

What would you do if you had to make over 11 million NY Times articles from 1851-1980 available as PDFs and, to make things interesting, the really old material existed only as image data, meaning you had to generate the PDFs as well? Well, it seems Derek Gottfrid from the NY Times has managed to figure this out, and he has documented his exploits for us to learn from.

It's an incredible story about taking huge risks and the payoff at the end. Of particular interest is the use of Open Source APIs and technologies like Hadoop, JetS3t, and Xen. Only the agility gained from Open Source tools made all of this possible in the short time it seems to have taken Derek to pull it off. Can you imagine if he had contacted IBM for help on this project? They would have parachuted in a team of 10 'top engineers', spent weeks defining the project and then prototyping, and budgeted $1 million for IBM servers and another $1 million for IBM services to implement it. I'm sure it would have been on time, too ;-)

Seriously though, if you are skilled at what you do and are not afraid to empower yourself with open source tools and APIs, the results can be dramatic - as the NY Times Open Source blog shows.

A follow-up post from Derek can be found here, where he covers TimesMachine, the NY Times service built on top of this deployment, which lets you browse the archives using a calendar-like navigator. You can also watch a video of Derek discussing how they worked through all of this.

