How-To (and How-Not-To) on Web Scraping

A friend of mine who shall remain nameless pointed a post out to me on the PHP DZone web site recently. Noting that the article's content was misinformed at best and downright ignorant at worst, even when judged purely on the author's knowledge of PHP as a language, this friend asked that I set the author straight.

I gladly obliged with a comment on the post, having become something of an authority on the application topic myself. As unorthodox a practice as web scraping may be, some methodologies for it are clearly better than others. The aforementioned post illustrates many of the ones to avoid, and my comment lays out the arguments against them.

Later, I randomly encountered a post on the blog at xml.lt on the topic of web scraping using the DOM extension. This article showcases recommended practices and reasoned arguments against bad (and unfortunately common) alternatives. The author comes across as being significantly more informed on both the language and the application in the article's content and code examples.
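For readers who have not used the DOM extension for this purpose, here is a minimal sketch of the general approach. It is not taken from either article; the sample markup, the extractLinks() function name, and the XPath query are all assumptions made up for illustration.

```php
<?php
// Hypothetical markup standing in for a page fetched from a remote site.
$html = <<<HTML
<html><body>
  <ul id="posts">
    <li><a href="/a">First post</a></li>
    <li><a href="/b">Second post</a></li>
  </ul>
</body></html>
HTML;

/**
 * Return an array mapping each matched link's href to its text.
 * The function name and signature are illustrative only.
 */
function extractLinks(string $html, string $query): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so buffer libxml's parse
    // warnings instead of letting them be emitted, then restore the
    // previous error-handling setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Query the parsed tree with XPath rather than pattern-matching
    // the raw markup.
    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $links[$anchor->getAttribute('href')] = trim($anchor->textContent);
    }
    return $links;
}

print_r(extractLinks($html, '//ul[@id="posts"]/li/a'));
```

Because the query runs against a parsed tree, it tends to keep working when attribute order or whitespace in the markup changes, which is one of the usual arguments for this approach over string matching.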

If you're looking for references on the topic of web scraping with PHP, there's always the article I wrote for the December 2007 issue of php|architect magazine, of which you can still purchase an electronic copy in PDF format. At some point, I also hope to write a short book on the subject. Until then, if you have related questions, you can generally reach me in the #phpc channel on Freenode, under the nick Elazar. I'm always glad to give out advice on web scraping and PHP, as I'm sure my good friend Jared Folkins (who is also my "Little Sis" from the PHPWomen Big Sis/Little Sis mentoring program) will attest.

Book Review: PHP Web 2.0 Mashup Projects

You can find this review in podcast form on the Zend Developer Zone PHP Abstract Podcast.

I received an e-mail recently from a very nice gentleman at Packt Publishing, a UK-based publishing company focused on providing hands-on application-oriented publications to IT professionals, particularly those specific to open source technologies. Their representative asked if I would be willing to review one of their books, namely PHP Web 2.0 Mashup Projects by Shu-Wai Chow. Reviewing books is not something I had done before, so I thought I would give it a good old-fashioned college try.

In a supersaturated market, it is difficult to make an impression with a PHP book these days. The books of real value are those that focus on ways to apply the language to real world problems. These books delve into the depths of a particular application domain, showing PHP code and outlining design principles along the way. They are useful to current and prospective PHP programmers alike because they introduce both groups not to PHP itself, but to an existing class of problems and how PHP can be applied to solve them. PHP Web 2.0 Mashup Projects is one of these books.

Most technology-related books on the shelves are several inches thick and an inherently daunting chore to sift through. Luckily, this book is not one of those. Do not let the size fool you, though; it is positively packed with useful information. It hits the high points of each topic it covers, giving you enough in the way of code samples and step-by-step explanations to get started, as well as resources to help you get better acquainted with topics that might be of particular interest to you.

The book is divided into six chapters, each of which covers a set of particular protocols, data formats, and APIs for acquiring and processing data in order to create a particular mashup application. These projects include:

  • A search engine to find products on Amazon by their Universal Product Code
  • A search engine to combine results from MSN and Yahoo!
  • A video jukebox that pulls songs from Last.fm and videos from YouTube
  • A traffic incident reporting application that sends SMS alerts
  • An illustrated tube station line map using Google Maps and Flickr for related photos

The book's structure and layout make it easy to follow, whether you prefer to read it linearly or jump around to specific sections. It is an excellent reference that I can see myself returning to time and time again.

One of the strengths of the book is that it has a very wide base of coverage. It starts by introducing basics in interacting with web services and extracting the desired data from their responses using core PHP libraries. The REST, XML-RPC, and SOAP protocols and the WSDL standard are all covered in enough depth to get you started, so you can work with a web service regardless of the protocol or protocols it offers. The author does an excellent job of selecting example web services and data standards from large and well-known to small and obscure. For real world APIs, you will find the likes of Amazon, YouTube, Google, and Flickr, as well as sources that might not be household names, such as the Internet UPC Database. Data standards include general formats like XML, RDF, and JSON and more specialized formats like RSS and XSPF.

Another strength is that the book encourages good principles from the start. It advocates object-oriented design principles for code reuse and a DRY philosophy. It suggests using third-party libraries such as those in PEAR in order to avoid unnecessary reinvention of the wheel, but still shows you how to roll your own if and when it becomes necessary. The book also covers usability, particularly in the last chapter when it discusses AJAX and race conditions, and pays special attention to application security, an area of increasing concern in web applications. Unlike some books, this one includes tips for development outside its own showcased projects to save you from having to spend your own time troubleshooting common issues or digging for solutions to "gotcha" situations.

And last but certainly not least, the book demonstrates that sometimes you have to be resourceful in locating and acquiring your data, particularly in Chapter 5 where one of my own areas of interest, web scraping, is covered. The topic is explained in plain language and supplemented with examples walking you through exactly how it can be used to acquire data for your own mashups. Web scraping is not a frequently broached topic and I applaud the author for making a point to include it. I believe it is a genuinely useful methodology that can help in data acquisition when no other options are available.

I cannot give the book an entirely glowing review, though. There are some errata present, both in content and code samples. Most are small, but some are enough to throw off a reader not already familiar with the material being covered. I've submitted some of these via the publisher's web site already, though as I write this review I have yet to receive any related communications or see them show up on the web site. These issues can be corrected, though, and the quality of the book's content outshines them.

Overall, PHP Web 2.0 Mashup Projects is an excellent example of creativity in finding new ways to aggregate data sets in useful combinations. It is a testament to the possibilities of the internet when access to data is opened up and freedom to use that data enables developers to create exciting and inspiring new solutions. Mashups show the internet's potential increasing by leaps and bounds, and this book can get you on your way to contributing to their future development.

Web Scraping Article Published

Just a quick post to announce (albeit a little late) the December 2007 issue of php|architect, which includes my article on web scraping. Please buy a copy, give it a read, and feel free to post comments on the forum thread for the article. I'd love to hear some reader feedback!

You may have noticed that I've added a new page for publications. This will become the home for any content I produce that gains any sort of recognition, be it a podcast, article, book review, presentation slides, or what have you. Anytime anything new goes there, I'll try to make a point to write a post about it.

Article for php|architect

One of the things that has kept me away from my blog for the past few weeks is an article I've been working on for php|architect magazine. It should be included in the December 2007 issue and is entitled "Web Scraping." So, if the topic interests you, keep an eye out for it. If you aren't sure whether the topic interests you, you can check out my episode on the Zend Developer Zone PHP Abstract podcast for a brief high-level description. I'll probably post about this again once the issue comes out, but I thought I'd give a heads up to anyone out there who buys the magazine on an issue-by-issue basis.

PHP Abstract Episode 22: Screen Scraping

Check out the latest PHP Abstract podcast (episode 22) from Dev Zone. I'm the guest speaker! The podcast is on web scraping, a practice in which I have (unfortunately) become somewhat proficient. Leave a comment on Dev Zone or on this entry and let me know what you think!
