Parsing websites with curl and phpQuery

A while ago I had to crawl some websites to gather information about products. In the past I’ve used RegExp to parse the HTML, knowing it’s not the best method, but I just felt that PHP’s DOMDocument was clumsy.

I started coding the crawler with CakePHP 2.5.x and the following libraries: electrolinux/phpquery and php-curl-class/php-curl-class.

php-curl-class is pretty straightforward: it’s a wrapper that makes curl easier to work with. phpQuery is a library that lets you use CSS3 selectors just like you do with jQuery.

I know it’s lame, but as an example let’s get the title of SaveWalterWhite.

<?php
$curl = new \Curl\Curl();
$curl->get("http://www.savewalterwhite.com");
$pq = phpQuery::newDocument($curl->response);
echo $pq->find('title')->text();
?>
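
If the request fails there is nothing useful to parse, so it can be worth checking for an error first. Here is a minimal sketch of that, assuming both libraries were installed with Composer (the `vendor/autoload.php` path is an assumption) and using a hypothetical helper called `pageTitle` that is not part of either library:

```php
<?php
require 'vendor/autoload.php'; // assumption: both libraries were installed with Composer

// Hypothetical helper: return the trimmed <title> text of an HTML string.
function pageTitle($html)
{
    return trim(phpQuery::newDocument($html)->find('title')->text());
}

$curl = new \Curl\Curl();
$curl->get('http://www.savewalterwhite.com');

if ($curl->error) {
    // The request failed, so there is no HTML worth parsing.
    echo "Request failed\n";
} else {
    echo pageTitle($curl->response), "\n";
}

$curl->close();
```

Keeping the parsing in its own function also makes it easy to try a selector against a saved HTML string without hitting the site again.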

Obviously you can do more complex stuff, like getting all the image paths that are inside list items of the #walter-container div.

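<?php
$curl = new \Curl\Curl();
$curl->get("http://www.savewalterwhite.com");
$pq = phpQuery::newDocument($curl->response);

// attr() only returns the value for the first matched element,
// so loop over the matches to collect every image path.
$pics = array();
foreach ($pq->find('div#walter-container li img') as $img) {
    $pics[] = pq($img)->attr('src');
}
if (!empty($pics)) { var_dump($pics); }
?>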

You can also build the selector inside a loop, like this:

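<?php
// $pq is a phpQuery document for a product page;
// $limit is the number of list items to check.
$images = array();
for ($i = 1; $i <= $limit; $i++)
{
    $pic = $pq->find('div#product-detail ul li:nth-child('.$i.') a')->attr('data-image-zoom');
    if (!empty($pic))
    {
        $images[] = $pic;
    }
}
?>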
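
The loop above needs a value for `$limit`. One way to get it is to count the list items first. The sketch below tries that on a made-up HTML string so it runs without any network access. The markup, the image paths and the `vendor/autoload.php` path are all assumptions for illustration, and it needs a PHP version that phpQuery still supports.

```php
<?php
require 'vendor/autoload.php'; // assumption: phpQuery was installed with Composer

// Made-up markup standing in for a product page.
$html = '<div id="product-detail"><ul>'
      . '<li><a data-image-zoom="/img/a-large.jpg">A</a></li>'
      . '<li><a>No zoom image</a></li>'
      . '<li><a data-image-zoom="/img/c-large.jpg">C</a></li>'
      . '</ul></div>';

$pq = phpQuery::newDocument($html);

// Use the number of list items as the loop limit.
$limit = count($pq->find('div#product-detail ul li'));

$images = array();
for ($i = 1; $i <= $limit; $i++) {
    $zoom = $pq->find('div#product-detail ul li:nth-child(' . $i . ') a')->attr('data-image-zoom');
    // Links without the attribute come back empty and are skipped.
    if (!empty($zoom)) {
        $images[] = $zoom;
    }
}

print_r($images);
```

Counting the items keeps the loop from running past the end of the list when pages have a different number of thumbnails.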

Check out the phpQuery manual for further information. This library is handy and saved me a lot of time.