PHPCrawl webcrawler library/framework

Using multiple processes


Since version 0.8, phpcrawl is able to use multiple processes to spider a website. In most cases, running several processes simultaneously will speed up the crawling process dramatically.

In order to run phpcrawl in multi-process mode, simply start the crawler by calling the goMultiProcessed()-method instead of the go()-method.

$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");
$crawler->addContentTypeReceiveRule("#text/html#");

// ...

// Start crawling by using 5 processes
$crawler->goMultiProcessed(5);

However, there are some things to consider when using multi-process mode:
  • Some PHP extensions are required to successfully run phpcrawl in multi-process mode (the PCNTL, SEMAPHORE and PDO extensions). For more details see the requirements page. A simple way to verify these prerequisites is shown in the sketch after this list.
  • The multi-process mode only works on Unix/Linux-based systems.
  • Scripts using phpcrawl with multiple processes have to be run from the command line (PHP CLI).
  • Increasing the number of processes to very high values doesn't automatically mean that the crawling process will finish faster! The ideal number of processes depends on many factors, such as the available bandwidth, the local technical environment (CPU), and the delivery rate and data rate of the server hosting the target website.
    Using something between 3 and 10 processes is a good starting point.
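
Since a missing extension or the wrong SAPI only becomes apparent at runtime, it can be useful to check these prerequisites before starting the crawler. The following sketch does this with plain PHP functions; note that the extension identifiers "pcntl", "sysvsem" (the SEMAPHORE extension) and "pdo" are assumptions about the common registered names of these extensions and are not part of the phpcrawl API.

// Sketch: verify the prerequisites for multi-process mode before starting.
// The extension names "pcntl", "sysvsem" and "pdo" are assumed identifiers
// for the PCNTL-, SEMAPHORE- and PDO-extensions.

// Multi-process mode requires the PHP command line interface (CLI)
if (php_sapi_name() !== "cli")
{
  die("Please run this script from the command line (PHP CLI).\n");
}

// Make sure all required extensions are loaded
foreach (array("pcntl", "sysvsem", "pdo") as $extension)
{
  if (!extension_loaded($extension))
  {
    die("Required PHP extension missing: ".$extension."\n");
  }
}

// All prerequisites are met, start crawling by using 5 processes
$crawler->goMultiProcessed(5);

This reuses the $crawler instance from the example above.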