Using multiple processes
Since version 0.8, phpcrawl is able to use multiple processes to spider a website. In most cases, using several processes simultaneously will speed up the crawling procedure dramatically.
In order to start phpcrawl in multi-process mode, simply call the goMultiProcessed()-method instead of the go()-method to start the crawler.
$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");
$crawler->addContentTypeReceiveRule("#text/html#");
// ...
// Start crawling by using 5 processes
$crawler->goMultiProcessed(5);
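Since multi-process mode has extra requirements (see the notes below), a small pre-flight check before calling goMultiProcessed() can produce a clearer error than a failed start. This is only a sketch: the function name is ours, and the extension names pcntl, sysvsem and pdo correspond to the PCNTL-, SEMAPHORE- and PDO-extensions named in the requirements.

```php
<?php
// Sketch of a pre-flight check before calling goMultiProcessed().
// getMissingMultiprocessExtensions() is a hypothetical helper, not part of phpcrawl.
function getMissingMultiprocessExtensions()
{
    // Extension names as reported by extension_loaded()
    $required = array("pcntl", "sysvsem", "pdo");
    $missing = array();
    foreach ($required as $ext) {
        if (!extension_loaded($ext)) {
            $missing[] = $ext;
        }
    }
    return $missing;
}

// Multi-process mode also requires the PHP CLI on a unix/linux-based system.
$environment_ok = (PHP_SAPI === "cli") && (DIRECTORY_SEPARATOR === "/");

$missing = getMissingMultiprocessExtensions();
if (!$environment_ok || count($missing) > 0) {
    echo "Multi-process mode unavailable";
    if (count($missing) > 0) {
        echo " (missing extensions: " . implode(", ", $missing) . ")";
    }
    echo "\n";
}
```

If the check fails, the script can still fall back to the single-process go()-method.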
- Some PHP-extensions are required to successfully run phpcrawl in multi-process mode (the PCNTL-, SEMAPHORE- and PDO-extensions). For more details see the requirements page.
- The multi-process mode only works on unix/linux-based systems.
- Scripts using phpcrawl with multiple processes have to be run from the commandline (PHP CLI).
- Increasing the number of processes to very high values doesn't automatically mean that the crawling process will finish faster!
The ideal number of processes depends on many circumstances, such as the available bandwidth, the local technical environment (CPU),
the delivery rate and data rate of the server hosting the target website, and so on.
A value between 3 and 10 processes is usually a good starting point.
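One way to pick a starting value within that range is to base it on the number of CPU cores and clamp it to 3-10. This is only a heuristic sketch, not part of phpcrawl; reading /proc/cpuinfo to count cores is our assumption and only works on linux (which multi-process mode requires anyway).

```php
<?php
// Heuristic sketch (not a phpcrawl API): suggest a process count in the
// 3-10 range based on the number of CPU cores visible to the system.
function suggestProcessCount()
{
    $cores = 4; // fallback if the core count can't be determined
    if (is_readable("/proc/cpuinfo")) {
        // Each core appears as one "processor : N" line on linux
        $cores = substr_count(file_get_contents("/proc/cpuinfo"), "processor");
        if ($cores < 1) {
            $cores = 4;
        }
    }
    // Clamp to the 3-10 range suggested above
    return max(3, min(10, $cores));
}

// $crawler->goMultiProcessed(suggestProcessCount());
```

From there, measure the actual crawl time and adjust the value up or down for your bandwidth and the target server's response behaviour.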