Using multiple processes
Since version 0.8, phpcrawl is able to use multiple processes to spider a website. In most cases, using several processes simultaneously will speed up the crawling procedure dramatically.
In order to start phpcrawl in multi-process mode, simply call the goMultiProcessed()-method instead of the go()-method to start the crawler.
$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");
$crawler->addContentTypeReceiveRule("#text/html#");
// ...
// Start crawling by using 5 processes
$crawler->goMultiProcessed(5);
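Since multi-process mode has extra requirements (see the notes below), a small pre-flight check before calling goMultiProcessed() can produce a clearer error than a failed start. This is only a sketch: the function name is ours, and the extension names pcntl, sysvsem and pdo correspond to the PCNTL-, SEMAPHORE- and PDO-extensions named in the requirements.

```php
<?php
// Sketch of a pre-flight check before calling goMultiProcessed().
// getMissingMultiprocessExtensions() is a hypothetical helper, not part of phpcrawl.
function getMissingMultiprocessExtensions()
{
    // Extension names as reported by extension_loaded()
    $required = array("pcntl", "sysvsem", "pdo");
    $missing = array();
    foreach ($required as $ext) {
        if (!extension_loaded($ext)) {
            $missing[] = $ext;
        }
    }
    return $missing;
}

// Multi-process mode also requires the PHP CLI on a unix/linux-based system.
$environment_ok = (PHP_SAPI === "cli") && (DIRECTORY_SEPARATOR === "/");

$missing = getMissingMultiprocessExtensions();
if (!$environment_ok || count($missing) > 0) {
    echo "Multi-process mode unavailable";
    if (count($missing) > 0) {
        echo " (missing extensions: " . implode(", ", $missing) . ")";
    }
    echo "\n";
}
```

If the check fails, the script can still fall back to the single-process go()-method.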
- Some PHP-extensions are required to successfully run phpcrawl in multi-process mode (the PCNTL-, SEMAPHORE- and PDO-extensions). For more details see the requirements page.
- The multi-process mode only works on unix/linux-based systems.
- Scripts using phpcrawl with multiple processes have to be run from the commandline (PHP CLI).
- Increasing the number of processes to very high values doesn't automatically mean that the crawling process will finish faster!
The ideal number of processes depends on many circumstances, such as the available bandwidth, the local technical environment (CPU),
the delivery rate and data rate of the server hosting the target website, and so on.
A value between 3 and 10 processes is usually a good starting point.
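One way to pick a starting value within that range is to base it on the number of CPU cores and clamp it to 3-10. This is only a heuristic sketch, not part of phpcrawl; reading /proc/cpuinfo to count cores is our assumption and only works on linux (which multi-process mode requires anyway).

```php
<?php
// Heuristic sketch (not a phpcrawl API): suggest a process count in the
// 3-10 range based on the number of CPU cores visible to the system.
function suggestProcessCount()
{
    $cores = 4; // fallback if the core count can't be determined
    if (is_readable("/proc/cpuinfo")) {
        // Each core appears as one "processor : N" line on linux
        $cores = substr_count(file_get_contents("/proc/cpuinfo"), "processor");
        if ($cores < 1) {
            $cores = 4;
        }
    }
    // Clamp to the 3-10 range suggested above
    return max(3, min(10, $cores));
}

// $crawler->goMultiProcessed(suggestProcessCount());
```

From there, measure the actual crawl time and adjust the value up or down for your bandwidth and the target server's response behaviour.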