Description:
When using this method instead of the go()-method to start the crawler, phpcrawl will use the given
number of processes simultaneously for spidering the target-url.
Using multiple processes will speed up the crawling-process dramatically in most cases.
There are some requirements though to successfully run the crawler in multi-process mode:
- The multi-process mode only works on unix-based systems (linux)
- Scripts using the crawler have to be run from the commandline (cli)
- The PCNTL-extension for php (process control) has to be installed and activated.
- The SEMAPHORE-extension for php has to be installed and activated.
- The POSIX-extension for php has to be installed and activated.
- The PDO-extension together with the SQLite-driver (PDO_SQLITE) has to be installed and activated.
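Whether the environment meets these requirements can be checked from a CLI-script as sketched below. This is just a sketch; it assumes the standard extension-names pcntl, sysvsem (semaphore), posix and pdo_sqlite.
<?php
// Sketch: verify the multi-process requirements before starting the crawler.
if (php_sapi_name() != "cli")
  die("Please run this script from the commandline (cli).\n");

foreach (array("pcntl", "sysvsem", "posix", "pdo_sqlite") as $ext)
{
  if (!extension_loaded($ext))
    die("Required PHP-extension '".$ext."' is not installed/activated.\n");
}
?>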
PHPCrawl supports two different modes of multiprocessing:
- PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE
The crawler uses multiple processes simultaneously for spidering the target URL, but the usercode provided to
the overridable function handleDocumentInfo() always gets executed on the same main-process. This
means that the usercode never gets executed simultaneously, so you don't have to care about
concurrent file/database/handle-accesses or similar things.
On the other hand the usercode may slow down the crawling-procedure, because every child-process has to
wait until the usercode has been executed on the main-process. This is the recommended multiprocess-mode!
- PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE
The crawler uses multiple processes simultaneously for spidering the target URL, and every child-process executes
the usercode provided to the overridable function handleDocumentInfo() directly from its own process. This
means that the usercode gets executed simultaneously by the different child-processes, so you should
take proper care of concurrent file/database/handle-accesses (if used).
When using this mode and you use any handles like database-connections or filestreams in your extended
crawler-class, you should open them within the overridden method initChildProcess() instead of opening
them from the constructor. For more details see the documentation of the initChildProcess()-method and the sketch below.
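Sketch for the MPMODE_CHILDS_EXECUTES_USERCODE-mode: every child-process opens its own database-handle inside initChildProcess(). The class-name MyCrawler, the SQLite-file results.db3 and the table "pages" are just assumptions made for this example.
<?php
class MyCrawler extends PHPCrawler
{
  protected $db;

  // Gets called within every child-process before it starts crawling,
  // so every child gets its own database-handle (don't open it in the constructor).
  function initChildProcess()
  {
    $this->db = new PDO("sqlite:results.db3"); // example file, table "pages" assumed to exist
  }

  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    // Runs directly within the child-processes in this mode, so only
    // per-process handles like $this->db should be used here.
    $stmt = $this->db->prepare("INSERT INTO pages (url, http_status) VALUES (?, ?)");
    $stmt->execute(array($DocInfo->url, $DocInfo->http_status_code));
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com");
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);
?>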
Example for starting the crawler with 5 processes using the recommended MPMODE_PARENT_EXECUTES_USERCODE-mode:
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
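A more complete sketch using this recommended mode (the class-name MyCrawler and the example-URL are assumptions):
<?php
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    // In MPMODE_PARENT_EXECUTES_USERCODE this code always runs on the main-process,
    // so writing to a single output-stream (or file/database) here is safe.
    echo $DocInfo->url." (HTTP ".$DocInfo->http_status_code.")\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com");
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
?>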
Please note that increasing the number of processes to high values doesn't automatically mean that the crawling-process
will get faster! Using 3 to 5 processes is a good value to start with.