PHPCrawl webcrawler library/framework

Tutorial: Spidering huge websites


By default, phpcrawl uses local memory (RAM) to internally cache/queue found URLs and other data. So when crawling large websites consisting of thousands of pages, the PHP memory limit (or the available memory in general) may be exceeded at some point.

But since version 0.8, phpcrawl is alternatively able to use a SQLite database-file for internally caching URLs. With this type of caching activated, spidering huge websites shouldn't be a problem anymore.

To activate the SQLite-cache, simply use the following setUrlCacheType()-setting:

$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

By default, the SQLite-database-file will be placed in the system's default temporary directory on the local harddrive.
To increase the performance of the SQLite-cache, you may set its location to a shared-memory device like "/dev/shm/" (on Debian/Ubuntu) by using the setWorkingDirectory()-method.

$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

Please note that the PHP PDO-extension together with the SQLite-driver (PDO_SQLITE) has to be installed and activated to use this type of caching.
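For orientation, here is a minimal sketch that puts these settings together, following the usual PHPCrawl pattern of extending the PHPCrawler class and overriding handleDocumentInfo(). The include path, start URL and content-type rule are placeholders that will differ in your setup:

<?php
// Minimal sketch: crawling a large site with the SQLite URL-cache enabled.
// The include path and start URL below are placeholders for your own setup.
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  // Gets called for every document the crawler finds and receives
  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    echo $DocInfo->url." (HTTP ".$DocInfo->http_status_code.")\n";
  }
}

// The SQLite-cache requires the PDO_SQLITE driver (see note above)
if (!extension_loaded("pdo_sqlite"))
  die("PDO_SQLITE is not available on this system.\n");

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");
$crawler->addContentTypeReceiveRule("#text/html#");

// Cache URLs in a SQLite-file on a shared-memory device instead of in RAM
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

$crawler->go();
?>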