Tutorial: Spidering huge websites
By default, phpcrawl uses local memory (RAM) to internally cache and queue found URLs and other data. So when crawling large websites consisting of thousands of pages, the PHP memory limit (or the available memory in general) may be exceeded at some point.
Since version 0.8, however, phpcrawl can alternatively use a SQLite database-file for internally caching URLs. With this type of caching activated, spidering huge websites shouldn't be a problem anymore.
To activate the SQLite-cache, simply use the following setUrlCacheType()-setting:
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
To increase the performance of the SQLite-cache, you may set its location to a shared-memory device like "/dev/shm/" (on Debian/Ubuntu) by using the setWorkingDirectory()-method:
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
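
Putting it all together, a minimal crawler using the SQLite URL-cache could look like the following sketch. The include path, the start URL and the overridden handleDocumentInfo()-method are placeholders here, assuming a typical phpcrawl 0.8 setup.

<?php
// Minimal sketch: the include path and start URL below are placeholders.
require_once("libs/PHPCrawler.class.php");

class MyBigSiteCrawler extends PHPCrawler
{
  // Gets called for every document the crawler finds.
  function handleDocumentInfo($DocInfo)
  {
    echo $DocInfo->url." (HTTP ".$DocInfo->http_status_code.")\n";
  }
}

$crawler = new MyBigSiteCrawler();
$crawler->setURL("http://www.example.com/");

// Use the SQLite file-cache instead of RAM and place it on a
// shared-memory device for better performance.
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

$crawler->go();
?>

Note that setWorkingDirectory() should be called before the crawling process is started, since the SQLite cache-file gets created in that directory.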