PHPCrawl webcrawler library

Docs for version 0.8x

About PHPCrawl


PHPCrawl is a set of classes written in PHP for crawling (spidering) websites; in other words, a webcrawler library or crawler engine for PHP.

PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files and so on) to users of the library for further processing.

It provides several options to specify the behaviour of the crawler, such as URL and Content-Type filters, cookie handling, robots.txt handling, limiting options, multiprocessing and much more.

PHPCrawl is completely free open-source software and is licensed under the GNU GENERAL PUBLIC LICENSE v2.

To get a first impression of how to use the crawler, you may want to take a look at the quickstart guide or an example. A complete reference and documentation of all available options and methods can be found in the class-references section.

The current version of the PHPCrawl package and older releases can be downloaded from a SourceForge mirror.

Note to users of PHPCrawl version 0.7x or earlier: Although some method names and parameters have changed in version 0.8, it should be fully compatible with older versions of PHPCrawl.