Installation & Quickstart
The following steps show how to use phpcrawl:
- Unpack the phpcrawl-package somewhere. That's all you have to do for installation.
- Include the phpcrawl-mainclass in your script or project. It's located in the "libs"-path of the package.
include("libs/PHPCrawler.class.php");
There are no other includes needed.
- Extend the phpcrawler-class and override the handleDocumentInfo-method with your own code to process the information of every document the crawler finds on its way.
For a list of all available information about a page or file within the handleDocumentInfo-method see the PHPCrawlerDocumentInfo-reference.
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
    {
        // Your code comes here!
        // Do something with the $PageInfo-object that
        // contains all information about the currently
        // received document.
        // As an example, we just print out the URL of the document
        echo $PageInfo->url."\n";
    }
}
Note to users of phpcrawl 0.7x or earlier: The old, overridable method "handlePageData()", which receives the document-information as an array, is still present and still gets called. PHPcrawl 0.8 is fully compatible with scripts written for earlier versions.
- Create an instance of that class in your script or project, define the behaviour of the crawler and start the crawling-process.
For a list of all available setup-options/methods of the crawler take a look at the PHPCrawler-classreference.
$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");
$crawler->addContentTypeReceiveRule("#text/html#");
// ...
$crawler->go();
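The argument passed to addContentTypeReceiveRule() above, "#text/html#", is written like a PCRE pattern with "#" as the delimiter. As a rough sketch of what such a pattern accepts, assuming the crawler applies it to a document's Content-Type value with an ordinary regular-expression match, the following standalone PHP uses only the built-in preg_match(). The helper name matchesReceiveRule() and the sample header values are made up for illustration and are not part of phpcrawl.

```php
<?php
// Hypothetical helper, NOT part of phpcrawl: checks a Content-Type value
// against a rule written in the same "#...#" pattern style used above.
function matchesReceiveRule(string $rule, string $contentType): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($rule, $contentType) === 1;
}

$rule = "#text/html#";

// Sample Content-Type values (assumptions made up for this sketch).
$samples = array(
    "text/html",
    "text/html; charset=UTF-8",
    "application/pdf",
    "image/png",
);

foreach ($samples as $contentType) {
    $verdict = matchesReceiveRule($rule, $contentType) ? "receive" : "skip";
    echo $contentType . " => " . $verdict . "\n";
}
```

Because the pattern is not anchored, it also matches values that carry a charset suffix. A stricter rule would need explicit anchors such as ^ and $.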