Class: PHPCrawler

PHPCrawl main class



PHPCrawler() Initiates a new crawler.

Public Methods
Basic settings
getProcessReport Returns summarizing report-information about the crawling-process after it has finished.
go Starts the crawling process in single-process-mode.
goMultiProcessed Starts the crawler using multiple processes.
setFollowMode Sets the basic follow-mode of the crawler.
setHTTPProtocolVersion Sets the HTTP protocol version the crawler should use for requests.
setPort Sets the port to connect to for crawling the starting-url set in setURL().
setURL Sets the URL of the first page the crawler should crawl (root-page).
setUrlCacheType Defines what type of cache will be internally used for caching URLs.
setWorkingDirectory Sets the working-directory the crawler should use for storing temporary data.
addContentTypeReceiveRule Adds a rule to the list of rules that decide which pages or files - regarding their content-type - should be received.
addURLFilterRule Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.
addURLFollowRule Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly.
obeyNoFollowTags Decides whether the crawler should obey "nofollow"-tags.
obeyRobotsTxt Defines whether the crawler should parse and obey robots.txt-files.
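The filter and follow rules listed above take regular expressions. The following is a minimal, library-independent sketch of what such a pattern selects. It assumes that rules are PCRE patterns applied to the full URL in the way preg_match() would apply them; the pattern, the sample URLs and the helper name urlsPassingFilter() are hypothetical and not part of PHPCrawl.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns the URLs that a
// filter rule would NOT exclude, assuming the rule is a PCRE pattern
// matched against the full URL.
function urlsPassingFilter(array $urls, $filterRule)
{
    $kept = array();
    foreach ($urls as $url) {
        // preg_match() returns 1 on a match; matching URLs are dropped.
        if (preg_match($filterRule, $url) !== 1) {
            $kept[] = $url;
        }
    }
    return $kept;
}

// Illustrative pattern: ignore common image files, case-insensitively.
$filterRule = '#\.(jpg|jpeg|gif|png)$#i';

// Illustrative sample URLs.
$urls = array(
    'http://www.example.com/index.html',
    'http://www.example.com/logo.PNG',
    'http://www.example.com/photos/cat.jpg',
    'http://www.example.com/download.php?file=cat.jpg&inline=1',
);

$kept = urlsPassingFilter($urls, $filterRule);
foreach ($kept as $url) {
    echo $url, "\n";
}
// Prints the index.html URL and the download.php URL. The latter survives
// because the pattern is anchored to the end of the URL with "$".
```

A pattern like this would typically be handed to addURLFilterRule(); the same syntax applies to addURLFollowRule() and addContentTypeReceiveRule(), with the latter matching content-types instead of URLs.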
Overridable methods / User data-processing
handleDocumentInfo Override this method to get access to all information about a page or file the crawler found and received.
handleHeaderInfo Overridable method that will be called after the header of a document was received and BEFORE the content will be received.
initChildProcess Overridable method that will be called by every used child-process just before it starts the crawling-procedure.
setContentSizeLimit Sets the content-size-limit for content the crawler should receive from documents.
setCrawlingDepthLimit Sets the maximum crawling depth.
setRequestDelay Sets a delay for every HTTP-request the crawler executes.
setRequestLimit Sets a limit to the total number of requests the crawler should execute.
setTrafficLimit Sets a limit to the number of bytes the crawler should receive altogether during the crawling-process.
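As a rough sketch of how the methods listed so far fit together, the following script subclasses PHPCrawler, overrides handleDocumentInfo(), applies a few rules and limits, and prints the process report. The include path, the class name MyCrawler, the start URL and the limit values are assumptions chosen for illustration; adjust them to your installation.

```php
<?php
// Assumption: location of the PHPCrawl library relative to this script.
include("libs/PHPCrawler.class.php");

// Hypothetical subclass name; any name works.
class MyCrawler extends PHPCrawler
{
    // Called for every page or file the crawler found and requested.
    function handleDocumentInfo($DocInfo)
    {
        echo "Requested: " . $DocInfo->url
            . " (" . $DocInfo->http_status_code . ")\n";

        if ($DocInfo->received == true) {
            echo "  received " . $DocInfo->bytes_received . " bytes\n";
        }
    }
}

$crawler = new MyCrawler();

// Basic settings (placeholder URL).
$crawler->setURL("www.example.com");

// Filter settings: receive only HTML, ignore links to images.
$crawler->addContentTypeReceiveRule('#text/html#');
$crawler->addURLFilterRule('#\.(jpg|jpeg|gif|png)$#i');

// Limit settings (illustrative values).
$crawler->setRequestLimit(50);
$crawler->setTrafficLimit(1000 * 1024);

// Start crawling in single-process-mode.
$crawler->go();

// Summary after the process has finished.
$report = $crawler->getProcessReport();
echo "Links followed: " . $report->links_followed . "\n";
echo "Documents received: " . $report->files_received . "\n";
echo "Bytes received: " . $report->bytes_received . "\n";
echo "Process runtime: " . $report->process_runtime . " sec\n";
```

The crawl stops as soon as either limit is reached, so the smaller of the two effectively bounds the run.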
Linkfinding settings
addLinkSearchContentType Adds a rule to the list of rules that decide in what kind of documents the crawler should search for links (regarding their content-type).
enableAggressiveLinkSearch Enables or disables aggressive link-searching.
excludeLinkSearchDocumentSections Defines the sections of HTML-documents that will get ignored by the link-finding algorithm.
setLinkExtractionTags Sets the list of html-tags the crawler should search for links in.
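The link-finding settings above might be combined as in the following sketch. The include path, the start URL and the chosen values are assumptions for illustration only.

```php
<?php
// Assumption: location of the PHPCrawl library relative to this script.
include("libs/PHPCrawler.class.php");

$crawler = new PHPCrawler();
$crawler->setURL("www.example.com"); // placeholder URL

// Also search stylesheets for links, in addition to HTML documents.
$crawler->addLinkSearchContentType('#text/css#i');

// Only extract links from "href" and "src" attributes.
$crawler->setLinkExtractionTags(array("href", "src"));

// Disable aggressive link-searching, so that only links found in the
// tags set above are taken into account.
$crawler->enableAggressiveLinkSearch(false);

$crawler->go();
```

Restricting the extraction tags and disabling the aggressive search should reduce the number of false or unwanted links, at the cost of possibly missing links that only appear inside scripts or other unusual places.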
Process resumption
enableResumption Prepares the crawler for process-resumption.
getCrawlerId Returns the unique ID of the instance of the crawler.
resume Resumes the crawling-process with the given crawler-ID.
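The three resumption methods are meant to be used together. The sketch below stores the crawler-ID on the first run and passes it to resume() when the script is started again after an abort. The include path, the start URL and the name and location of the ID file are assumptions; where and how the ID is stored is up to the application.

```php
<?php
// Assumption: location of the PHPCrawl library relative to this script.
include("libs/PHPCrawler.class.php");

// Hypothetical file used to remember the crawler-ID between runs.
$idFile = sys_get_temp_dir() . "/mycrawler_id.tmp";

$crawler = new PHPCrawler();
$crawler->setURL("www.example.com"); // placeholder URL

// Has to be called before the process is started or resumed.
$crawler->enableResumption();

if (!file_exists($idFile)) {
    // First start: remember the crawler-ID for a later resumption.
    file_put_contents($idFile, $crawler->getCrawlerId());
} else {
    // Restart after an abort: continue the earlier process.
    $crawler->resume(file_get_contents($idFile));
}

// goMultiProcessed() could be used here instead of go().
$crawler->go();

// The process finished completely, so the stored ID is no longer needed.
unlink($idFile);
```

Because the ID file is only deleted after go() returns, an interrupted run leaves it in place and the next start takes the resume() branch.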
Other settings
addBasicAuthentication Adds a basic-authentication (username and password) to the list of basic authentications that will be sent with requests.
addLinkPriority Adds a regular expression together with a priority-level to the list of rules that decide what links should be preferred.
addPostData Adds post-data together with a URL-rule to the list of post-data to send with requests.
addStreamToFileContentType Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file.
enableCookieHandling Enables or disables cookie-handling.
requestGzipContent Enables support/requests for gzip-encoded content.
setConnectionTimeout Sets the timeout in seconds for connection tries to hosting webservers.
setFollowRedirects Defines whether the crawler should follow redirects sent with headers by a webserver or not.
setFollowRedirectsTillContent Defines whether the crawler should follow HTTP-redirects until first content was found, regardless of defined filter-rules and follow-modes.
setProxy Assigns a proxy-server the crawler should use for all HTTP-Requests.
setStreamTimeout Sets the timeout in seconds for waiting for data on an established server-connection.
setUserAgentString Sets the "User-Agent" identification-string that will be sent with HTTP-requests.
addFollowMatch Alias for addURLFollowRule(). (deprecated!)
addLinkExtractionTags Sets the list of html-tags from which links should be extracted. (deprecated!)
addNonFollowMatch Alias for addURLFilterRule(). (deprecated!)
addReceiveContentType Alias for addContentTypeReceiveRule(). (deprecated!)
addReceiveToMemoryMatch Has no function anymore! (deprecated!)
addReceiveToTmpFileMatch Alias for addStreamToFileContentType(). (deprecated!)
disableExtendedLinkInfo Has no function anymore. (deprecated!)
getReport Returns an array with summarizing report-information after the crawling-process has finished. (deprecated!)
setAggressiveLinkExtraction Alias for enableAggressiveLinkSearch(). (deprecated!)
setCookieHandling Alias for enableCookieHandling(). (deprecated!)
setPageLimit Alias for setRequestLimit() method. (deprecated!)
setTmpFile Has no function anymore. (deprecated!)

Public Properties