Public Methods
Basic settings

| Method | Description |
|---|---|
| getProcessReport | Returns a summarizing report about the crawling process after it has finished. |
| go | Starts the crawling process in single-process mode. |
| goMultiProcessed | Starts the crawler using multiple processes. |
| setFollowMode | Sets the basic follow mode of the crawler. |
| setHTTPProtocolVersion | Sets the HTTP protocol version the crawler should use for requests. |
| setPort | Sets the port to connect to when crawling the starting URL set in setURL(). |
| setURL | Sets the URL of the first page the crawler should crawl (root page). |
| setUrlCacheType | Defines what type of cache will be used internally for caching URLs. |
| setWorkingDirectory | Sets the working directory the crawler should use for storing temporary data. |
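A minimal usage sketch of the basic settings above, assuming the library's main class file is included from a local path; the include path, subclass name, and URL are placeholder assumptions:

```php
<?php
// Include the PHPCrawl main class (path is an assumption; adjust to your setup).
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  // See "Overridable methods" below for details on this hook.
  function handleDocumentInfo($DocInfo)
  {
    echo "Received: ".$DocInfo->url."\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");    // root page to start from
$crawler->setFollowMode(2);             // 2 = stay on the same host
$crawler->setWorkingDirectory("/tmp/"); // where temporary data gets stored

$crawler->go(); // single-process mode; goMultiProcessed(5) would use 5 processes

// Summarizing report after the process has finished
$report = $crawler->getProcessReport();
echo "Documents received: ".$report->files_received."\n";
echo "Bytes received:     ".$report->bytes_received."\n";
echo "Process runtime:    ".$report->process_runtime." sec\n";
?>
```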
Filter-settings

| Method | Description |
|---|---|
| addContentTypeReceiveRule | Adds a rule to the list of rules that decide which pages or files, regarding their content-type, should be received. |
| addURLFilterRule | Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler. |
| addURLFollowRule | Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly. |
| obeyNoFollowTags | Decides whether the crawler should obey "nofollow" tags. |
| obeyRobotsTxt | Defines whether the crawler should parse and obey robots.txt files. |
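A short sketch of the filter settings in use; the content-type and URL patterns are illustrative assumptions:

```php
// Only receive the content of HTML documents ...
$crawler->addContentTypeReceiveRule("#text/html#");

// ... ignore links to common image files ...
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// ... and respect robots.txt files and "nofollow" hints.
$crawler->obeyRobotsTxt(true);
$crawler->obeyNoFollowTags(true);
```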
Overridable methods / User data-processing

| Method | Description |
|---|---|
| handleDocumentInfo | Override this method to get access to all information about a page or file the crawler found and received. |
| handleHeaderInfo | Overridable method that will be called after the header of a document has been received and before its content is received. |
| initChildProcess | Overridable method that will be called by every child process just before it starts the crawling procedure. |
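A sketch of the two most common overrides; it assumes that the header object exposes a content_type property and that returning a negative value from handleHeaderInfo() makes the crawler skip the document's content:

```php
class MyCrawler extends PHPCrawler
{
  // Called once for every page or file the crawler found and received.
  function handleDocumentInfo($DocInfo)
  {
    echo $DocInfo->url." (HTTP ".$DocInfo->http_status_code.")\n";
  }

  // Called after a document's header was received, before its content.
  function handleHeaderInfo($header)
  {
    // Skip the content of anything that isn't HTML (assumption: a
    // negative return value aborts the content transfer).
    if (strpos($header->content_type, "text/html") === false) return -1;
    return 1;
  }
}
```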
Limit-settings

| Method | Description |
|---|---|
| setContentSizeLimit | Sets the content-size limit for content the crawler should receive from documents. |
| setCrawlingDepthLimit | Sets the maximum crawling depth. |
| setRequestDelay | Sets a delay for every HTTP request the crawler executes. |
| setRequestLimit | Sets a limit on the total number of requests the crawler should execute. |
| setTrafficLimit | Sets a limit on the total number of bytes the crawler should receive during the crawling process. |
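The limit settings combined in one sketch; all numbers are arbitrary examples:

```php
$crawler->setRequestLimit(100);            // execute at most 100 requests
$crawler->setCrawlingDepthLimit(3);        // follow links at most 3 levels deep
$crawler->setTrafficLimit(1000 * 1024);    // stop after roughly 1 MB received in total
$crawler->setContentSizeLimit(500 * 1024); // don't receive more than 500 KB per document
$crawler->setRequestDelay(0.5);            // pause half a second between requests
```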
Linkfinding settings

| Method | Description |
|---|---|
| addLinkSearchContentType | Adds a rule to the list of rules that decide in what kinds of documents, regarding their content-type, the crawler should search for links. |
| enableAggressiveLinkSearch | Enables or disables aggressive link-searching. |
| excludeLinkSearchDocumentSections | Defines the sections of HTML documents that will get ignored by the link-finding algorithm. |
| setLinkExtractionTags | Sets the list of HTML tags the crawler should search for links in. |
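A sketch of the link-finding options; the content-type pattern, section constants, and tag list are example choices:

```php
// Also search CSS files for links (example content-type pattern).
$crawler->addLinkSearchContentType("#text/css#");

// Ignore <script> sections and HTML comments when searching for links.
$crawler->excludeLinkSearchDocumentSections(
  PHPCrawlerLinkSearchDocumentSections::SCRIPT_SECTIONS |
  PHPCrawlerLinkSearchDocumentSections::HTML_COMMENT_SECTIONS
);

// Only extract links from these tags/attributes.
$crawler->setLinkExtractionTags(array("href", "src"));

// Don't try to guess links from plain text or scripts.
$crawler->enableAggressiveLinkSearch(false);
```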
Process resumption

| Method | Description |
|---|---|
| enableResumption | Prepares the crawler for process resumption. |
| getCrawlerId | Returns the unique ID of the crawler instance. |
| resume | Resumes the crawling process with the given crawler ID. |
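A resumption sketch, assuming the crawler ID gets persisted to a file between runs; the file path is a placeholder:

```php
$crawler = new MyCrawler();
$crawler->enableResumption();

// First run: remember the crawler-ID so an aborted process can be resumed later.
$crawler_id = $crawler->getCrawlerId();
file_put_contents("/tmp/crawlerid.tmp", $crawler_id);

// Later run: read the ID back and resume the aborted process instead.
// $crawler_id = file_get_contents("/tmp/crawlerid.tmp");
// $crawler->resume($crawler_id);

$crawler->go();
```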
Other settings

| Method | Description |
|---|---|
| addBasicAuthentication | Adds a basic authentication (username and password) to the list of basic authentications that will be sent with requests. |
| addLinkPriority | Adds a regular expression together with a priority level to the list of rules that decide which links should be preferred. |
| addPostData | Adds post data together with a URL rule to the list of post data to send with requests. |
| addStreamToFileContentType | Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file. |
| enableCookieHandling | Enables or disables cookie handling. |
| requestGzipContent | Enables support/requests for gzip-encoded content. |
| setConnectionTimeout | Sets the timeout in seconds for connection attempts to hosting webservers. |
| setFollowRedirects | Defines whether the crawler should follow redirects sent with headers by a webserver or not. |
| setFollowRedirectsTillContent | Defines whether the crawler should follow HTTP redirects until first content was found, regardless of defined filter rules and follow modes. |
| setProxy | Assigns a proxy server the crawler should use for all HTTP requests. |
| setStreamTimeout | Sets the timeout in seconds for waiting for data on an established server connection. |
| setUserAgentString | Sets the "User-Agent" identification string that will be sent with HTTP requests. |
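A combined sketch of these settings; the user-agent string, credentials, URL pattern, proxy host, and timeout values are placeholders:

```php
$crawler->enableCookieHandling(true);
$crawler->requestGzipContent(true);
$crawler->setUserAgentString("MyCrawler/1.0 (example)");
$crawler->setConnectionTimeout(10); // seconds to wait for a connection
$crawler->setStreamTimeout(20);     // seconds to wait for data on an open connection
$crawler->setFollowRedirects(true);

// Send basic-auth credentials with requests to matching URLs.
$crawler->addBasicAuthentication("#http://www\.example\.com/protected/#", "user", "pass");

// Route all HTTP requests through a proxy.
$crawler->setProxy("proxy.example.com", 8080);
```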
Deprecated

| Method | Description |
|---|---|
| addFollowMatch | Alias for addURLFollowRule(). (deprecated!) |
| addLinkExtractionTags | Sets the list of HTML tags from which links should be extracted. (deprecated!) |
| addNonFollowMatch | Alias for addURLFilterRule(). (deprecated!) |
| addReceiveContentType | Alias for addContentTypeReceiveRule(). (deprecated!) |
| addReceiveToMemoryMatch | Has no function anymore. (deprecated!) |
| addReceiveToTmpFileMatch | Alias for addStreamToFileContentType(). (deprecated!) |
| disableExtendedLinkInfo | Has no function anymore. (deprecated!) |
| getReport | Returns an array with summarizing report information after the crawling process has finished. (deprecated!) |
| setAggressiveLinkExtraction | Alias for enableAggressiveLinkSearch(). (deprecated!) |
| setCookieHandling | Alias for enableCookieHandling(). (deprecated!) |
| setPageLimit | Alias for setRequestLimit(). (deprecated!) |
| setTmpFile | Has no function anymore. (deprecated!) |