Public Methods
Basic settings

| Method | Description |
|---|---|
| getProcessReport | Returns a summarizing report about the crawling process after it has finished. |
| go | Starts the crawling process in single-process mode. |
| goMultiProcessed | Starts the crawler using multiple processes. |
| setFollowMode | Sets the basic follow mode of the crawler. |
| setHTTPProtocolVersion | Sets the HTTP protocol version the crawler should use for requests. |
| setPort | Sets the port to connect to when crawling the starting URL set in setURL(). |
| setURL | Sets the URL of the first page the crawler should crawl (root page). |
| setUrlCacheType | Defines what type of cache will be used internally for caching URLs. |
| setWorkingDirectory | Sets the working directory the crawler should use for storing temporary data. |
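A minimal usage sketch of the basic settings above, assuming the library's main class file is included from a local path; the include path, subclass name, and URL are placeholder assumptions:

```php
<?php
// Include the PHPCrawl main class (path is an assumption; adjust to your setup).
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  // See "Overridable methods" below for details on this hook.
  function handleDocumentInfo($DocInfo)
  {
    echo "Received: ".$DocInfo->url."\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");    // root page to start from
$crawler->setFollowMode(2);             // 2 = stay on the same host
$crawler->setWorkingDirectory("/tmp/"); // where temporary data gets stored

$crawler->go(); // single-process mode; goMultiProcessed(5) would use 5 processes

// Summarizing report after the process has finished
$report = $crawler->getProcessReport();
echo "Documents received: ".$report->files_received."\n";
echo "Bytes received:     ".$report->bytes_received."\n";
echo "Process runtime:    ".$report->process_runtime." sec\n";
?>
```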
Filter-settings

| Method | Description |
|---|---|
| addContentTypeReceiveRule | Adds a rule to the list of rules that decide which pages or files, regarding their content-type, should be received. |
| addURLFilterRule | Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler. |
| addURLFollowRule | Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly. |
| obeyNoFollowTags | Decides whether the crawler should obey "nofollow" tags. |
| obeyRobotsTxt | Defines whether the crawler should parse and obey robots.txt files. |
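A short sketch of the filter settings in use; the content-type and URL patterns are illustrative assumptions:

```php
// Only receive the content of HTML documents ...
$crawler->addContentTypeReceiveRule("#text/html#");

// ... ignore links to common image files ...
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// ... and respect robots.txt files and "nofollow" hints.
$crawler->obeyRobotsTxt(true);
$crawler->obeyNoFollowTags(true);
```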
Overridable methods / User data-processing

| Method | Description |
|---|---|
| handleDocumentInfo | Override this method to get access to all information about a page or file the crawler found and received. |
| handleHeaderInfo | Overridable method that will be called after the header of a document has been received and before its content is received. |
| initChildProcess | Overridable method that will be called by every child process just before it starts the crawling procedure. |
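A sketch of the two most common overrides; it assumes that the header object exposes a content_type property and that returning a negative value from handleHeaderInfo() makes the crawler skip the document's content:

```php
class MyCrawler extends PHPCrawler
{
  // Called once for every page or file the crawler found and received.
  function handleDocumentInfo($DocInfo)
  {
    echo $DocInfo->url." (HTTP ".$DocInfo->http_status_code.")\n";
  }

  // Called after a document's header was received, before its content.
  function handleHeaderInfo($header)
  {
    // Skip the content of anything that isn't HTML (assumption: a
    // negative return value aborts the content transfer).
    if (strpos($header->content_type, "text/html") === false) return -1;
    return 1;
  }
}
```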
Limit-settings

| Method | Description |
|---|---|
| setContentSizeLimit | Sets the content-size limit for content the crawler should receive from documents. |
| setCrawlingDepthLimit | Sets the maximum crawling depth. |
| setRequestDelay | Sets a delay for every HTTP request the crawler executes. |
| setRequestLimit | Sets a limit on the total number of requests the crawler should execute. |
| setTrafficLimit | Sets a limit on the total number of bytes the crawler should receive during the crawling process. |
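The limit settings combined in one sketch; all numbers are arbitrary examples:

```php
$crawler->setRequestLimit(100);            // execute at most 100 requests
$crawler->setCrawlingDepthLimit(3);        // follow links at most 3 levels deep
$crawler->setTrafficLimit(1000 * 1024);    // stop after roughly 1 MB received in total
$crawler->setContentSizeLimit(500 * 1024); // don't receive more than 500 KB per document
$crawler->setRequestDelay(0.5);            // pause half a second between requests
```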
Linkfinding settings

| Method | Description |
|---|---|
| addLinkSearchContentType | Adds a rule to the list of rules that decide in what kinds of documents, regarding their content-type, the crawler should search for links. |
| enableAggressiveLinkSearch | Enables or disables aggressive link-searching. |
| excludeLinkSearchDocumentSections | Defines the sections of HTML documents that will get ignored by the link-finding algorithm. |
| setLinkExtractionTags | Sets the list of HTML tags the crawler should search for links in. |
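A sketch of the link-finding options; the content-type pattern, section constants, and tag list are example choices:

```php
// Also search CSS files for links (example content-type pattern).
$crawler->addLinkSearchContentType("#text/css#");

// Ignore <script> sections and HTML comments when searching for links.
$crawler->excludeLinkSearchDocumentSections(
  PHPCrawlerLinkSearchDocumentSections::SCRIPT_SECTIONS |
  PHPCrawlerLinkSearchDocumentSections::HTML_COMMENT_SECTIONS
);

// Only extract links from these tags/attributes.
$crawler->setLinkExtractionTags(array("href", "src"));

// Don't try to guess links from plain text or scripts.
$crawler->enableAggressiveLinkSearch(false);
```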
Process resumption

| Method | Description |
|---|---|
| enableResumption | Prepares the crawler for process resumption. |
| getCrawlerId | Returns the unique ID of the crawler instance. |
| resume | Resumes the crawling process with the given crawler ID. |
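A resumption sketch, assuming the crawler ID gets persisted to a file between runs; the file path is a placeholder:

```php
$crawler = new MyCrawler();
$crawler->enableResumption();

// First run: remember the crawler-ID so an aborted process can be resumed later.
$crawler_id = $crawler->getCrawlerId();
file_put_contents("/tmp/crawlerid.tmp", $crawler_id);

// Later run: read the ID back and resume the aborted process instead.
// $crawler_id = file_get_contents("/tmp/crawlerid.tmp");
// $crawler->resume($crawler_id);

$crawler->go();
```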
Other settings

| Method | Description |
|---|---|
| addBasicAuthentication | Adds a basic authentication (username and password) to the list of basic authentications that will be sent with requests. |
| addLinkPriority | Adds a regular expression together with a priority level to the list of rules that decide which links should be preferred. |
| addPostData | Adds post data together with a URL rule to the list of post data to send with requests. |
| addStreamToFileContentType | Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file. |
| enableCookieHandling | Enables or disables cookie handling. |
| requestGzipContent | Enables support/requests for gzip-encoded content. |
| setConnectionTimeout | Sets the timeout in seconds for connection attempts to hosting webservers. |
| setFollowRedirects | Defines whether the crawler should follow redirects sent with headers by a webserver or not. |
| setFollowRedirectsTillContent | Defines whether the crawler should follow HTTP redirects until first content was found, regardless of defined filter rules and follow modes. |
| setProxy | Assigns a proxy server the crawler should use for all HTTP requests. |
| setStreamTimeout | Sets the timeout in seconds for waiting for data on an established server connection. |
| setUserAgentString | Sets the "User-Agent" identification string that will be sent with HTTP requests. |
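A combined sketch of these settings; the user-agent string, credentials, URL pattern, proxy host, and timeout values are placeholders:

```php
$crawler->enableCookieHandling(true);
$crawler->requestGzipContent(true);
$crawler->setUserAgentString("MyCrawler/1.0 (example)");
$crawler->setConnectionTimeout(10); // seconds to wait for a connection
$crawler->setStreamTimeout(20);     // seconds to wait for data on an open connection
$crawler->setFollowRedirects(true);

// Send basic-auth credentials with requests to matching URLs.
$crawler->addBasicAuthentication("#http://www\.example\.com/protected/#", "user", "pass");

// Route all HTTP requests through a proxy.
$crawler->setProxy("proxy.example.com", 8080);
```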
Deprecated

| Method | Description |
|---|---|
| addFollowMatch | Alias for addURLFollowRule(). (deprecated!) |
| addLinkExtractionTags | Sets the list of HTML tags from which links should be extracted. (deprecated!) |
| addNonFollowMatch | Alias for addURLFilterRule(). (deprecated!) |
| addReceiveContentType | Alias for addContentTypeReceiveRule(). (deprecated!) |
| addReceiveToMemoryMatch | Has no function anymore. (deprecated!) |
| addReceiveToTmpFileMatch | Alias for addStreamToFileContentType(). (deprecated!) |
| disableExtendedLinkInfo | Has no function anymore. (deprecated!) |
| getReport | Returns an array with summarizing report information after the crawling process has finished. (deprecated!) |
| setAggressiveLinkExtraction | Alias for enableAggressiveLinkSearch(). (deprecated!) |
| setCookieHandling | Alias for enableCookieHandling(). (deprecated!) |
| setPageLimit | Alias for setRequestLimit(). (deprecated!) |
| setTmpFile | Has no function anymore. (deprecated!) |