Documentation for method: PHPCrawler::obeyRobotsTxt()

Defines whether the crawler should parse and obey robots.txt-files.

Signature:

public obeyRobotsTxt($mode, $robots_txt_uri = null)

Parameters:

$mode	bool	Set to TRUE if you want the crawler to obey robots.txt-files.
$robots_txt_uri	string	Optional. The URL or path of the robots.txt-file to obey, given as a URI (like "http://mysite.com/path/myrobots.txt" or "file://../a_robots_file.txt"). If not set (or set to null), the crawler uses the default robots.txt-location of the root-URL ("http://rooturl.com/robots.txt").

Returns:

bool

Description:

If this is set to TRUE, the crawler looks for a robots.txt-file for the root-URL of the crawling-process at the default location
and, if present, parses it and obeys all contained directives that apply to the
useragent-identification of the crawler ("PHPCrawl" by default, or the value set manually by calling setUserAgentString()).

The default-value is FALSE (for compatibility reasons).
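
A minimal usage sketch of the behaviour described above. The include path, the subclass name RobotsAwareCrawler and the user-agent string "MyCrawler" are assumptions made for illustration; adjust them to your own setup.

```php
<?php
// Assumed include path; adjust it to where PHPCrawl is installed.
include("libs/PHPCrawler.class.php");

// Hypothetical subclass that just prints every URL the crawler receives.
class RobotsAwareCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . "\n";
    }
}

$crawler = new RobotsAwareCrawler();
$crawler->setURL("http://foo.com/");

// robots.txt directives are matched against this user-agent identification.
$crawler->setUserAgentString("MyCrawler");

// Obey the robots.txt-file at the default location of the root-URL.
$crawler->obeyRobotsTxt(true);

// Alternative: obey a robots.txt-file at an explicit URI instead.
// $crawler->obeyRobotsTxt(true, "http://foo.com/path/myrobots.txt");

$crawler->go();
```

Because the default is FALSE, robots.txt-files are only taken into account if obeyRobotsTxt(true) is called before go().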

Please note that the directives found in a robots.txt-file have a higher priority than other settings made by the user.
For example, if addFollowMatch("#http://foo\.com/path/file\.html#") was set, but a directive in the robots.txt-file of the host
foo.com says "Disallow: /path/", the URL http://foo.com/path/file.html will still be ignored by the crawler.
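
The precedence rule above can be illustrated with a simplified, self-contained model. This is not PHPCrawl's internal code: the function isUrlCrawlable() and its parameters are hypothetical, and "Disallow" directives are reduced to plain path prefixes.

```php
<?php
// Simplified model of the precedence rule; NOT PHPCrawl's implementation.
// isUrlCrawlable() and its parameters are hypothetical names.
function isUrlCrawlable(string $url, array $followMatches, array $disallowedPaths): bool
{
    // Fall back to "/" if the URL has no path component.
    $path = parse_url($url, PHP_URL_PATH) ?: "/";

    // robots.txt directives are checked first and always win.
    foreach ($disallowedPaths as $prefix) {
        if ($prefix !== "" && strpos($path, $prefix) === 0) {
            return false;
        }
    }

    // Only then are the user-defined follow rules consulted.
    foreach ($followMatches as $regex) {
        if (preg_match($regex, $url) === 1) {
            return true;
        }
    }
    return false;
}

$follow   = ['#http://foo\.com/path/file\.html#'];
$disallow = ["/path/"];

// Follow rule matches, but "Disallow: /path/" takes priority.
var_dump(isUrlCrawlable("http://foo.com/path/file.html", $follow, $disallow)); // bool(false)

// Without the Disallow directive, the follow rule applies.
var_dump(isUrlCrawlable("http://foo.com/path/file.html", $follow, []));        // bool(true)
```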
