public obeyRobotsTxt($mode, $robots_txt_uri = null)
$mode | bool | Set to TRUE if you want the crawler to obey robots.txt-files. |
$robots_txt_uri | string | Optional. The URL or path of the robots.txt-file to obey, given as a URI (like "http://mysite.com/path/myrobots.txt" or "file://../a_robots_file.txt"). If not set (or set to null), the crawler uses the default robots.txt-location of the root-URL ("http://rooturl.com/robots.txt"). |
bool |
If this is set to TRUE, the crawler looks for a robots.txt-file for the root-URL of the crawling-process at the default location
and - if present - parses it and obeys all contained directives applying to the
useragent-identification of the crawler ("PHPCrawl" by default, or set manually by calling setUserAgentString()).
The default-value is FALSE (for compatibility reasons).
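A minimal sketch of how this setting might be enabled follows. The helper name configureRobotsAwareCrawler() and the user-agent string "MyCrawler" are assumptions made for illustration only; the sketch also assumes the PHPCrawl classes have already been loaded, and returns null if they have not.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): builds a crawler that obeys robots.txt.
// Assumption: the PHPCrawl library has already been included by the calling script.
function configureRobotsAwareCrawler(string $startUrl, ?string $robotsTxtUri = null): ?object
{
    if (!class_exists('PHPCrawler')) {
        return null; // library not loaded, nothing to configure
    }

    $crawler = new PHPCrawler();
    $crawler->setURL($startUrl);

    // Directives in robots.txt are matched against this identification.
    $crawler->setUserAgentString('MyCrawler');

    // TRUE switches robots.txt handling on (it is off by default).
    // A null URI means the default location of the root-URL is used.
    $crawler->obeyRobotsTxt(true, $robotsTxtUri);

    return $crawler;
}
```

Passing a second argument such as "http://mysite.com/path/myrobots.txt" would point the crawler at a non-default robots.txt-file instead.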
Please note that the directives found in a robots.txt-file have a higher priority than other settings made by the user.
If e.g. addFollowMatch("#http://foo\.com/path/file\.html#") was set, but a directive in the robots.txt-file of the host
foo.com says "Disallow: /path/", the URL http://foo.com/path/file.html will be ignored by the crawler anyway.
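The precedence described above can be modelled with a small standalone sketch. It does not use PHPCrawl itself: wouldBeFollowed() is a hypothetical function, and it simplifies matters by treating Disallow directives as plain path prefixes and by requiring at least one follow-match to accept a URL.

```php
<?php
// Hypothetical model of the precedence rule only; not PHPCrawl's implementation.
function wouldBeFollowed(string $url, array $followMatches, array $disallowedPaths): bool
{
    $path = parse_url($url, PHP_URL_PATH) ?: '/';

    // robots.txt is consulted first: a matching Disallow prefix always wins.
    foreach ($disallowedPaths as $prefix) {
        if ($prefix !== '' && strpos($path, $prefix) === 0) {
            return false;
        }
    }

    // Only then are the user's follow-match expressions considered.
    foreach ($followMatches as $regex) {
        if (preg_match($regex, $url) === 1) {
            return true;
        }
    }

    return false;
}

$follow = ['#http://foo\.com/path/file\.html#'];

// "Disallow: /path/" overrides the follow-match.
var_dump(wouldBeFollowed('http://foo.com/path/file.html', $follow, ['/path/'])); // bool(false)

// Without the directive, the follow-match decides.
var_dump(wouldBeFollowed('http://foo.com/path/file.html', $follow, []));         // bool(true)
```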