Method: PHPCrawler::excludeLinkSearchDocumentSections()



Defines the sections of HTML documents that the link-finding algorithm will ignore.
Signature:

public excludeLinkSearchDocumentSections($document_sections)

Parameters:

$document_sections int Bitwise combination of the PHPCrawlerLinkSearchDocumentSections-constants.

Returns:

No information

Description:

By default, phpcrawl searches for links in the entire content of every document it receives during the crawling process.
This sometimes produces nonexistent "phantom URLs", e.g. because the crawler mistook some JavaScript code
for a link, or because it found a link inside an HTML comment that points to something that no longer exists.

With this method, users can define which predefined sections of HTML documents are ignored when the crawler
searches for links.

See PHPCrawlerLinkSearchDocumentSections-constants for all predefined sections.

Example 1:
// Let the crawler ignore script sections and HTML comment sections when finding links
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::SCRIPT_SECTIONS |
                                            PHPCrawlerLinkSearchDocumentSections::HTML_COMMENT_SECTIONS);

Example 2:
// Let the crawler ignore all special sections except HTML comments
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS ^
                                            PHPCrawlerLinkSearchDocumentSections::HTML_COMMENT_SECTIONS);
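
The bitwise operators used in the two examples above can be illustrated with a small self-contained sketch. It does not use phpcrawl itself: the class DemoSections, its constant OTHER_SECTIONS, the helper isExcluded() and all numeric values are assumptions made up for illustration, and the real PHPCrawlerLinkSearchDocumentSections constants may have different values.

```php
<?php
// Hypothetical stand-in for PHPCrawlerLinkSearchDocumentSections.
// The numeric values are assumptions chosen for illustration only.
final class DemoSections
{
    const SCRIPT_SECTIONS       = 1; // binary 001
    const HTML_COMMENT_SECTIONS = 2; // binary 010
    const OTHER_SECTIONS        = 4; // binary 100 (hypothetical third section)
    const ALL_SPECIAL_SECTIONS  = 7; // binary 111, every flag set
}

// Hypothetical helper: returns true if $flag is set in $mask.
function isExcluded(int $mask, int $flag): bool
{
    return ($mask & $flag) === $flag;
}

// As in Example 1: bitwise OR sets both flags.
$mask1 = DemoSections::SCRIPT_SECTIONS | DemoSections::HTML_COMMENT_SECTIONS;

// As in Example 2: bitwise XOR toggles one flag; starting from the full set,
// this removes it.
$mask2 = DemoSections::ALL_SPECIAL_SECTIONS ^ DemoSections::HTML_COMMENT_SECTIONS;

// Alternative: AND with the complement always clears the flag,
// even if it was not set in the first place.
$mask3 = DemoSections::ALL_SPECIAL_SECTIONS & ~DemoSections::HTML_COMMENT_SECTIONS;

var_dump(isExcluded($mask1, DemoSections::HTML_COMMENT_SECTIONS));
var_dump(isExcluded($mask2, DemoSections::HTML_COMMENT_SECTIONS));
```

XOR only removes a flag that is already set; applied to a mask that lacks the flag, it would add it instead. When the starting mask is not known to contain the flag, the "& ~" form is the safer choice.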