
Class: PHPCrawlerDocumentInfo

Contains information about a page or file the crawler found and received during the crawling-process.

Members:

Public Properties
URL-related information
file: The name of the requested page or file, e.g. "page.html".
host: The host part of the URL of the requested page or file, e.g. "www.foo.com".
path: The path in the URL of the requested page or file, e.g. "/page/".
port: The port of the URL the request was sent to, e.g. 80.
protocol: The protocol part of the URL of the page or file, e.g. "http://".
query: The query part of the URL of the requested page or file, e.g. "?x=y".
url: The complete, fully qualified URL of the page or file, e.g. "http://www.foo.com/bar/page.html?x=y".
url_link_depth: The link depth of the URL relative to the entry URL of the crawling process.
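
The example values above suggest how the URL parts fit together. The following is a minimal sketch, not part of the library: a plain stdClass stands in for a PHPCrawlerDocumentInfo object, rebuildUrl() is a hypothetical helper, and the rule that default ports are left out of the URL is an assumption.

```php
<?php
// Stand-in for a PHPCrawlerDocumentInfo object, filled with example values
// in the style of the property list above.
$doc = new stdClass();
$doc->protocol = "http://";
$doc->host     = "www.foo.com";
$doc->port     = 80;
$doc->path     = "/bar/";
$doc->file     = "page.html";
$doc->query    = "?x=y";
$doc->url      = "http://www.foo.com/bar/page.html?x=y";

// Hypothetical helper: concatenate the URL parts back into one URL.
// ASSUMPTION: the port is omitted when it is the default for the protocol.
function rebuildUrl(object $doc): string
{
    $defaultPorts = ["http://" => 80, "https://" => 443];
    $isDefault = isset($defaultPorts[$doc->protocol])
        && $defaultPorts[$doc->protocol] === (int) $doc->port;
    $portPart = $isDefault ? "" : ":" . $doc->port;

    return $doc->protocol . $doc->host . $portPart
        . $doc->path . $doc->file . $doc->query;
}

echo rebuildUrl($doc), PHP_EOL; // prints http://www.foo.com/bar/page.html?x=y
```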
Content-related information
bytes_received: The number of bytes of the document's content the crawler received.
content: The content of the requested document (HTML source code or content of the file).
content_tmp_file: The temporary file the content was received into.
content_type: The content type of the page or file, e.g. "text/html" or "image/gif".
cookies: Cookies sent by the server.
header: The complete HTTP header the webserver sent in response for this page or file.
header_bytes_received: The number of bytes of the document's header the crawler received.
http_status_code: The HTTP status code the webserver returned for the request, e.g. 200 (OK) or 404 (file not found).
meta_attributes: All meta-tag attributes found in the source of the document.
received: Flag indicating whether content was received from the page or file.
received_completely: Flag indicating whether content was completely received from the page or file.
received_to_file: Will be true if the content was received into a temporary file.
received_to_memory: Will be true if the content was received into local memory.
responseHeader: The complete HTTP header the webserver sent in response for this page or file, as a PHPCrawlerResponseHeader object.
source: Same as "content", the content of the requested document.
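
Because content may end up either in memory or in a temporary file, code that consumes a document usually has to check the flags first. The sketch below uses a stdClass stand-in instead of a real PHPCrawlerDocumentInfo object; readBody() is a hypothetical helper, and it assumes that the content property is not usable when the document was received into a file.

```php
<?php
// Hypothetical helper: return the document body from wherever it was stored,
// or null if nothing was received or the temporary file cannot be read.
function readBody(object $doc): ?string
{
    if (!$doc->received) {
        return null;
    }
    if ($doc->received_to_file) {
        $body = file_get_contents($doc->content_tmp_file);
        return $body === false ? null : $body;
    }
    return $doc->content; // received_to_memory case
}

// Stand-in for a document that was received into a temporary file.
$tmp = tempnam(sys_get_temp_dir(), "doc");
file_put_contents($tmp, "<html>hello</html>");

$doc = new stdClass();
$doc->received           = true;
$doc->received_to_file   = true;
$doc->received_to_memory = false;
$doc->content_tmp_file   = $tmp;
$doc->content            = "";

$body = readBody($doc);
echo $body, PHP_EOL;
unlink($tmp); // clean up the stand-in temporary file
```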
Information about found links
links_found: A numeric array containing information about all links that were found in the source of the page.
links_found_url_descriptors: A numeric array containing a PHPCrawlerURLDescriptor object for every link that was found in the page.
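
As an illustration of walking the descriptor array, the sketch below collects the distinct hosts a page links to. It runs on stdClass stand-ins rather than real PHPCrawlerURLDescriptor objects; linkedHosts() is a hypothetical helper, and the url_rebuild property name is an assumption that should be checked against the PHPCrawlerURLDescriptor reference.

```php
<?php
// Hypothetical helper: list the distinct hosts found in a document's links.
function linkedHosts(object $doc): array
{
    $hosts = [];
    foreach ($doc->links_found_url_descriptors as $descriptor) {
        // ASSUMPTION: each descriptor exposes the absolute link as url_rebuild.
        $host = parse_url($descriptor->url_rebuild, PHP_URL_HOST);
        if (is_string($host)) {
            $hosts[$host] = true; // array keys de-duplicate the hosts
        }
    }
    return array_keys($hosts);
}

// Stand-in document with three found links on two hosts.
$doc = new stdClass();
$doc->links_found_url_descriptors = [];
foreach (["http://www.foo.com/a.html", "http://www.foo.com/b.html", "http://bar.example/"] as $link) {
    $descriptor = new stdClass();
    $descriptor->url_rebuild = $link;
    $doc->links_found_url_descriptors[] = $descriptor;
}

echo implode(", ", linkedHosts($doc)), PHP_EOL; // prints www.foo.com, bar.example
```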
Referer information
referer_url: The complete URL of the page that contained the link to this document.
refering_link_raw: The raw link as it was found in the content of the referring URL, e.g. "../foo.html".
refering_linkcode: The HTML source code that contained the link to the current document.
refering_linktext: The link text of the link that pointed to this document.
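
The referer properties make it possible to report where a document was linked from, which is handy for broken-link reports. A minimal sketch with a stdClass stand-in follows; describeOrigin() is a hypothetical helper, the sample values are invented for illustration, and treating an empty referer_url as "no referring page" is an assumption.

```php
<?php
// Hypothetical helper: describe how the crawler reached a document.
function describeOrigin(object $doc): string
{
    // ASSUMPTION: referer_url is empty when there is no referring page.
    if (empty($doc->referer_url)) {
        return $doc->url . " (no referring page)";
    }
    return sprintf(
        '%s, linked from %s via "%s" (text: "%s")',
        $doc->url,
        $doc->referer_url,
        $doc->refering_link_raw,
        $doc->refering_linktext
    );
}

// Stand-in document with invented sample values.
$doc = new stdClass();
$doc->url               = "http://www.foo.com/bar/foo.html";
$doc->referer_url       = "http://www.foo.com/bar/baz/index.html";
$doc->refering_link_raw = "../foo.html";
$doc->refering_linktext = "Foo";

echo describeOrigin($doc), PHP_EOL;
```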
Error-handling
error_code: The code of the error, if any, that occurred while requesting or receiving the document (see the PHPCrawlerRequestErrors::ERROR_... constants).
error_occured: Indicates whether an error occurred while requesting or receiving the document.
error_string: A human-readable string describing the error, if any, that occurred while requesting or receiving the document.
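
A request can fail at the connection level (reported through the error properties) or succeed technically but return an HTTP error status. The sketch below checks both, using stdClass stand-ins; summarizeRequest() is a hypothetical helper, and the error code and error string in the sample data are invented placeholders, not real PHPCrawlerRequestErrors values.

```php
<?php
// Hypothetical helper: turn a document's status into a short summary string.
function summarizeRequest(object $doc): string
{
    if ($doc->error_occured) {
        return sprintf("error %s: %s", $doc->error_code, $doc->error_string);
    }
    if ($doc->http_status_code >= 400) {
        return "HTTP " . $doc->http_status_code;
    }
    return $doc->received_completely ? "ok" : "partial";
}

// Stand-in for a failed request (code and message are invented placeholders).
$failed = new stdClass();
$failed->error_occured = true;
$failed->error_code    = 1;
$failed->error_string  = "Connection timed out";

// Stand-in for a request answered with 404.
$missing = new stdClass();
$missing->error_occured       = false;
$missing->http_status_code    = 404;
$missing->received_completely = true;

echo summarizeRequest($failed), PHP_EOL;  // prints error 1: Connection timed out
echo summarizeRequest($missing), PHP_EOL; // prints HTTP 404
```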
Benchmarks
data_transfer_rate: The approximate data transfer rate for this document.
data_transfer_time: The approximate time it took to receive the data of the document.
server_connect_time: The time it took to connect to the server.
server_response_time: The server response time.
unbuffered_bytes_read: The number of unbuffered bytes received.
Deprecated
received_completly: Alias for received_completely; the name was misspelled in previous versions of phpcrawl. (deprecated!)
Other
header_send: The complete HTTP request header the crawler sent to the server (debugging info).
traffic_limit_reached: Indicates whether the traffic limit set by the user was reached after downloading this document.
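
To relate traffic_limit_reached to the byte counters listed earlier, the sketch below adds up the traffic of several documents and notes which one hit the limit. It uses stdClass stand-ins with invented numbers; trafficSummary() is a hypothetical helper, and counting header bytes plus content bytes as the total traffic is an assumption about how the crawler accounts for it.

```php
<?php
// Hypothetical helper: total the received bytes over a list of documents and
// report the URL of the first document that reached the traffic limit, if any.
function trafficSummary(array $docs): array
{
    $total = 0;
    $limitUrl = null;
    foreach ($docs as $doc) {
        // ASSUMPTION: total traffic = content bytes + header bytes.
        $total += $doc->bytes_received + $doc->header_bytes_received;
        if ($limitUrl === null && $doc->traffic_limit_reached) {
            $limitUrl = $doc->url;
        }
    }
    return ["total_bytes" => $total, "limit_reached_at" => $limitUrl];
}

// Stand-in documents with invented byte counts.
$first = new stdClass();
$first->url = "http://www.foo.com/";
$first->bytes_received = 1000;
$first->header_bytes_received = 200;
$first->traffic_limit_reached = false;

$second = new stdClass();
$second->url = "http://www.foo.com/bar/page.html";
$second->bytes_received = 500;
$second->header_bytes_received = 150;
$second->traffic_limit_reached = true;

$summary = trafficSummary([$first, $second]);
echo $summary["total_bytes"], " bytes, limit reached at ", $summary["limit_reached_at"], PHP_EOL;
```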