Class HttpRobotRulesParser
- java.lang.Object
  - org.apache.nutch.protocol.RobotRulesParser
    - org.apache.nutch.protocol.http.api.HttpRobotRulesParser

All Implemented Interfaces:
- Configurable
- Tool
 
public class HttpRobotRulesParser extends RobotRulesParser

Parses robots.txt rules for URLs that use the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP-specific implementation for obtaining the robots.txt file.

Field Summary

Fields (Modifier and Type, Field):
- protected boolean allowForbidden
- protected boolean deferVisits503

Fields inherited from class org.apache.nutch.protocol.RobotRulesParser:
- agentNames, allowList, CACHE, conf, DEFER_VISIT_RULES, EMPTY_RULES, FORBID_ALL_RULES, maxNumRedirects
 
Constructor Summary

Constructors:
- HttpRobotRulesParser(Configuration conf)

Method Summary

Methods (Modifier and Type, Method, Description):
- protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse): Append Content of robots.txt to robotsTxtContent.
- protected static String getCacheKey(URL url): Compose a unique key to store and access robot rules in the cache for the given URL.
- crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent): Get the rules from robots.txt that apply to the given url.
- void setConf(Configuration conf): Set the Configuration object.

Methods inherited from class org.apache.nutch.protocol.RobotRulesParser:
- getConf, getRobotRulesSet, isAllowListed, main, parseRules, parseRules, run
 
Constructor Detail

HttpRobotRulesParser

public HttpRobotRulesParser(Configuration conf)
 
Method Detail

setConf

public void setConf(Configuration conf)

Description copied from class: RobotRulesParser
Set the Configuration object.
- Specified by: setConf in interface Configurable
- Overrides: setConf in class RobotRulesParser
 
getCacheKey

protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for the given URL.
- Parameters:
  - url - URL to generate a unique key for
- Returns:
  - the cached unique key
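
The key composition can be illustrated with a minimal sketch. It assumes the key joins the lower-cased protocol, the lower-cased host, and the port (falling back to the protocol's default port); the exact key format is not stated on this page, and the class name RobotsCacheKeySketch is hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

/** Hypothetical stand-in for a cache-key helper; not the Nutch implementation. */
final class RobotsCacheKeySketch {

  /** Builds one key per protocol/host/port, so all URLs of a site share an entry. */
  static String cacheKey(URL url) {
    // Assumed normalization: lower-case protocol and host so that
    // "HTTP://Example.COM" and "http://example.com" yield the same key.
    String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
    String host = url.getHost().toLowerCase(Locale.ROOT);
    // URL.getPort() returns -1 when the URL has no explicit port;
    // fall back to the protocol default (80 for http, 443 for https).
    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
    return protocol + ":" + host + ":" + port;
  }

  public static void main(String[] args) throws MalformedURLException {
    System.out.println(cacheKey(new URL("http://Example.com/a/b.html")));
    System.out.println(cacheKey(new URL("https://example.com:8443/index.html")));
  }
}
```

Because the path is not part of the key, two pages on the same site resolve to the same cache entry, which matches the per-site scope of a robots.txt file.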
 
getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)

Get the rules from robots.txt that apply to the given url. Robot rules are cached for each unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt is then parsed and the rules are cached to avoid fetching and parsing it again.

Following RFC 9309, section 2.3.1.2 (Redirects), up to five consecutive HTTP redirects are followed when fetching the robots.txt file. The maximum number of redirects followed is configurable via the property http.robots.redirect.max.
- Specified by: getRobotRulesSet in class RobotRulesParser
- Parameters:
  - http - the Protocol object
  - url - URL
  - robotsTxtContent - container to store responses when fetching the robots.txt file, for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
  - robotRules, a BaseRobotRules object for the rules
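
The lookup flow described above (check the cache, otherwise fetch robots.txt while following a bounded number of redirects, then cache the result) can be modeled roughly as follows. This is a simplified sketch, not the Nutch code: the names RobotsFetchSketch, Response, Fetcher, and rulesFor are hypothetical, the rules are kept as a raw string rather than parsed, and treating every non-200 final response as an empty rule set is an assumption made only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the lookup flow; not the Nutch implementation. */
final class RobotsFetchSketch {

  /** Minimal response: status code, redirect target (null if none), body. */
  static final class Response {
    final int status;
    final String location;
    final String body;

    Response(int status, String location, String body) {
      this.status = status;
      this.location = location;
      this.body = body;
    }
  }

  /** Hypothetical fetcher; a real crawler would issue an HTTP request here. */
  interface Fetcher {
    Response fetch(String url);
  }

  private final Map<String, String> cache = new HashMap<>();
  private final Fetcher fetcher;
  private final int maxRedirects;
  int fetchCount = 0; // number of requests issued, exposed for illustration

  RobotsFetchSketch(Fetcher fetcher, int maxRedirects) {
    this.fetcher = fetcher;
    this.maxRedirects = maxRedirects;
  }

  /** Returns the robots.txt body for a site such as "http://example.com:80". */
  String rulesFor(String site) {
    String cached = cache.get(site);
    if (cached != null) {
      return cached; // cache hit: no request is sent
    }
    Response response = fetch(site + "/robots.txt");
    int redirects = 0;
    // Follow consecutive redirects, but never more than maxRedirects of them.
    while (isRedirect(response) && redirects < maxRedirects) {
      redirects++;
      response = fetch(response.location);
    }
    // Assumption: anything other than a final 200 yields an empty rule set.
    String body = response.status == 200 ? response.body : "";
    cache.put(site, body);
    return body;
  }

  private Response fetch(String url) {
    fetchCount++;
    return fetcher.fetch(url);
  }

  private static boolean isRedirect(Response r) {
    return r.status >= 300 && r.status < 400 && r.location != null;
  }

  public static void main(String[] args) {
    Map<String, Response> pages = new HashMap<>();
    pages.put("http://example.com:80/robots.txt",
        new Response(301, "http://example.com:80/moved.txt", ""));
    pages.put("http://example.com:80/moved.txt",
        new Response(200, null, "User-agent: *\nDisallow: /private"));
    RobotsFetchSketch sketch = new RobotsFetchSketch(pages::get, 5);
    System.out.println(sketch.rulesFor("http://example.com:80"));
    sketch.rulesFor("http://example.com:80"); // served from the cache
    System.out.println("requests sent: " + sketch.fetchCount);
  }
}
```

In this model the second call for the same site sends no request, and a redirect loop stops after the configured limit instead of running forever.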
 
addRobotsContent

protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)

Append Content of robots.txt to robotsTxtContent.
- Parameters:
  - robotsTxtContent - container to store robots.txt response content
  - robotsUrl - robots.txt URL
  - robotsResponse - response object to be stored
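
The optional-container pattern behind robotsTxtContent (a list that collects responses, or null when the caller does not want them) might look like the sketch below. The helper appendIfRequested and the class RobotsContentSinkSketch are hypothetical, and whether addRobotsContent itself performs the null check is an assumption; the page only states that nothing is stored when null is passed to getRobotRulesSet.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of an optional response container. */
final class RobotsContentSinkSketch {

  /** Appends the item unless the caller passed null; returns true if stored. */
  static <T> boolean appendIfRequested(List<T> sink, T item) {
    if (sink == null) {
      return false; // caller opted out of collecting responses
    }
    sink.add(item);
    return true;
  }

  public static void main(String[] args) {
    List<String> collected = new ArrayList<>();
    appendIfRequested(collected, "301 redirect");
    appendIfRequested(collected, "200 robots.txt body");
    appendIfRequested(null, "ignored"); // no exception, nothing stored
    System.out.println(collected);
  }
}
```

Passing a list keeps every response in fetch order, including redirects and error pages, which is useful for debugging or archiving; passing null skips the bookkeeping entirely.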
 
 