Package org.apache.nutch.urlfilter.fast
Class FastURLFilter
java.lang.Object
  org.apache.nutch.urlfilter.fast.FastURLFilter

All Implemented Interfaces:
Configurable, URLFilter, Pluggable
 
public class FastURLFilter extends Object implements URLFilter

Filters URLs based on a file of regular expressions using host/domains matching first. The default policy is to accept a URL if no matches are found.

Rule Format:

    Host www.example.org
      DenyPath /path/to/be/excluded
      DenyPath /some/other/path/excluded

    # Deny everything from *.example.com and example.com
    Domain example.com
      DenyPath .*

    Domain example.org
      DenyPathQuery /resource/.*?action=exclude

Host rules are evaluated before Domain rules. For Host rules the entire host name of a URL must match, while the domain names in Domain rules are considered matches if the domain is a suffix of the host name (consisting of complete host name parts). Shorter domain suffixes are checked first. A single dot "." as "domain name" can be used to specify global rules applied to every URL.

E.g., for "www.example.com" the rules given above are looked up in the following order:
- check "www.example.com" whether host-based rules exist and whether one of them matches
- check "www.example.com" for domain-based rules
- check "example.com" for domain-based rules
- check "com" for domain-based rules
- check for global rules ("Domain .")
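
The lookup order listed above can be sketched as follows. This is a minimal illustration that simply follows the enumerated steps, not the Nutch implementation: the class LookupOrderSketch and the method lookupKeys are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Nutch: lists the rule keys in the documented lookup order. */
class LookupOrderSketch {

  /** Returns the keys that would be consulted for a host name, labelled by rule type. */
  static List<String> lookupKeys(String host) {
    List<String> keys = new ArrayList<>();
    // Step 1: Host rules require the entire host name to match.
    keys.add("Host " + host);
    // Steps 2-4: Domain rules for the host itself, then for each suffix
    // that consists of complete host name parts.
    String domain = host;
    while (true) {
      keys.add("Domain " + domain);
      int dot = domain.indexOf('.');
      // Stop when no further part remains (also guards against a trailing dot).
      if (dot < 0 || dot == domain.length() - 1) {
        break;
      }
      domain = domain.substring(dot + 1);
    }
    // Step 5: global rules, written as "Domain ." in the rule file.
    keys.add("Domain .");
    return keys;
  }

  public static void main(String[] args) {
    for (String key : lookupKeys("www.example.com")) {
      System.out.println(key);
    }
  }
}
```

For "www.example.com" the sketch yields the five lookups from the list, in the same order.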
URLs without a host name (e.g., file:/path/file.txt) are checked for global rules only. URLs which fail to be parsed as URL are always rejected.

For rules, either the URL path (DenyPath) or the path and query (DenyPathQuery) is checked for whether the given Java regular expression is found (see Matcher.find()) in the URL path (and query).

Rules are applied in the order of their definition. For better performance, regular expressions which are simpler/faster or match more URLs should be defined earlier.

Comments in the rule file start with the # character and reach until the end of the line.

The rules file is defined via the property urlfilter.fast.file; the default name is fast-urlfilter.txt.

In addition, the filter can reject URLs based on the length of the whole URL, its path element or its query element. See the urlfilter.fast.url.* configurations.
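
The difference between DenyPath and DenyPathQuery, and the use of Matcher.find() rather than a full match, can be illustrated with the standard library alone. The following is a sketch under stated assumptions, not the Nutch code: the class DenyRuleSketch and the method rejected are hypothetical, and the sketch assumes that "path and query" corresponds to what java.net.URL.getFile() returns.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

/** Hypothetical illustration, not the Nutch implementation. */
class DenyRuleSketch {

  /**
   * Returns true if the URL would be rejected by a single rule.
   * includeQuery=false mimics DenyPath, includeQuery=true mimics DenyPathQuery.
   */
  static boolean rejected(String regex, String url, boolean includeQuery) {
    URL parsed;
    try {
      parsed = new URL(url);
    } catch (MalformedURLException e) {
      // URLs that fail to parse are always rejected.
      return true;
    }
    // getFile() is the path plus "?query" when a query is present (assumed
    // here to be the DenyPathQuery target); getPath() is the path only.
    String target = includeQuery ? parsed.getFile() : parsed.getPath();
    // find() succeeds if the regex occurs anywhere in the target.
    return Pattern.compile(regex).matcher(target).find();
  }

  public static void main(String[] args) {
    String url = "https://example.org/resource/list?action=exclude";
    String regex = "/resource/.*?action=exclude";
    System.out.println(rejected(regex, url, true));   // path and query are searched
    System.out.println(rejected(regex, url, false));  // only the path is searched
  }
}
```

Because find() searches for an occurrence, a rule such as DenyPath /path/to/be/excluded also applies to longer paths that merely contain that string. Anchors like ^ and $ would be needed to restrict a rule to an exact path.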

Nested Class Summary
- static class FastURLFilter.DenyAllRule: Rule for DenyPath .* or DenyPath .?
- static class FastURLFilter.DenyPathQueryRule
- static class FastURLFilter.DenyPathRule
- static class FastURLFilter.Rule

Field Summary
- protected static org.slf4j.Logger LOG
- static String URLFILTER_FAST_FILE
- static String URLFILTER_FAST_MAX_LENGTH
- static String URLFILTER_FAST_PATH_MAX_LENGTH
- static String URLFILTER_FAST_QUERY_MAX_LENGTH

Fields inherited from interface org.apache.nutch.net.URLFilter: X_POINT_ID
 
Constructor Summary
- FastURLFilter()

Method Summary
- String filter(String url): Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null
- Configuration getConf()
- void reloadRules()
- void setConf(Configuration conf)
 
Field Detail

LOG
protected static final org.slf4j.Logger LOG

URLFILTER_FAST_FILE
public static final String URLFILTER_FAST_FILE
See Also: Constant Field Values

URLFILTER_FAST_MAX_LENGTH
public static final String URLFILTER_FAST_MAX_LENGTH
See Also: Constant Field Values

URLFILTER_FAST_PATH_MAX_LENGTH
public static final String URLFILTER_FAST_PATH_MAX_LENGTH
See Also: Constant Field Values

URLFILTER_FAST_QUERY_MAX_LENGTH
public static final String URLFILTER_FAST_QUERY_MAX_LENGTH
See Also: Constant Field Values
 
Method Detail

setConf
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable

getConf
public Configuration getConf()
Specified by: getConf in interface Configurable

filter
public String filter(String url)
Description copied from interface: URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null
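
The return contract of filter (the URL itself to keep it, null to drop it) can be illustrated from the caller's side. This sketch does not use Nutch classes: a UnaryOperator<String> stands in for the filter method, and the class FilterContractSketch and the method applyFilter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical caller-side illustration of the "return null to delete" contract. */
class FilterContractSketch {

  /** Applies a filter function to each URL and keeps only the non-null results. */
  static List<String> applyFilter(UnaryOperator<String> filter, List<String> urls) {
    List<String> kept = new ArrayList<>();
    for (String url : urls) {
      String result = filter.apply(url);
      // A null result means the filter "deleted" the URL.
      if (result != null) {
        kept.add(result);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Stand-in for a real filter: rejects any URL containing "/excluded".
    UnaryOperator<String> standIn = url -> url.contains("/excluded") ? null : url;
    List<String> urls = List.of(
        "https://www.example.org/index.html",
        "https://www.example.org/path/excluded/page.html");
    System.out.println(applyFilter(standIn, urls));
  }
}
```

With a configured FastURLFilter instance, the stand-in could presumably be replaced by a method reference to its filter method, since the signature String filter(String url) fits UnaryOperator<String>.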

reloadRules
public void reloadRules() throws IOException
Throws: IOException