Package org.apache.nutch.urlfilter.api
Class RegexURLFilterBase
- java.lang.Object
  - org.apache.nutch.urlfilter.api.RegexURLFilterBase

- All Implemented Interfaces:
- Configurable, URLFilter, Pluggable
- Direct Known Subclasses:
- AutomatonURLFilter, RegexURLFilter
 
public abstract class RegexURLFilterBase extends Object implements URLFilter

Generic URLFilter based on regular expressions.

The regular expression rules are expressed in a file. The rules file is determined for each implementation by the getRulesReader(Configuration conf) method.

The file contains one rule per line, in the format:
 [+-]<regex>
where plus (+) means go ahead and index the URL, and minus (-) means do not.
- Author:
- Jérôme Charron
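
The rule format described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class RuleFileSketch and its members are hypothetical, and three behaviours are assumptions rather than statements from this page (blank lines and lines starting with # are skipped, the first matching rule decides, and a URL that matches no rule is rejected).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in that mimics the [+-]<regex> rule format. */
class RuleFileSketch {

  /** One parsed rule: the sign plus the compiled regular expression. */
  static final class Rule {
    final boolean accept;
    final Pattern pattern;

    Rule(boolean accept, String regex) {
      this.accept = accept;
      this.pattern = Pattern.compile(regex);
    }
  }

  /** Parses one rule per line. Skipping blank and '#' lines is an assumption. */
  static List<Rule> parse(String rules) {
    List<Rule> parsed = new ArrayList<>();
    for (String raw : rules.split("\n")) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      char sign = line.charAt(0);
      if (sign != '+' && sign != '-') {
        throw new IllegalArgumentException("Invalid rule: " + line);
      }
      parsed.add(new Rule(sign == '+', line.substring(1)));
    }
    return parsed;
  }

  /** Returns the URL if accepted, or null if rejected (first match wins: assumption). */
  static String filter(List<Rule> rules, String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.accept ? url : null;
      }
    }
    return null; // no rule matched: treated as rejected in this sketch
  }

  public static void main(String[] args) {
    String rules = String.join("\n",
        "# skip common image suffixes",
        "-\\.(gif|jpg|png)$",
        "+^https?://");
    List<Rule> parsed = parse(rules);
    System.out.println(filter(parsed, "http://example.com/index.html"));
    System.out.println(filter(parsed, "http://example.com/logo.png"));
  }
}
```

In this sketch the order of rules matters: the exclusion for image suffixes is listed before the general inclusion so that it is evaluated first.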
 
Field Summary
Fields (modifier and type, field, description):
- protected boolean hasHostDomainRules: Whether there are host- or domain-specific rules.
Fields inherited from interface org.apache.nutch.net.URLFilter:
- X_POINT_ID
 
Constructor Summary
Constructors (modifier, constructor, description):
- RegexURLFilterBase(): Constructs a new empty RegexURLFilterBase.
- RegexURLFilterBase(File filename): Constructs a new RegexURLFilter and initializes it with a file of rules.
- protected RegexURLFilterBase(Reader reader): Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- RegexURLFilterBase(String rules): Constructs a new RegexURLFilter and initializes it with a list of rules.

Method Summary
Methods (modifier and type, method, description):
- protected abstract RegexRule createRule(boolean sign, String regex): Creates a new RegexRule.
- protected abstract RegexRule createRule(boolean sign, String regex, String hostOrDomain): Creates a new RegexRule.
- String filter(String url): Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
- Configuration getConf()
- protected abstract Reader getRulesReader(Configuration conf): Returns a Reader for the file of rules to use for a particular implementation.
- static void main(RegexURLFilterBase filter, String[] args): Filters the standard input using a RegexURLFilterBase.
- void setConf(Configuration conf)
 
Field Detail
hasHostDomainRules
protected boolean hasHostDomainRules
Whether there are host- or domain-specific rules. If there are no specific rules, host and domain name are not extracted from the URL, which speeds up the matching. readRules(Reader) automatically sets this to true if host- or domain-specific rules are used in the rule file.
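
The effect of this flag can be sketched as follows. This is an illustration under assumptions, not the Nutch code: HostExtractionSketch, hostFor and the hostExtractions counter are hypothetical names, and java.net.URI is used here only as a convenient way to obtain a host name.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch: parse the URL for its host only when rules need it. */
class HostExtractionSketch {
  /** Mirrors the idea of the documented flag; set by hand in this sketch. */
  boolean hasHostDomainRules = false;

  /** Counts how often the (comparatively costly) URL parse actually ran. */
  int hostExtractions = 0;

  /** Returns the host of the URL, or null when no rule needs it or it cannot be parsed. */
  String hostFor(String url) {
    if (!hasHostDomainRules) {
      return null; // no host- or domain-specific rules: skip the parse entirely
    }
    hostExtractions++;
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    HostExtractionSketch sketch = new HostExtractionSketch();
    System.out.println(sketch.hostFor("http://example.com/a")); // flag is false, nothing parsed
    sketch.hasHostDomainRules = true;
    System.out.println(sketch.hostFor("http://example.com/a")); // host is extracted now
  }
}
```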
 
Constructor Detail
RegexURLFilterBase
public RegexURLFilterBase()
Constructs a new empty RegexURLFilterBase.

RegexURLFilterBase
public RegexURLFilterBase(File filename) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a file of rules.
- Parameters:
- filename - the name of the rules file.
- Throws:
- IOException - if there is a fatal I/O error interpreting the input File
- IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
 
RegexURLFilterBase
public RegexURLFilterBase(String rules) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a list of rules.
- Parameters:
- rules - a string with a list of rules, one rule per line
- Throws:
- IOException - if there is a fatal I/O error interpreting the input rules
- IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
 
RegexURLFilterBase
protected RegexURLFilterBase(Reader reader) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- Parameters:
- reader - a reader of rules.
- Throws:
- IOException - if there is a fatal I/O error interpreting the input Reader
- IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
 
 
Method Detail
createRule
protected abstract RegexRule createRule(boolean sign, String regex)
Creates a new RegexRule.
- Parameters:
- sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
- regex - the regular expression associated with this rule.
- Returns:
- RegexRule
 
createRule
protected abstract RegexRule createRule(boolean sign, String regex, String hostOrDomain)
Creates a new RegexRule.
- Parameters:
- sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
- regex - the regular expression associated with this rule.
- hostOrDomain - the host or domain to which this regex belongs
- Returns:
- RegexRule
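
How a subclass might supply rules through the two createRule factory methods can be sketched with stand-in types. Everything named here (CreateRuleSketch, SketchRule, SketchFilterBase, JavaRegexFilter, and the accept, match and hostOrDomain methods) is hypothetical and only mirrors the documented signatures; it does not reproduce the real RegexRule class. Matching with Matcher.find() is likewise an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the createRule factory-method pattern. */
class CreateRuleSketch {

  /** Stand-in for a rule: a sign, an optional host or domain, and a match test. */
  abstract static class SketchRule {
    private final boolean sign;
    private final String hostOrDomain;

    SketchRule(boolean sign, String hostOrDomain) {
      this.sign = sign;
      this.hostOrDomain = hostOrDomain;
    }

    /** True if matching URLs must be included, false if they must be excluded. */
    boolean accept() { return sign; }

    /** Host or domain this rule is restricted to, or null for a general rule. */
    String hostOrDomain() { return hostOrDomain; }

    abstract boolean match(String url);
  }

  /** Stand-in base class declaring the two factory methods. */
  abstract static class SketchFilterBase {
    protected abstract SketchRule createRule(boolean sign, String regex);
    protected abstract SketchRule createRule(boolean sign, String regex, String hostOrDomain);
  }

  /** One possible implementation backed by java.util.regex. */
  static final class JavaRegexFilter extends SketchFilterBase {
    @Override
    protected SketchRule createRule(boolean sign, String regex) {
      return createRule(sign, regex, null); // general rule: no host or domain
    }

    @Override
    protected SketchRule createRule(boolean sign, String regex, String hostOrDomain) {
      final Pattern pattern = Pattern.compile(regex);
      return new SketchRule(sign, hostOrDomain) {
        @Override
        boolean match(String url) {
          return pattern.matcher(url).find();
        }
      };
    }
  }

  public static void main(String[] args) {
    SketchRule rule = new JavaRegexFilter().createRule(false, "\\.png$", "example.com");
    System.out.println(rule.accept() + " " + rule.match("http://example.com/a.png"));
  }
}
```

Keeping rule construction behind a factory method lets each subclass pick its own regular expression engine while the base class handles reading and ordering the rules.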
 
getRulesReader
protected abstract Reader getRulesReader(Configuration conf) throws IOException
Returns a Reader for the file of rules to use for a particular implementation.
- Parameters:
- conf - the current configuration.
- Returns:
- a Reader for the resource containing the rules to use.
- Throws:
- IOException - if there is a fatal error obtaining the Reader
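
A subclass decides where its rules come from by implementing this method. The sketch below illustrates the idea with stand-ins only: java.util.Properties takes the place of Configuration, and RulesReaderSketch, SketchBase, InlineRules, countRules and the property name sketch.rules are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch: each implementation chooses the source of its rules. */
class RulesReaderSketch {

  /** Stand-in base class; Properties replaces Configuration in this sketch. */
  abstract static class SketchBase {
    protected abstract Reader getRulesReader(Properties conf) throws IOException;

    /** Reads the rules through the subclass-provided Reader and counts non-blank lines. */
    int countRules(Properties conf) {
      try (BufferedReader in = new BufferedReader(getRulesReader(conf))) {
        int count = 0;
        String line;
        while ((line = in.readLine()) != null) {
          if (!line.trim().isEmpty()) {
            count++;
          }
        }
        return count;
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }

  /** Implementation that takes the rules inline from a (hypothetical) property. */
  static final class InlineRules extends SketchBase {
    @Override
    protected Reader getRulesReader(Properties conf) throws IOException {
      String rules = conf.getProperty("sketch.rules");
      if (rules == null) {
        throw new IOException("No rules configured under sketch.rules");
      }
      return new StringReader(rules);
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("sketch.rules", "-\\.png$\n+.");
    System.out.println(new InlineRules().countRules(conf));
  }
}
```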
 
filter
public String filter(String url)
Description copied from interface: URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.

setConf
public void setConf(Configuration conf)
- Specified by:
- setConf in interface Configurable
 
getConf
public Configuration getConf()
- Specified by:
- getConf in interface Configurable
 
main
public static void main(RegexURLFilterBase filter, String[] args) throws IOException, IllegalArgumentException
Filters the standard input using a RegexURLFilterBase.
- Parameters:
- filter - the RegexURLFilterBase to use for filtering the standard input.
- args - some optional parameters (not used).
- Throws:
- IOException - if there is a fatal I/O error interpreting the input arguments
- IllegalArgumentException - if there is a fatal error processing the input arguments
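
Filtering the standard input line by line might look like the following sketch. It does not use Nutch classes: a UnaryOperator<String> stands in for the filter (returning the URL or null), the names StdinFilterSketch and run are hypothetical, and the output convention of prefixing accepted URLs with + and rejected ones with - is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of a stdin-driven filter test harness. */
class StdinFilterSketch {

  /** Applies the filter to each input line and reports the decision per line. */
  static void run(UnaryOperator<String> filter, BufferedReader in, PrintWriter out) {
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String result = filter.apply(line);
        // Assumed output format: "+" for accepted, "-" for rejected URLs.
        out.println(result != null ? "+" + result : "-" + line);
      }
      out.flush();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Toy filter for the demo: accept only URLs that start with "http".
    UnaryOperator<String> httpOnly = url -> url.startsWith("http") ? url : null;
    run(httpOnly,
        new BufferedReader(new InputStreamReader(System.in)),
        new PrintWriter(System.out));
  }
}
```

Run this way, each URL typed or piped into the program is echoed back with its decision, which is convenient for checking a rule file by hand.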
 
 