Class SuffixURLFilter
- java.lang.Object
  - org.apache.nutch.urlfilter.suffix.SuffixURLFilter

All Implemented Interfaces:
Configurable, URLFilter, Pluggable

public class SuffixURLFilter extends Object implements URLFilter

Filters URLs based on a file of URL suffixes. The file is named by:
- property "urlfilter.suffix.file" in ./conf/nutch-default.xml, and
- attribute "file" in plugin.xml of this plugin

This filter can be configured to work in one of two modes:
- default to reject ('-'): in this mode, only URLs that match suffixes specified in the config file will be accepted; all other URLs will be rejected.
- default to accept ('+'): in this mode, only URLs that match suffixes specified in the config file will be rejected; all other URLs will be accepted.

The format of this config file is one URL suffix per line, with no preceding whitespace. The order in which suffixes are specified doesn't matter. Blank lines and comments (#) are allowed.

A single '+' or '-' sign not followed by any suffix must be used once, to signify the mode this plugin operates in. An optional single 'I' can be appended, to signify that suffix matches should be case-insensitive. The default, if not specified, is to use case-sensitive matches, i.e. suffix '.JPG' does not match '.jpg'.

NOTE: the format of this file is different from urlfilter-prefix, because that plugin doesn't support allowed/prohibited prefixes (only supports allowed prefixes).

Please note that this plugin does not support regular expressions; it only accepts literal suffixes. For example, a suffix "+*.jpg" is most probably wrong; you should use "+.jpg" instead.

Examples

Example 1

The configuration shown below will accept all URLs with '.html' or '.htm' suffixes (case-sensitive - '.HTML' or '.HTM' will be rejected), and prohibit all other suffixes.

    # this is a comment

    # prohibit all unknown, case-sensitive matching
    -

    # collect only HTML files.
    .html
    .htm

Example 2

The configuration shown below will accept all URLs except common graphical formats.

    # this is a comment

    # allow all unknown, case-insensitive matching
    +I

    # prohibited suffixes
    .gif
    .png
    .jpg
    .jpeg
    .bmp

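The file format and the two modes described above can be illustrated with a small, self-contained sketch. This is not the plugin's implementation: the class `SuffixConfigSketch` and its methods `fromText` and `filter` are hypothetical names invented for illustration. It also assumes "default to reject" when no mode line is present and trims surrounding whitespace, neither of which is stated in the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in that mimics the documented file format; not the Nutch class. */
final class SuffixConfigSketch {
    private final List<String> suffixes = new ArrayList<>();
    private boolean modeAccept = false; // assumption: reject by default if no mode line is given
    private boolean ignoreCase = false; // documented default: case-sensitive matching

    /** Parses config text: one suffix per line, '#' comments, a single '+' or '-' mode line. */
    static SuffixConfigSketch fromText(String configText) {
        SuffixConfigSketch cfg = new SuffixConfigSketch();
        for (String raw : configText.split("\\R")) {
            String line = raw.trim(); // trimming is a leniency of this sketch
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comments are allowed
            }
            char first = line.charAt(0);
            if (first == '+' || first == '-') {
                cfg.modeAccept = (first == '+');
                cfg.ignoreCase = line.indexOf('I') > 0; // optional 'I' after the sign
                continue;
            }
            cfg.suffixes.add(line);
        }
        if (cfg.ignoreCase) {
            // Normalize once so that matching can compare lower-cased strings.
            cfg.suffixes.replaceAll(s -> s.toLowerCase(Locale.ROOT));
        }
        return cfg;
    }

    /** Returns the URL unchanged if it passes, or null if it is filtered out. */
    String filter(String url) {
        if (url == null) {
            return null;
        }
        String candidate = ignoreCase ? url.toLowerCase(Locale.ROOT) : url;
        boolean matches = false;
        for (String suffix : suffixes) {
            if (candidate.endsWith(suffix)) {
                matches = true;
                break;
            }
        }
        // '+' mode: listed suffixes are rejected. '-' mode: only listed suffixes pass.
        boolean accepted = modeAccept ? !matches : matches;
        return accepted ? url : null;
    }

    public static void main(String[] args) {
        SuffixConfigSketch onlyHtml = fromText("# collect only HTML files\n-\n.html\n.htm\n");
        System.out.println(onlyHtml.filter("http://example.com/index.html")); // passes through
        System.out.println(onlyHtml.filter("http://example.com/INDEX.HTML")); // null (case differs)

        SuffixConfigSketch noImages = fromText("+I\n.gif\n.png\n.jpg\n.jpeg\n.bmp\n");
        System.out.println(noImages.filter("http://example.com/logo.PNG")); // null (image suffix)
        System.out.println(noImages.filter("http://example.com/page"));     // passes through
    }
}
```

The two configurations in `main` correspond to Example 1 and Example 2, shortened to their mode line and suffix list.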

Field Summary

Fields inherited from interface org.apache.nutch.net.URLFilter:
- X_POINT_ID

Constructor Summary

Constructors:
- SuffixURLFilter()
- SuffixURLFilter(Reader reader)

Method Summary

Methods (modifier and type, method, description):
- String filter(String url): Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
- Configuration getConf()
- boolean isIgnoreCase()
- boolean isModeAccept()
- static void main(String[] args)
- void readConfiguration(Reader reader)
- void setConf(Configuration conf)
- void setFilterFromPath(boolean filterFromPath)
- void setIgnoreCase(boolean ignoreCase)
- void setModeAccept(boolean modeAccept)

Constructor Detail

SuffixURLFilter
public SuffixURLFilter() throws IOException
Throws:
- IOException

SuffixURLFilter
public SuffixURLFilter(Reader reader) throws IOException
Throws:
- IOException

Method Detail

filter
public String filter(String url)
Description copied from interface: URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
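
The "pass through or return null" contract means callers have to check the result for null before using it. A minimal sketch of that calling pattern follows. It uses a plain `UnaryOperator<String>` as a hypothetical stand-in for a URL filter, and the class and method names are invented for illustration; it does not use the Nutch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical helper showing how a caller might treat null as "URL deleted". */
final class NullMeansRejectedSketch {
    /** Applies the filter to each URL and keeps only the non-null results. */
    static List<String> keepAccepted(List<String> urls, UnaryOperator<String> filter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String result = filter.apply(url);
            if (result != null) {
                kept.add(result); // the filter passed the URL through
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in filter (an assumption for this sketch): rejects URLs ending in ".gif".
        UnaryOperator<String> noGif = url -> url.endsWith(".gif") ? null : url;
        List<String> urls = Arrays.asList("http://example.com/a.html", "http://example.com/b.gif");
        System.out.println(keepAccepted(urls, noGif)); // [http://example.com/a.html]
    }
}
```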

readConfiguration
public void readConfiguration(Reader reader) throws IOException
Throws:
- IOException

main
public static void main(String[] args) throws IOException
Throws:
- IOException

setConf
public void setConf(Configuration conf)
Specified by:
- setConf in interface Configurable

getConf
public Configuration getConf()
Specified by:
- getConf in interface Configurable

isModeAccept
public boolean isModeAccept()

setModeAccept
public void setModeAccept(boolean modeAccept)

isIgnoreCase
public boolean isIgnoreCase()

setIgnoreCase
public void setIgnoreCase(boolean ignoreCase)

setFilterFromPath
public void setFilterFromPath(boolean filterFromPath)