apache-nutch 1.20 API
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.
| Package | Description | 
|---|---|
| org.apache.nutch.analysis.lang | Text document language identifier. | 
| org.apache.nutch.collection | Subcollection is a subset of an index. | 
| org.apache.nutch.crawl | Crawl control code and tools to run the crawler. | 
| org.apache.nutch.exchange | Control code for the exchange component, which acts in the indexing job and decides to which index writer a document should be routed, based on plugin behavior. | 
| org.apache.nutch.exchange.jexl | Plugin of Exchange component based on JEXL expressions. | 
| org.apache.nutch.fetcher | The Nutch multi-threaded fetching module | 
| org.apache.nutch.hostdb | |
| org.apache.nutch.indexer | Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index. | 
| org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. | 
| org.apache.nutch.indexer.arbitrary | Indexing filter to add arbitrary document data to the index from the output of a user-specified class. | 
| org.apache.nutch.indexer.basic | A basic indexing plugin, adds basic fields: url, host, title, content, etc. | 
| org.apache.nutch.indexer.feed | Indexing filter to index meta data from RSS feeds. | 
| org.apache.nutch.indexer.filter | |
| org.apache.nutch.indexer.geoip | This plugin implements an indexing filter which takes advantage of the GeoIP2-java API. | 
| org.apache.nutch.indexer.jexl | This plugin implements a dynamic indexing filter which uses JEXL expressions to allow filtering based on the page's metadata. | 
| org.apache.nutch.indexer.links | |
| org.apache.nutch.indexer.metadata | Indexing filter to add document metadata to the index. | 
| org.apache.nutch.indexer.more | A "more" indexing plugin, adds "more" index fields: last modified date, MIME type, content length. | 
| org.apache.nutch.indexer.replace | Indexing filter to allow pattern replacements on metadata. | 
| org.apache.nutch.indexer.staticfield | A simple plugin called at indexing that adds fields with static data. | 
| org.apache.nutch.indexer.subcollection | Indexing filter to assign documents to subcollections. | 
| org.apache.nutch.indexer.tld | Top Level Domain Indexing plugin. | 
| org.apache.nutch.indexer.urlmeta | URL Meta Tag Indexing Plugin | 
| org.apache.nutch.indexwriter.cloudsearch | |
| org.apache.nutch.indexwriter.csv | Index writer plugin to write a plain CSV file. | 
| org.apache.nutch.indexwriter.dummy | Index writer plugin for debugging, writes pairs of `<action, url>` to a text file, action is one of "add", "update", or "delete". | 
| org.apache.nutch.indexwriter.elastic | Index writer plugin for Elasticsearch. | 
| org.apache.nutch.indexwriter.kafka | Index writer plugin to produce JSON messages to Kafka. | 
| org.apache.nutch.indexwriter.opensearch1x | Index writer plugin for OpenSearch. | 
| org.apache.nutch.indexwriter.rabbit | |
| org.apache.nutch.indexwriter.solr | Index writer plugin for Apache Solr. | 
| org.apache.nutch.metadata | A multi-valued metadata container, and set of constant fields for Nutch metadata. | 
| org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. | 
| org.apache.nutch.net | Web-related interfaces: URL filters and normalizers. | 
| org.apache.nutch.net.protocols | Helper classes related to the Protocol interface, see also org.apache.nutch.protocol. | 
| org.apache.nutch.net.urlnormalizer.ajax | |
| org.apache.nutch.net.urlnormalizer.basic | URL normalizer performing basic normalizations: remove default ports, e.g., port 80 for http:// URLs; remove needless slashes and dot segments in the path component; remove anchors; use percent-encoding (only) where needed. E.g., `https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor` is normalized to `https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol`. Optional and configurable normalizations are: convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property `urlnormalizer.basic.host.idn`; remove a trailing dot from host names, see property `urlnormalizer.basic.host.trim-trailing-dot`. | 
| org.apache.nutch.net.urlnormalizer.host | URL normalizer renaming hosts to a canonical form listed in the configuration file. | 
| org.apache.nutch.net.urlnormalizer.pass | URL normalizer dummy which does not change URLs. | 
| org.apache.nutch.net.urlnormalizer.protocol | URL normalizer to normalize the protocol for all URLs of a given host or domain. | 
| org.apache.nutch.net.urlnormalizer.querystring | URL normalizer which sorts the elements in the query part to avoid duplicates by permutations. | 
| org.apache.nutch.net.urlnormalizer.regex | URL normalizer with configurable rules based on regular expressions (Pattern). | 
| org.apache.nutch.net.urlnormalizer.slash | |
| org.apache.nutch.parse | The Parse interface and related classes. | 
| org.apache.nutch.parse.ext | Parse wrapper to run external command to do the parsing. | 
| org.apache.nutch.parse.feed | Parse RSS feeds. | 
| org.apache.nutch.parse.headings | Parse filter to extract headings (h1, h2, etc.) from DOM parse tree. | 
| org.apache.nutch.parse.html | An HTML document parsing plugin. | 
| org.apache.nutch.parse.js | Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets. | 
| org.apache.nutch.parse.metatags | Parse filter to extract meta tags: keywords, description, etc. | 
| org.apache.nutch.parse.tika | Parse various document formats with help of Apache Tika. | 
| org.apache.nutch.parse.zip | Parse ZIP files: embedded files are recursively passed to appropriate parsers. | 
| org.apache.nutch.parsefilter.debug | Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML). | 
| org.apache.nutch.parsefilter.naivebayes | HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text (using a training file where you can give positive and negative example texts, see the description of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist. | 
| org.apache.nutch.parsefilter.regex | RegexParseFilter. | 
| org.apache.nutch.plugin | The Nutch Plugin System. | 
| org.apache.nutch.protocol | Classes related to the Protocol interface, see also org.apache.nutch.net.protocols. | 
| org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. | 
| org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. | 
| org.apache.nutch.protocol.htmlunit | Protocol plugin which supports retrieving documents via HTTP/HTTPS using Selenium and the HtmlUnitDriver web driver for the HtmlUnit headless browser. | 
| org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the http protocol. | 
| org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient, etc.) | 
| org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. | 
| org.apache.nutch.protocol.interactiveselenium | Protocol plugin which supports retrieving documents using and interacting with Selenium. | 
| org.apache.nutch.protocol.interactiveselenium.handlers | Handler implementations to interact with Selenium for org.apache.nutch.protocol.interactiveselenium. | 
| org.apache.nutch.protocol.okhttp | Protocol plugin for HTTP/HTTPS based on okhttp, supports HTTP/1.1 and/or HTTP/2. | 
| org.apache.nutch.protocol.selenium | Protocol plugin which supports retrieving documents via Selenium. | 
| org.apache.nutch.publisher | |
| org.apache.nutch.publisher.rabbitmq | Publisher package to implement queues | 
| org.apache.nutch.rabbitmq | |
| org.apache.nutch.scoring | The ScoringFilter interface. | 
| org.apache.nutch.scoring.depth | Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs). | 
| org.apache.nutch.scoring.link | Scoring filter used in conjunction with WebGraph. | 
| org.apache.nutch.scoring.metadata | Metadata Scoring Plugin | 
| org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. | 
| org.apache.nutch.scoring.orphan | Scoring filter to modify score or status of orphaned pages (no inlinks found for a configurable amount of time). | 
| org.apache.nutch.scoring.similarity | |
| org.apache.nutch.scoring.similarity.cosine | Implements the cosine similarity metric for scoring relevant documents | 
| org.apache.nutch.scoring.similarity.util | Utility package for Lucene functions. | 
| org.apache.nutch.scoring.tld | Top Level Domain Scoring plugin. | 
| org.apache.nutch.scoring.urlmeta | URL Meta Tag Scoring Plugin | 
| org.apache.nutch.scoring.webgraph | |
| org.apache.nutch.segment | A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links. | 
| org.apache.nutch.service | |
| org.apache.nutch.service.impl | |
| org.apache.nutch.service.model.request | |
| org.apache.nutch.service.model.response | |
| org.apache.nutch.service.resources | |
| org.apache.nutch.tools | Miscellaneous tools. | 
| org.apache.nutch.tools.arc | Tools to read the Arc file format. | 
| org.apache.nutch.tools.warc | Tools to import / export between Nutch segments and WARC archives. | 
| org.apache.nutch.urlfilter.api | Generic URL filter library, abstracting away from regular expression implementations. | 
| org.apache.nutch.urlfilter.automaton | URL filter plugin based on dk.brics.automaton Finite-State Automata for Java. | 
| org.apache.nutch.urlfilter.domain | URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names. | 
| org.apache.nutch.urlfilter.domaindenylist | URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names. | 
| org.apache.nutch.urlfilter.fast | URL filter plugin that first does fast exact suffix matches on host/domain names before applying regular expressions to the path component of a URL. | 
| org.apache.nutch.urlfilter.ignoreexempt | URL filter plugin which identifies exemptions to external URLs when external URLs are set to be ignored. | 
| org.apache.nutch.urlfilter.prefix | URL filter plugin to include only URLs which match one of a given list of URL prefixes. | 
| org.apache.nutch.urlfilter.regex | URL filter plugin to include and/or exclude URLs matching Java regular expressions. | 
| org.apache.nutch.urlfilter.suffix | URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes. | 
| org.apache.nutch.urlfilter.validator | URL filter plugin that validates given URLs. | 
| org.apache.nutch.util | Miscellaneous utility classes. | 
| org.apache.nutch.util.domain | Classes for domain name analysis. | 
| org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
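
The basic normalizations listed for org.apache.nutch.net.urlnormalizer.basic (dropping default ports, removing dot segments and needless slashes, removing anchors) can be illustrated with a minimal sketch built only on java.net.URI. This is not the plugin's implementation: the class name BasicNormalizeSketch and its normalize method are hypothetical, and percent-encoding and IDN handling are deliberately left out.

```java
import java.net.URI;
import java.util.Locale;

/** Illustrative sketch only; hypothetical class, not part of Nutch. */
public class BasicNormalizeSketch {

  /** Applies a few basic normalizations to an absolute http(s) URL. */
  static String normalize(String url) {
    // URI.normalize() removes "." and ".." segments from the path.
    URI uri = URI.create(url).normalize();
    String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
    String host = uri.getHost().toLowerCase(Locale.ROOT);

    // Detect a port that is the default for the scheme.
    int port = uri.getPort();
    boolean defaultPort = (scheme.equals("http") && port == 80)
        || (scheme.equals("https") && port == 443);

    // Collapse runs of slashes; an empty path becomes "/".
    String path = uri.getRawPath();
    path = (path == null || path.isEmpty()) ? "/" : path.replaceAll("/{2,}", "/");

    StringBuilder out = new StringBuilder();
    out.append(scheme).append("://").append(host);
    if (port != -1 && !defaultPort) {
      out.append(':').append(port);
    }
    out.append(path);
    if (uri.getRawQuery() != null) {
      out.append('?').append(uri.getRawQuery());
    }
    // The fragment (anchor) is intentionally not appended.
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(
        normalize("http://www.example.org:80/a/../b//./page.html?x=1#anchor"));
  }
}
```

Running main should print http://www.example.org/b/page.html?x=1. In an actual crawl, normalization is applied by the configured URL normalizer plugins rather than by hand-written code like this.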