Package org.apache.nutch.scoring.depth
Class DepthScoringFilter
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.scoring.depth.DepthScoringFilter

All Implemented Interfaces:
- Configurable, Pluggable, ScoringFilter
 
public class DepthScoringFilter extends Configured implements ScoringFilter

This scoring filter limits the number of hops from the initial seed URLs. If the number of hops exceeds the maximum depth (either the default value or the one set in the injector file), all outlinks from that URL are discarded, which stops further crawling along this path.
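
The depth limit described above can be illustrated without any Nutch classes. The following is a minimal sketch, not the filter's actual implementation: the class name `DepthLimitSketch`, the method `crawl`, the in-memory link map, the choice to count the seed as depth 1, and the `>=` comparison are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, Nutch-free illustration of depth-limited link following. */
final class DepthLimitSketch {

    /**
     * Walks a link graph breadth-first from the seed and records the depth at
     * which each URL is first reached. Once a page sits at maxDepth, all of
     * its outlinks are dropped, so nothing deeper is ever visited.
     */
    static Map<String, Integer> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        Map<String, Integer> depthByUrl = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthByUrl.put(seed, 1); // assumption: the seed counts as depth 1
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthByUrl.get(url);
            if (depth >= maxDepth) {
                continue; // limit reached: discard every outlink of this page
            }
            for (String target : links.getOrDefault(url, List.of())) {
                if (!depthByUrl.containsKey(target)) {
                    depthByUrl.put(target, depth + 1);
                    queue.add(target);
                }
            }
        }
        return depthByUrl;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "seed", List.of("a", "b"),
            "a", List.of("c"),
            "c", List.of("d"));
        // With maxDepth = 3, "c" is reached at depth 3 and its outlink "d" is dropped.
        System.out.println(crawl(links, "seed", 3));
    }
}
```

In the sketch, lowering `maxDepth` to 1 leaves only the seed itself, which mirrors the "stop crawling along this path" effect in the class description.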

Field Summary

Fields (Modifier and Type, Field):
- static int DEFAULT_MAX_DEPTH
- static String DEPTH_KEY
- static Text DEPTH_KEY_W
- static String MAX_DEPTH_KEY
- static Text MAX_DEPTH_KEY_W

Fields inherited from interface org.apache.nutch.scoring.ScoringFilter:
- X_POINT_ID
 
Constructor Summary

Constructors:
- DepthScoringFilter()

Method Summary

Methods (Modifier and Type, Method: Description):
- CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount): Distribute score value from the current page to all its outlinked pages.
- float generatorSortValue(Text url, CrawlDatum datum, float initSort): This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
- float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore): This method calculates an indexed document score/boost.
- void initialScore(Text url, CrawlDatum datum): Set an initial score for newly discovered pages.
- void injectedScore(Text url, CrawlDatum datum): Set an initial score for newly injected pages.
- void passScoreAfterParsing(Text url, Content content, Parse parse): Currently a part of score distribution is performed using only data coming from the parsing process.
- void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content): This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
- void setConf(Configuration conf)
- void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked): This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.

Methods inherited from class org.apache.hadoop.conf.Configured:
- getConf

Methods inherited from class java.lang.Object:
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable:
- getConf

Methods inherited from interface org.apache.nutch.scoring.ScoringFilter:
- orphanedScore

Field Detail

DEPTH_KEY
public static final String DEPTH_KEY
- See Also: Constant Field Values

DEPTH_KEY_W
public static final Text DEPTH_KEY_W

MAX_DEPTH_KEY
public static final String MAX_DEPTH_KEY
- See Also: Constant Field Values

MAX_DEPTH_KEY_W
public static final Text MAX_DEPTH_KEY_W

DEFAULT_MAX_DEPTH
public static final int DEFAULT_MAX_DEPTH
- See Also: Constant Field Values

Method Detail

setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
- Overrides: setConf in class Configured
 
distributeScoreToOutlinks
public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) throws ScoringFilterException

Description copied from interface: ScoringFilter
Distribute score value from the current page to all its outlinked pages.
- Specified by: distributeScoreToOutlinks in interface ScoringFilter
- Parameters:
  - fromUrl - URL of the source page
  - parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
  - targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
  - adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
  - allCount - number of all collected outlinks from the source page
- Returns: if needed, implementations may return an instance of CrawlDatum, with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
- Throws:
  - ScoringFilterException - if there is a fatal error distributing score data from the current page to all of its outlinks
 
generatorSortValue
public float generatorSortValue(Text url, CrawlDatum datum, float initSort) throws ScoringFilterException

Description copied from interface: ScoringFilter
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
- Specified by: generatorSortValue in interface ScoringFilter
- Parameters:
  - url - URL of the page
  - datum - page's datum, should not be modified
  - initSort - initial sort value, or a value from previous filters in chain
- Returns: a sort value for use in sorting and selecting the top N scoring pages during fetchlist generation
- Throws:
  - ScoringFilterException - if there is a fatal error preparing the sort value
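
The `initSort` parameter implies that sort values are threaded through a chain of filters, each receiving the previous filter's result. A minimal sketch of that chaining pattern follows. The `SortStep` interface and `runChain` method are hypothetical stand-ins written for this example; they are not the Nutch `ScoringFilter` interface, and the two sample steps do not represent what any real filter computes.

```java
import java.util.List;

/** Hypothetical illustration of passing a sort value through a filter chain. */
final class SortValueChainSketch {

    /** Stand-in for one filter's sort-value step; not part of the Nutch API. */
    interface SortStep {
        float apply(String url, float initSort);
    }

    /** Feeds the output of each step into the next, starting from initSort. */
    static float runChain(List<SortStep> steps, String url, float initSort) {
        float value = initSort;
        for (SortStep step : steps) {
            value = step.apply(url, value);
        }
        return value;
    }

    public static void main(String[] args) {
        SortStep passThrough = (url, sort) -> sort;      // leaves the value unchanged
        SortStep doubling = (url, sort) -> sort * 2.0f;  // arbitrary example adjustment
        System.out.println(runChain(List.of(passThrough, doubling), "http://example.com/", 1.5f));
    }
}
```

A step that has no opinion about ordering can simply return its input, as `passThrough` does, so that it does not disturb the values produced by other steps in the chain.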
 
indexerScore
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException

Description copied from interface: ScoringFilter
This method calculates an indexed document score/boost.
- Specified by: indexerScore in interface ScoringFilter
- Parameters:
  - url - URL of the page
  - doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
  - dbDatum - current page from CrawlDb. NOTE:
    - changes made to this instance are not persisted
    - may be null if indexing is done without CrawlDb or if the segment is generated not from the CrawlDb (via FreeGenerator).
  - fetchDatum - datum from FetcherOutput (containing, among other things, the fetching status)
  - parse - parsing result. NOTE: changes made to this instance are not persisted.
  - inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
  - initScore - initial boost value for the indexed document.
- Returns: boost value for the indexed document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying the indexed document directly.
- Throws:
  - ScoringFilterException - if there is a fatal error whilst calculating the indexed document score/boost
 
initialScore
public void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException

Description copied from interface: ScoringFilter
Set an initial score for newly discovered pages. Note: newly discovered pages have at least one inlink with its score contribution, so filter implementations may choose to set initial score to zero (unknown value), and then the inlink score contribution will set the "real" value of the new page.
- Specified by: initialScore in interface ScoringFilter
- Parameters:
  - url - URL of the page
  - datum - new datum. Filters will modify it in-place.
- Throws:
  - ScoringFilterException - if there is a fatal error setting an initial score for newly discovered pages
 
injectedScore
public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException

Description copied from interface: ScoringFilter
Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.
- Specified by: injectedScore in interface ScoringFilter
- Parameters:
  - url - URL of the page
  - datum - new datum. Filters will modify it in-place.
- Throws:
  - ScoringFilterException - if there is a fatal error setting an initial score for newly injected pages
 
passScoreAfterParsing
public void passScoreAfterParsing(Text url, Content content, Parse parse) throws ScoringFilterException

Description copied from interface: ScoringFilter
Currently a part of score distribution is performed using only data coming from the parsing process. We need this method in order to ensure the presence of score data in these steps.
- Specified by: passScoreAfterParsing in interface ScoringFilter
- Parameters:
  - url - page URL
  - content - original content. NOTE: modifications to this value are not persisted.
  - parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
- Throws:
  - ScoringFilterException - if there is a fatal error processing score data in subsequent steps after parsing
 
passScoreBeforeParsing
public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) throws ScoringFilterException

Description copied from interface: ScoringFilter
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. This is needed in order to pass the value(s) to the mechanism that distributes them to outlinked pages.
- Specified by: passScoreBeforeParsing in interface ScoringFilter
- Parameters:
  - url - URL of the page
  - datum - source datum. NOTE: modifications to this value are not persisted.
  - content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
- Throws:
  - ScoringFilterException - if there is a fatal error injecting score information from the current datum into Content metadata
 
updateDbScore
public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) throws ScoringFilterException

Description copied from interface: ScoringFilter
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
- Specified by: updateDbScore in interface ScoringFilter
- Parameters:
  - url - URL of the page
  - old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values, because the datum parameter may contain values that are no longer valid if other updates occurred between generation and this update.
  - datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the CrawlDb.
  - inlinked - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
- Throws:
  - ScoringFilterException - if there is a fatal error calculating a new score of CrawlDatum during CrawlDb update
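
Because a page can be reached along several paths, a depth-tracking filter has to reconcile the value stored in `old` with whatever the entries in `inlinked` propose. One plausible policy, shown here purely as an assumption and not as this class's documented behaviour, is to keep the smallest depth seen. The class `MinInlinkDepthSketch`, the method `resolveDepth`, and the use of plain integers instead of `CrawlDatum` metadata are hypothetical.

```java
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical, Nutch-free illustration; none of these names come from the Nutch API. */
final class MinInlinkDepthSketch {

    /**
     * Picks the depth to store for a page during an update.
     *
     * @param storedDepth  depth already recorded for the page, empty for a newly discovered page
     * @param inlinkDepths depths proposed by the inlinks seen in the current update batch
     * @return the smallest known depth, or Integer.MAX_VALUE when nothing is known
     */
    static int resolveDepth(OptionalInt storedDepth, List<Integer> inlinkDepths) {
        int best = storedDepth.orElse(Integer.MAX_VALUE);
        for (int proposed : inlinkDepths) {
            best = Math.min(best, proposed); // a shorter path from a seed wins
        }
        return best;
    }

    public static void main(String[] args) {
        // Newly discovered page: only the inlinks contribute.
        System.out.println(resolveDepth(OptionalInt.empty(), List.of(4, 2, 3)));
        // Known page: the stored depth is kept when no inlink offers a shorter path.
        System.out.println(resolveDepth(OptionalInt.of(1), List.of(4, 2)));
    }
}
```

Since `inlinked` is described as a partial list, a policy like this can only tighten the depth over successive updates; it cannot prove that a shorter path does not exist.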
 
 