Package org.apache.nutch.util
Class EncodingDetector
- java.lang.Object
  - org.apache.nutch.util.EncodingDetector

public class EncodingDetector extends Object

A simple class for detecting character encodings.

Broadly this encompasses two functions, which are distinctly separate:
1. Auto detecting a set of "clues" from input text.
2. Taking a set of clues and making a "best guess" as to the "real" encoding.

A caller will often have some extra information about what the encoding might be (e.g. from the HTTP header or HTML meta-tags, often wrong but still potentially useful clues). The types of clues may differ from caller to caller. Thus a typical calling sequence is:
- Run step (1) to generate a set of auto-detected clues;
- Combine these clues with the caller-dependent "extra clues" available;
- Run step (2) to guess what the most probable answer is.
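
The calling sequence above can be illustrated with a small self-contained model. This is only a sketch: `ClueSketch`, its nested `Clue` type, the confidence numbers, and the "highest confidence above a threshold wins" policy are all hypothetical stand-ins chosen for illustration. Only the method names `addClue`, `clearClues`, and `guessEncoding` mirror this class, and the real selection heuristics may well differ.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for the clue-based flow. This is NOT the Nutch class;
 * the selection policy below is an assumption made for illustration.
 */
final class ClueSketch {

    /** One hint about the encoding: charset name, origin, and confidence. */
    static final class Clue {
        final String value;
        final String source;
        final int confidence;

        Clue(String value, String source, int confidence) {
            this.value = value;
            this.source = source;
            this.confidence = confidence;
        }
    }

    private final List<Clue> clues = new ArrayList<>();
    private final int minConfidence;

    ClueSketch(int minConfidence) {
        this.minConfidence = minConfidence;
    }

    /** Records a clue; null or empty charset names are ignored. */
    void addClue(String value, String source, int confidence) {
        if (value != null && !value.isEmpty()) {
            clues.add(new Clue(value, source, confidence));
        }
    }

    /** Forgets all clues so the instance can be reused for the next document. */
    void clearClues() {
        clues.clear();
    }

    /**
     * Assumed policy: pick the clue with the highest confidence that reaches
     * the threshold; fall back to defaultValue when none qualifies.
     */
    String guessEncoding(String defaultValue) {
        Clue best = null;
        for (Clue c : clues) {
            if (c.confidence >= minConfidence
                    && (best == null || c.confidence > best.confidence)) {
                best = c;
            }
        }
        return best == null ? defaultValue : best.value;
    }

    public static void main(String[] args) {
        ClueSketch detector = new ClueSketch(50);
        // Step (1): a clue that an automatic detector might have produced.
        detector.addClue("windows-1252", "detect", 35);
        // Caller-supplied extra clue, e.g. taken from an HTTP header.
        detector.addClue("UTF-8", "header", 80);
        // Step (2): combine everything into one best guess.
        System.out.println(detector.guessEncoding("ISO-8859-1"));
        // Reset before handling the next document.
        detector.clearClues();
        System.out.println(detector.guessEncoding("ISO-8859-1"));
    }
}
```

Keeping clue collection separate from the final guess is what lets each caller mix in its own hints before a decision is made.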
 
Field Summary

Fields (Modifier and Type, Field, Description):
- static String MIN_CONFIDENCE_KEY
- static int NO_THRESHOLD

Constructor Summary

Constructors (Constructor, Description):
- EncodingDetector(Configuration conf)

Method Summary

Methods (Modifier and Type, Method, Description):
- void addClue(String value, String source)
- void addClue(String value, String source, int confidence)
- void autoDetectClues(Content content, boolean filter)
- void clearClues()
  Clears all clues.
- String guessEncoding(Content content, String defaultValue)
  Guess the encoding with the previously specified list of clues.
- static void main(String[] args)
- static String parseCharacterEncoding(String contentType)
  Parse the character encoding from the specified content type header.
- static String resolveEncodingAlias(String encoding)

Field Detail

NO_THRESHOLD
public static final int NO_THRESHOLD
See Also:
- Constant Field Values

MIN_CONFIDENCE_KEY
public static final String MIN_CONFIDENCE_KEY
See Also:
- Constant Field Values

Constructor Detail

EncodingDetector
public EncodingDetector(Configuration conf)

Method Detail

autoDetectClues
public void autoDetectClues(Content content, boolean filter)

guessEncoding
public String guessEncoding(Content content, String defaultValue)
Guess the encoding with the previously specified list of clues.
Parameters:
- content - Content instance
- defaultValue - Default encoding to return if no encoding can be detected with enough confidence. Note that this will not be normalized with resolveEncodingAlias(java.lang.String)
Returns:
- Guessed encoding or defaultValue

clearClues
public void clearClues()
Clears all clues.

parseCharacterEncoding
public static String parseCharacterEncoding(String contentType)
Parse the character encoding from the specified content type header. If the content type is null, or there is no explicit character encoding, null is returned.
This method was copied from org.apache.catalina.util.RequestUtil, which is licensed under the Apache License, Version 2.0 (the "License").
Parameters:
- contentType - a content type header
Returns:
- a trimmed string representation of the 'charset=' value, null if this is not available
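
The contract described above (null input or missing charset yields null, otherwise the trimmed 'charset=' value) can be sketched in a few lines. The class name `CharsetParamSketch` is hypothetical, and the handling of surrounding double quotes and the case-sensitive match on `charset=` are assumptions of this sketch rather than documented behavior of the Nutch method.

```java
/** Hypothetical stand-in that follows the documented contract; not the Nutch source. */
final class CharsetParamSketch {

    private CharsetParamSketch() {
    }

    /**
     * Returns the trimmed value of the "charset=" parameter, or null when the
     * header is null or carries no explicit charset.
     */
    static String parseCharacterEncoding(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Assumption: the parameter name is matched case-sensitively.
        int start = contentType.indexOf("charset=");
        if (start < 0) {
            return null;
        }
        String encoding = contentType.substring(start + "charset=".length());
        // Cut off any parameters that follow the charset.
        int end = encoding.indexOf(';');
        if (end >= 0) {
            encoding = encoding.substring(0, end);
        }
        encoding = encoding.trim();
        // Assumption: strip one pair of surrounding double quotes, e.g. charset="utf-8".
        if (encoding.length() > 2 && encoding.startsWith("\"") && encoding.endsWith("\"")) {
            encoding = encoding.substring(1, encoding.length() - 1);
        }
        return encoding.trim();
    }

    public static void main(String[] args) {
        System.out.println(parseCharacterEncoding("text/html; charset=ISO-8859-1"));
        System.out.println(parseCharacterEncoding("text/html; charset=\"utf-8\" ; q=1"));
        System.out.println(parseCharacterEncoding("text/html"));
    }
}
```

A value obtained this way would typically be passed on as one more clue, since header charsets are often wrong.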
 
main
public static void main(String[] args) throws IOException
Throws:
- IOException