Package org.apache.nutch.parse.html
Class DOMContentUtils
- java.lang.Object
  - org.apache.nutch.parse.html.DOMContentUtils

public class DOMContentUtils extends Object

A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks and getText.

Nested Class Summary

- static class DOMContentUtils.LinkParams

Constructor Summary

- DOMContentUtils(Configuration conf)

Method Summary

- String getBase(Node node): If Node contains a BASE tag, then its HREF is returned.
- void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
- void getText(StringBuffer sb, Node node): This is a convenience method, equivalent to getText(sb, node, false).
- boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors): This method takes a StringBuffer and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuffer.
- boolean getTitle(StringBuffer sb, Node node): This method takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
- void setConf(Configuration conf)
 
Constructor Detail

DOMContentUtils

public DOMContentUtils(Configuration conf)
 
Method Detail

setConf

public void setConf(Configuration conf)

getText

public boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors)

This method takes a StringBuffer and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuffer. If abortOnNestedAnchors is true, DOM traversal is aborted and the StringBuffer will not contain any text encountered after a nested anchor is found.

Parameters:
- sb - a StringBuffer used to store content text found beneath the DOM node, if any exists
- node - a DOM Node to check for content text
- abortOnNestedAnchors - true to abort if nested anchors are encountered, false otherwise

Returns:
- true if nested anchors were found
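
The traversal contract described above can be illustrated with a minimal sketch that uses only the JDK's DOM classes. This is not the Nutch implementation: the class TextSketch and its methods collectText and parse are hypothetical names, the JDK XML parser stands in for an HTML parser, and whitespace handling is simplified. The sketch reports nested anchors in both modes, which is one reading of the "Returns" description.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of org.apache.nutch.parse.html.
final class TextSketch {

  /** Appends text found beneath node to sb; returns true if a nested anchor was seen. */
  static boolean collectText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
    return walk(sb, node, abortOnNestedAnchors, 0);
  }

  private static boolean walk(StringBuffer sb, Node node, boolean abort, int anchorDepth) {
    boolean found = false;
    boolean isAnchor = node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName());
    if (isAnchor && ++anchorDepth > 1) {
      found = true;                 // an anchor inside another anchor
      if (abort) return true;       // stop before appending its text
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (walk(sb, c, abort, anchorDepth)) {
        found = true;
        if (abort) return true;     // propagate the abort upwards
      }
    }
    return found;
  }

  /** Parses well-formed markup with the JDK XML parser (an assumption of this sketch). */
  static Node parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse markup", e);
    }
  }

  public static void main(String[] args) {
    Node root = parse("<a>one <b>two</b><a>three</a> four</a>");
    StringBuffer all = new StringBuffer();
    StringBuffer cut = new StringBuffer();
    System.out.println(collectText(all, root, false) + ": " + all);
    System.out.println(collectText(cut, root, true) + ": " + cut);
  }
}
```

With abortOnNestedAnchors set to true, the buffer in this sketch holds only the text that precedes the nested anchor.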
 

getText

public void getText(StringBuffer sb, Node node)

This is a convenience method, equivalent to getText(sb, node, false).

Parameters:
- sb - a StringBuffer used to store content text found beneath the DOM node, if any exists
- node - a DOM Node to check for content text
 

getTitle

public boolean getTitle(StringBuffer sb, Node node)

This method takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.

Parameters:
- sb - a StringBuffer used to store content text found beneath the DOM node, if any exists
- node - a DOM Node to check for content text

Returns:
- true if a title node was found, false otherwise
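
As a rough illustration of the "first title node" behaviour, the following sketch searches depth-first with plain JDK DOM calls. TitleSketch, firstTitle, and parse are hypothetical names; trimming the title text and parsing with the JDK XML parser are assumptions of the sketch, not documented behaviour of getTitle.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of org.apache.nutch.parse.html.
final class TitleSketch {

  /** Appends the text of the first title element beneath node; returns true if one was found. */
  static boolean firstTitle(StringBuffer sb, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "title".equalsIgnoreCase(node.getNodeName())) {
      sb.append(node.getTextContent().trim());
      return true;                  // stop at the first match in document order
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (firstTitle(sb, c)) return true;
    }
    return false;
  }

  /** Parses well-formed markup with the JDK XML parser (an assumption of this sketch). */
  static Node parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse markup", e);
    }
  }

  public static void main(String[] args) {
    StringBuffer sb = new StringBuffer();
    boolean found = firstTitle(sb, parse(
        "<html><head><title> Hello </title></head><body><p>text</p></body></html>"));
    System.out.println(found + ": " + sb);
  }
}
```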
 

getBase

public String getBase(Node node)

If Node contains a BASE tag, then its HREF is returned.

Parameters:
- node - a DOM Node to check for a BASE tag

Returns:
- HREF if one exists
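
A minimal sketch of a BASE lookup using only JDK DOM classes might look as follows. BaseSketch, findBaseHref, and parse are hypothetical names, and returning null when no BASE tag with an HREF is present is an assumption of this sketch; the description above only states what is returned when one exists.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of org.apache.nutch.parse.html.
final class BaseSketch {

  /** Returns the href of the first base element beneath node, or null (assumed) if none exists. */
  static String findBaseHref(Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "base".equalsIgnoreCase(node.getNodeName())) {
      NamedNodeMap attrs = node.getAttributes();
      for (int i = 0; i < attrs.getLength(); i++) {
        Node attr = attrs.item(i);
        // Compare case-insensitively so both HREF and href match.
        if ("href".equalsIgnoreCase(attr.getNodeName())) return attr.getNodeValue();
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      String href = findBaseHref(c);
      if (href != null) return href;
    }
    return null;
  }

  /** Parses well-formed markup with the JDK XML parser (an assumption of this sketch). */
  static Node parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse markup", e);
    }
  }

  public static void main(String[] args) {
    Node root = parse("<html><head><BASE HREF='http://example.com/dir/'/></head><body/></html>");
    System.out.println(findBaseHref(root));
  }
}
```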
 

getOutlinks

public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)

This method finds all anchors below the supplied DOM node, creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.

Links without inner structure (tags, text, etc.) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
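
The anchor collection described above can be approximated with the JDK alone. In this sketch, OutlinkSketch, collectLinks, and parse are hypothetical names, java.net.URI stands in for the URL parameter, and plain strings stand in for Outlink records. Only the first discard rule (anchors with no inner structure) is modelled; the nested-link rule is left out, and skipping unparsable href values is an assumption.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of org.apache.nutch.parse.html.
final class OutlinkSketch {

  /** Adds the absolute target of every non-empty anchor beneath node to out. */
  static void collectLinks(URI base, List<String> out, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      Node href = node.getAttributes().getNamedItem("href");
      // Anchors without any inner structure (no child nodes) are discarded.
      if (href != null && node.hasChildNodes()) {
        try {
          out.add(base.resolve(href.getNodeValue()).toString());
        } catch (IllegalArgumentException e) {
          // Skip href values that are not valid URI references (sketch assumption).
        }
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectLinks(base, out, c);
    }
  }

  /** Parses well-formed markup with the JDK XML parser (an assumption of this sketch). */
  static Node parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse markup", e);
    }
  }

  public static void main(String[] args) {
    List<String> links = new ArrayList<>();
    collectLinks(URI.create("http://example.com/docs/index.html"), links, parse(
        "<body><a href='page.html'>Page</a><a href='/top'><b>Top</b></a><a href='empty.html'/></body>"));
    links.forEach(System.out::println);
  }
}
```

Relative targets are resolved against the base before being stored, so callers of such a helper would only ever see absolute links.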
 