Package org.apache.nutch.util
Class URLUtil
- java.lang.Object
  - org.apache.nutch.util.URLUtil

public class URLUtil extends Object
Utility class for URL analysis

Constructor Summary
- URLUtil()
 - 
Method Summary
- static String chooseRepr(String src, String dst, boolean temp): Given two urls, a src and a destination of a redirect, it returns the representative url.
- static String getDomainName(String url): Returns the domain name of the url.
- static String getDomainName(URL url): Get the domain name of the url.
- static DomainSuffix getDomainSuffix(String url): Returns the DomainSuffix corresponding to the last public part of the hostname.
- static DomainSuffix getDomainSuffix(URL url): Returns the DomainSuffix corresponding to the last public part of the hostname.
- static String getHost(String url): Returns the lowercased hostname for the URL or null if the URL is not well-formed.
- static String getHost(URL url): Returns the lowercased hostname for the URL.
- static String[] getHostSegments(String url): Partitions the hostname of the url by "."
- static String[] getHostSegments(URL url): Partitions the hostname of the url by "."
- static String getPage(String url): Returns the page for the url.
- static String getProtocol(String url)
- static String getProtocol(URL url)
- static String getTopLevelDomainName(String url): Returns the top level domain name of the url.
- static String getTopLevelDomainName(URL url): Returns the top level domain name of the url.
- static boolean isHomePageOf(URL url, String hostName): Test whether a URL is the home page or root page of a host.
- static boolean isSameDomainName(String url1, String url2): Returns whether the given urls have the same domain name.
- static boolean isSameDomainName(URL url1, URL url2): Returns whether the given urls have the same domain name.
- static void main(String[] args): For testing
- static URL resolveURL(URL base, String target): Resolve relative URLs and fix a java.net.URL error in handling of URLs with pure query targets.
- static String toASCII(String url)
- static String toUNICODE(String url)
 
Method Detail
resolveURL
public static URL resolveURL(URL base, String target) throws MalformedURLException
Resolve relative URLs and fix a java.net.URL error in handling of URLs with pure query targets.
- Parameters:
- base- base url
- target- target url (may be relative)
- Returns:
- resolved absolute url.
- Throws:
- MalformedURLException- if the input base URL is malformed
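
A pure query target is a relative reference such as "?y=1" that carries only a query string. The sketch below shows one way such a target could be resolved so that the file name of the base is kept. It is an illustration under assumptions, not the code of resolveURL: the class and method names are hypothetical, and it uses java.net.URI with plain strings rather than java.net.URL.

```java
import java.net.URI;

class ResolveSketch {

    /**
     * Hypothetical helper: resolves target against base. For a pure query
     * target ("?..."), the last path segment of the base is prepended first,
     * so the resolved url keeps the base file name.
     */
    static String resolve(String base, String target) {
        URI baseUri = URI.create(base);
        if (target.startsWith("?")) {
            String path = baseUri.getRawPath() == null ? "" : baseUri.getRawPath();
            // Everything after the last "/" is the file name (may be empty).
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            target = lastSegment + target;
        }
        return baseUri.resolve(target).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://a.com/path/file.html", "?y=1"));
        System.out.println(resolve("http://a.com/path/file.html", "other.html"));
    }
}
```

Ordinary relative targets and absolute targets pass through the standard resolution unchanged; only targets that start with "?" are rewritten.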
 
 - 
getDomainName
public static String getDomainName(URL url)
Get the domain name of the url. The domain name of a url is the substring of the url's hostname without subdomain names. As an example
 getDomainName(new URL("http://lucene.apache.org/"))
 will return
 apache.org
- Parameters:
- url- An input URL to extract the domain from
- Returns:
- the domain name string
 
 - 
getDomainName
public static String getDomainName(String url) throws MalformedURLException
Returns the domain name of the url. The domain name of a url is the substring of the url's hostname without subdomain names. As an example
 getDomainName("http://lucene.apache.org/")
 will return
 apache.org
- Parameters:
- url- An input url string to extract the domain from
- Returns:
- the domain name
- Throws:
- MalformedURLException- if the input url is malformed
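
The idea of "hostname without subdomain names" can be sketched as follows: strip leading labels from the hostname until what remains after the next dot is a known public suffix. This is a minimal sketch, not the URLUtil implementation. The class name, the method name, and the three-entry suffix set are hypothetical stand-ins; a real lookup needs a full domain suffix table.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

class DomainNameSketch {

    // Hypothetical, tiny suffix list for illustration only.
    static final Set<String> SUFFIXES = Set.of("org", "com", "co.uk");

    /** Returns the hostname with subdomain labels removed. */
    static String domainName(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in: " + url);
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return candidate; // single label, nothing left to strip
            }
            String rest = candidate.substring(dot + 1);
            if (SUFFIXES.contains(rest)) {
                return candidate; // one label in front of a known suffix
            }
            candidate = rest; // drop the leftmost label and try again
        }
    }

    public static void main(String[] args) {
        System.out.println(domainName("http://lucene.apache.org/"));
    }
}
```

With this suffix set, a host such as www.example.co.uk reduces to example.co.uk, because "co.uk" is matched as a suffix before "uk" is ever considered.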
 
 - 
getTopLevelDomainName
public static String getTopLevelDomainName(URL url) throws MalformedURLException
Returns the top level domain name of the url. The top level domain name of a url is the last segment of the url's hostname, without domain and subdomain names. As an example
 getTopLevelDomainName(new URL("http://lucene.apache.org/"))
 will return
 org
- Parameters:
- url- An input URL to extract the top level domain name from
- Returns:
- the top level domain name
- Throws:
- MalformedURLException- if the input url is malformed
 
 - 
getTopLevelDomainName
public static String getTopLevelDomainName(String url) throws MalformedURLException
Returns the top level domain name of the url. The top level domain name of a url is the last segment of the url's hostname, without domain and subdomain names. As an example
 getTopLevelDomainName("http://lucene.apache.org/")
 will return
 org
- Parameters:
- url- An input url string to extract the top level domain name from
- Returns:
- the top level domain name
- Throws:
- MalformedURLException- if the input url is malformed
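
For the example above, the top level domain name is simply the last dot-separated label of the hostname. The sketch below illustrates only that reading; the class and method names are hypothetical, and it does not consult a domain suffix table as a real implementation might.

```java
import java.net.URI;
import java.util.Locale;

class TopLevelDomainSketch {

    /** Returns the lowercased last label of the url's hostname. */
    static String topLevelDomain(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in: " + url);
        }
        // lastIndexOf returns -1 for a single-label host, so the whole host is kept.
        return host.substring(host.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(topLevelDomain("http://lucene.apache.org/"));
    }
}
```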
 
 - 
isSameDomainName
public static boolean isSameDomainName(URL url1, URL url2)
Returns whether the given urls have the same domain name. As an example,
 isSameDomainName(new URL("http://lucene.apache.org"), new URL("http://people.apache.org/"))
 will return true.
 - 
isSameDomainName
public static boolean isSameDomainName(String url1, String url2) throws MalformedURLException
Returns whether the given urls have the same domain name. As an example,
 isSameDomainName("http://lucene.apache.org", "http://people.apache.org/")
 will return true.
- Parameters:
- url1- first url string to compare domain name
- url2- second url string to compare domain name
- Returns:
- true if the domain names are equal
- Throws:
- MalformedURLException- if either of the input urls is malformed
 
 - 
getDomainSuffix
public static DomainSuffix getDomainSuffix(URL url)
Returns the DomainSuffix corresponding to the last public part of the hostname.
- Parameters:
- url- a URL to extract the domain suffix from
- Returns:
- a DomainSuffix
 
 - 
getDomainSuffix
public static DomainSuffix getDomainSuffix(String url) throws MalformedURLException
Returns the DomainSuffix corresponding to the last public part of the hostname.
- Parameters:
- url- a url string to extract the domain suffix from
- Returns:
- a DomainSuffix
- Throws:
- MalformedURLException- if the input url string is malformed
 
 - 
getHostSegments
public static String[] getHostSegments(URL url)
Partitions the hostname of the url by "."
- Parameters:
- url- a URL to extract host segments from
- Returns:
- a string array of host segments
 
 - 
getHostSegments
public static String[] getHostSegments(String url) throws MalformedURLException
Partitions the hostname of the url by "."
- Parameters:
- url- a url string to extract host segments from
- Returns:
- a string array of host segments
- Throws:
- MalformedURLException- if the input url string is malformed
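
Partitioning a hostname by "." amounts to a split on the literal dot. The sketch below is an illustration with hypothetical names, not the URLUtil code; lowercasing the host and returning an empty array when no host is present are assumptions of the sketch.

```java
import java.net.URI;
import java.util.Locale;

class HostSegmentsSketch {

    /** Splits the hostname of the url into its dot-separated segments. */
    static String[] hostSegments(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return new String[0]; // assumption: no host means no segments
        }
        // The dot is escaped because split() takes a regular expression.
        return host.toLowerCase(Locale.ROOT).split("\\.");
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", hostSegments("http://lucene.apache.org/")));
    }
}
```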
 
 - 
chooseRepr
public static String chooseRepr(String src, String dst, boolean temp)
Given two urls, a src and a destination of a redirect, it returns the representative url.
This method implements an extended version of the algorithm used by the Yahoo! Slurp crawler described here: How does the Yahoo! webcrawler handle redirects?

- Choose the target url if either url is malformed.
- If the domains differ, then keep the destination whether the redirect is temporary or permanent.
  - a.com -> b.com*
- If the redirect is permanent and the source is root, keep the source.
  - *a.com -> a.com?y=1 || *a.com -> a.com/xyz/index.html
- If the redirect is permanent and the source is not root and the destination is root, keep the destination.
  - a.com/xyz/index.html -> a.com*
- If the redirect is permanent and neither the source nor the destination is root, then keep the destination.
  - a.com/xyz/index.html -> a.com/abc/page.html*
- If the redirect is temporary and the source is root and the destination is not root, then keep the source.
  - *a.com -> a.com/xyz/index.html
- If the redirect is temporary and the source is not root and the destination is root, then keep the destination.
  - a.com/xyz/index.html -> a.com*
- If the redirect is temporary and neither the source nor the destination is root, then keep the shortest url. First check for the shortest host, and if both hosts are equal then check by path. Paths are compared first by length, then by the number of / path separators.
  - a.com/xyz/index.html -> a.com/abc/page.html*
  - *www.a.com/xyz/index.html -> www.news.a.com/xyz/index.html
- If the redirect is temporary and both the source and the destination are root, then keep the shortest sub-domain.
  - *www.a.com -> www.news.a.com
 
 Separately from this logic, a further piece of representative url logic runs during indexing and after scoring. During creation of the basic fields before indexing, if a url has a representative url stored, both the url and its representative url (which should never be the same) are checked against their linkrank scores. The higher scoring one is kept as the url and the lower scoring one is held as the orig url inside the index.
- Parameters:
- src- The source url.
- dst- The destination url.
- temp- Is the redirect a temporary redirect.
- Returns:
- String The representative url.
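
The selection rules listed above can be sketched as a chain of checks. This is a minimal reading of the rules as documented, not the URLUtil implementation, and it rests on several assumptions: the class and helper names are hypothetical, the domain is approximated by the last two host labels instead of a suffix lookup, a url counts as root when its path is empty or "/" and it has no query, "shortest host" is taken as string length, and ties go to the source.

```java
import java.net.URI;
import java.util.Locale;

class ChooseReprSketch {

    /** Hypothetical stand-in for a suffix lookup: keeps the last two host labels. */
    static String domainOf(String host) {
        String[] parts = host.split("\\.");
        int n = parts.length;
        return n <= 2 ? host : parts[n - 2] + "." + parts[n - 1];
    }

    /** Assumption: root means an empty or "/" path and no query string. */
    static boolean isRoot(URI u) {
        String path = u.getRawPath();
        return (path == null || path.isEmpty() || path.equals("/")) && u.getRawQuery() == null;
    }

    static long slashes(String path) {
        return path.chars().filter(c -> c == '/').count();
    }

    static String chooseRepr(String src, String dst, boolean temp) {
        URI s;
        URI d;
        try {
            s = URI.create(src);
            d = URI.create(dst);
        } catch (IllegalArgumentException e) {
            return dst; // either url is malformed: choose the target
        }
        if (s.getHost() == null || d.getHost() == null) {
            return dst; // treated as malformed in this sketch
        }
        String sHost = s.getHost().toLowerCase(Locale.ROOT);
        String dHost = d.getHost().toLowerCase(Locale.ROOT);

        // Different domains: keep the destination for both redirect kinds.
        if (!domainOf(sHost).equals(domainOf(dHost))) {
            return dst;
        }
        boolean sRoot = isRoot(s);
        boolean dRoot = isRoot(d);

        // Permanent redirect: a root source wins, otherwise the destination.
        if (!temp) {
            return sRoot ? src : dst;
        }
        // Temporary redirect with exactly one root url: the root wins.
        if (sRoot && !dRoot) {
            return src;
        }
        if (!sRoot && dRoot) {
            return dst;
        }
        // Both root or neither root: the shorter host wins.
        if (sHost.length() != dHost.length()) {
            return sHost.length() < dHost.length() ? src : dst;
        }
        if (sRoot) {
            return src; // both root with equal hosts: tie goes to the source
        }
        // Neither root, equal hosts: shorter path, then fewer "/" separators.
        String sPath = s.getRawPath();
        String dPath = d.getRawPath();
        if (sPath.length() != dPath.length()) {
            return sPath.length() < dPath.length() ? src : dst;
        }
        return slashes(dPath) < slashes(sPath) ? dst : src;
    }
}
```

The examples in the list omit the scheme; the sketch expects absolute urls such as http://a.com/, since a bare a.com has no host component when parsed as a URI and is therefore treated as malformed.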
 
 - 
getHost
public static String getHost(String url)
Returns the lowercased hostname for the URL or null if the URL is not well-formed.
- Parameters:
- url- The URL to check.
- Returns:
- String the hostname for the URL.
 
 - 
getHost
public static String getHost(URL url)
Returns the lowercased hostname for the URL.
- Parameters:
- url- The URL to check.
- Returns:
- String the hostname for the URL.
 
 - 
getPage
public static String getPage(String url)
Returns the page for the url. The page consists of the protocol, host, and path, but does not include the query string. The host is lowercased but the path is not.
- Parameters:
- url- The url to check.
- Returns:
- String The page for the url.
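
The description of a page (protocol, lowercased host, and unmodified path, without the query string) can be illustrated as below. This is a sketch with hypothetical names that follows the description only; dropping the port and fragment and returning null for input that cannot be parsed are assumptions, not documented behavior.

```java
import java.net.URI;
import java.util.Locale;

class PageSketch {

    /** Builds protocol://host/path with a lowercased host and no query. */
    static String page(String url) {
        try {
            URI u = URI.create(url);
            if (u.getScheme() == null || u.getHost() == null) {
                return null; // assumption: unusable input yields null
            }
            String path = u.getRawPath() == null ? "" : u.getRawPath();
            // Only the host is lowercased; the path keeps its original case.
            return u.getScheme() + "://" + u.getHost().toLowerCase(Locale.ROOT) + path;
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(page("http://WWW.Example.com/Path/Index.html?q=1"));
    }
}
```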
 
 - 
main
public static void main(String[] args)
For testing
- Parameters:
- args- run with no args to get help
 
 - 
isHomePageOf
public static boolean isHomePageOf(URL url, String hostName)
Test whether a URL is the home page or root page of a host. This is the case if the URL path is / and the query, port, fragment, and userinfo are empty or not given. In other words the URL is: protocol://hostName/
- Parameters:
- url- the URL to test
- hostName- the host name to test the URL on
- Returns:
- true if the URL is the home or root page of the host
 
 