Class JsoupBasedHtmlParser

All Implemented Interfaces:
LinkExtractorParser

public class JsoupBasedHtmlParser extends HTMLParser
HTML parser based on the jsoup library.
Since:
2.10
TODO: Factor out common code between LagartoBasedHtmlParser and this one (adapter pattern)
  • Constructor Details

    • JsoupBasedHtmlParser

      public JsoupBasedHtmlParser()
  • Method Details

    • getEmbeddedResourceURLs

      public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException
      Description copied from class: HTMLParser
      Get the URLs of all the resources that a browser would automatically download after downloading the HTML content, that is: images, stylesheets, JavaScript files, applets, and so on.

      All URLs should be added to the Collection.

      Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

      N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.

      Specified by:
      getEmbeddedResourceURLs in class HTMLParser
      Parameters:
      userAgent - User Agent
      html - HTML code
      baseUrl - Base URL from which the HTML code was obtained
      coll - URLCollection
      encoding - Charset
      Returns:
      an Iterator for the resource URLs
      Throws:
      HTMLParseException - when parsing the HTML fails
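
The contract described above (decode the bytes with the given charset, resolve each embedded reference against the base URL, add it to the collection, and return an Iterator over the results) can be illustrated with a minimal, standard-library-only sketch. This is not the jsoup-based implementation of this class: the class name EmbeddedResourceSketch, the method embeddedResourceUrls, the regular expression, and the use of a plain Set<String> in place of URLCollection are all assumptions made for illustration, and only double-quoted src attributes on img and script tags are recognised.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of JMeter and not how JsoupBasedHtmlParser works internally.
public class EmbeddedResourceSketch {

    // Simplified assumption: only double-quoted src attributes on <img> and <script>.
    private static final Pattern SRC = Pattern.compile(
            "<(?:img|script)\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /**
     * Decodes the HTML, resolves every matched reference against baseUrl,
     * records each distinct URL in coll, and returns an Iterator over the
     * URLs newly added by this call.
     */
    public static Iterator<URL> embeddedResourceUrls(
            byte[] html, URL baseUrl, Set<String> coll, String encoding)
            throws IOException, URISyntaxException {
        String text = new String(html, encoding);
        URI base = baseUrl.toURI();
        List<URL> found = new ArrayList<>();
        Matcher m = SRC.matcher(text);
        while (m.find()) {
            URL resolved;
            try {
                resolved = base.resolve(m.group(1)).toURL();
            } catch (IllegalArgumentException e) {
                // Malformed reference: this sketch skips it, whereas the
                // documented contract allows reporting it to the caller.
                continue;
            }
            // The Set de-duplicates, so a resource referenced twice is returned once.
            if (coll.add(resolved.toString())) {
                found.add(resolved);
            }
        }
        return found.iterator();
    }

    public static void main(String[] args) throws Exception {
        byte[] html = "<html><img src=\"logo.png\"><script src=\"/js/app.js\"></script></html>"
                .getBytes("UTF-8");
        URL base = URI.create("http://example.com/dir/page.html").toURL();
        Set<String> coll = new LinkedHashSet<>();
        Iterator<URL> it = embeddedResourceUrls(html, base, coll, "UTF-8");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Relative references are resolved with URI.resolve, so the base URL determines the absolute form of each resource. A real parser such as jsoup would walk the parsed document tree instead of using a regular expression, which handles malformed markup and other attribute styles far more robustly.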