Class HTMLParser
- All Implemented Interfaces:
LinkExtractorParser
- Direct Known Subclasses:
JsoupBasedHtmlParser, LagartoBasedHtmlParser
HTMLParser subclasses can parse HTML content to obtain URLs.
-
Field Summary
Modifier and Type, Field
protected static final String ATT_ARCHIVE
protected static final String ATT_BACKGROUND
protected static final String ATT_CODE
protected static final String ATT_CODEBASE
protected static final String ATT_DATA
protected static final String ATT_HREF
protected static final String ATT_IS_IMAGE
protected static final String ATT_REL
protected static final String ATT_SRC
protected static final String ATT_STYLE
protected static final String ATT_TYPE
static final String DEFAULT_PARSER
protected static final String ICON
protected static final String IE_UA
protected static final Pattern IE_UA_PATTERN
static final String PARSER_CLASSNAME
protected static final String PRELOAD
protected static final String SHORTCUT_ICON
protected static final String STYLESHEET
protected static final String TAG_APPLET
protected static final String TAG_BASE
protected static final String TAG_BGSOUND
protected static final String TAG_BODY
protected static final String TAG_EMBED
protected static final String TAG_FRAME
protected static final String TAG_IFRAME
protected static final String TAG_IMAGE
protected static final String TAG_INPUT
protected static final String TAG_LINK
protected static final String TAG_OBJECT
protected static final String TAG_SCRIPT
-
Constructor Summary
Modifier, Constructor, Description
protected HTMLParser()
    Protected constructor to prevent instantiation except from within subclasses.
-
Method Summary
Modifier and Type, Method, Description
protected Float extractIEVersion(String userAgent)
Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, String encoding)
    Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, Collection<URLString> coll, String encoding)
    Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding)
    Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
protected static boolean isEnableConditionalComments(Float ieVersion)
protected static String normalizeUrlValue(CharSequence url)
    Normalizes URL as browsers do
Methods inherited from class org.apache.jmeter.protocol.http.parser.BaseParser:
getParser, isReusable
-
Field Details
-
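ATT_ARCHIVE
protected static final String ATT_ARCHIVE
-
ATT_BACKGROUND
protected static final String ATT_BACKGROUND
-
ATT_CODE
protected static final String ATT_CODE
-
ATT_CODEBASE
protected static final String ATT_CODEBASE
-
ATT_DATA
protected static final String ATT_DATA
-
ATT_HREF
protected static final String ATT_HREF
-
ATT_REL
protected static final String ATT_REL
-
ATT_SRC
protected static final String ATT_SRC
-
ATT_STYLE
protected static final String ATT_STYLE
-
ATT_TYPE
protected static final String ATT_TYPE
-
ATT_IS_IMAGE
protected static final String ATT_IS_IMAGE
-
TAG_APPLET
protected static final String TAG_APPLET
-
TAG_BASE
protected static final String TAG_BASE
-
TAG_BGSOUND
protected static final String TAG_BGSOUND
-
TAG_BODY
protected static final String TAG_BODY
-
TAG_EMBED
protected static final String TAG_EMBED
-
TAG_FRAME
protected static final String TAG_FRAME
-
TAG_IFRAME
protected static final String TAG_IFRAME
-
TAG_IMAGE
protected static final String TAG_IMAGE
-
TAG_INPUT
protected static final String TAG_INPUT
-
TAG_LINK
protected static final String TAG_LINK
-
TAG_OBJECT
protected static final String TAG_OBJECT
-
TAG_SCRIPT
protected static final String TAG_SCRIPT
-
STYLESHEET
protected static final String STYLESHEET
-
SHORTCUT_ICON
protected static final String SHORTCUT_ICON
-
ICON
protected static final String ICON
-
PRELOAD
protected static final String PRELOAD
-
IE_UA
protected static final String IE_UA
-
IE_UA_PATTERN
protected static final Pattern IE_UA_PATTERN
-
PARSER_CLASSNAME
public static final String PARSER_CLASSNAME
-
DEFAULT_PARSER
public static final String DEFAULT_PARSER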
-
-
Constructor Details
-
HTMLParser
protected HTMLParser()
Protected constructor to prevent instantiation except from within subclasses.
-
-
Method Details
-
getEmbeddedResourceURLs
public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, String encoding) throws HTMLParseException
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
URLs should not appear twice in the returned iterator.
Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.
- Parameters:
userAgent - User Agent
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
encoding - Charset
- Returns:
an Iterator for the resource URLs
- Throws:
HTMLParseException - when parsing the html fails
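
The contract above (resolve against the base URL, no duplicates in the returned iterator) can be illustrated with a small standalone sketch. It is not the JMeter implementation: the class `EmbeddedUrlSketch`, its `embeddedUrls` method, and the naive `src="..."` regex are hypothetical, the `userAgent` parameter is omitted, and `IllegalArgumentException` stands in for `HTMLParseException`.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not a JMeter class. */
final class EmbeddedUrlSketch {
    // Assumption for brevity: only double-quoted src attributes are considered.
    private static final Pattern SRC =
            Pattern.compile("src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static Iterator<URL> embeddedUrls(byte[] html, URL baseUrl, String encoding) {
        String text = new String(html, Charset.forName(encoding));
        // De-duplicate on URI: URL.equals/hashCode may perform host name resolution.
        Set<URI> seen = new LinkedHashSet<>();
        List<URL> urls = new ArrayList<>();
        try {
            URI base = baseUrl.toURI();
            Matcher m = SRC.matcher(text);
            while (m.find()) {
                // Resolve relative references against the page's base URL.
                URI resolved = base.resolve(m.group(1).trim());
                if (seen.add(resolved)) {
                    urls.add(resolved.toURL());
                }
            }
        } catch (URISyntaxException | MalformedURLException e) {
            // Stand-in for HTMLParseException in this sketch.
            throw new IllegalArgumentException(e);
        }
        return urls.iterator();
    }

    public static void main(String[] args) throws Exception {
        URL base = URI.create("http://example.com/dir/page.html").toURL();
        byte[] html = "<img src=\"a.png\"><script src=\"/js/app.js\"></script><img src=\"a.png\">"
                .getBytes(StandardCharsets.UTF_8);
        Iterator<URL> it = embeddedUrls(html, base, "UTF-8");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

In this sketch the repeated `a.png` reference yields a single entry, and both relative references are resolved against the base URL before being returned.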
-
getEmbeddedResourceURLs
public abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
All URLs should be added to the Collection.
Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.
N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
- Parameters:
userAgent - User Agent
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - URLCollection
encoding - Charset
- Returns:
an Iterator for the resource URLs
- Throws:
HTMLParseException - when parsing the html fails
-
getEmbeddedResourceURLs
public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, Collection<URLString> coll, String encoding) throws HTMLParseException
Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
- Parameters:
userAgent - User Agent
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - Collection; will contain URLString objects, not URLs
encoding - Charset
- Returns:
an Iterator for the resource URLs
- Throws:
HTMLParseException - when parsing the html fails
-
isEnableConditionalComments
protected static boolean isEnableConditionalComments(Float ieVersion)
- Parameters:
ieVersion - Float IE version
- Returns:
true if the IE version is lower than 10
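
The version check can be sketched as follows. This is a minimal standalone illustration rather than the JMeter source: the class name `ConditionalCommentsSketch` is hypothetical, and treating a null version (a non-IE user agent) as false is an assumption the page does not state.

```java
/** Hypothetical illustration only; not a JMeter class. */
final class ConditionalCommentsSketch {
    // Assumption: null (no IE version detected) disables conditional comments.
    static boolean isEnableConditionalComments(Float ieVersion) {
        // Strictly lower than 10, so version 10 itself is excluded.
        return ieVersion != null && ieVersion < 10.0f;
    }

    public static void main(String[] args) {
        System.out.println(isEnableConditionalComments(9.0f));
        System.out.println(isEnableConditionalComments(10.0f));
        System.out.println(isEnableConditionalComments(null));
    }
}
```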
-
extractIEVersion
protected Float extractIEVersion(String userAgent)
- Parameters:
userAgent - User Agent
- Returns:
the version that follows "MSIE" in the User Agent, or null if the User Agent is not IE
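
A standalone sketch of this lookup is shown below. The values of `IE_UA` and `IE_UA_PATTERN` are not given on this page, so the regular expression here is an assumption, and the class name `IeVersionSketch` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not a JMeter class. */
final class IeVersionSketch {
    // Assumed pattern: "MSIE " followed by a major.minor version number.
    private static final Pattern MSIE = Pattern.compile("MSIE ([0-9]+\\.[0-9]+)");

    static Float extractIEVersion(String userAgent) {
        if (userAgent == null) {
            return null;
        }
        Matcher m = MSIE.matcher(userAgent);
        // Return the captured version, or null when the user agent is not IE.
        return m.find() ? Float.valueOf(m.group(1)) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractIEVersion(
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"));
        System.out.println(extractIEVersion(
                "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"));
    }
}
```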
-
normalizeUrlValue
protected static String normalizeUrlValue(CharSequence url)
Normalizes URL as browsers do
- Parameters:
url - CharSequence
- Returns:
normalized url
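
The page does not say which normalization rules "as browsers do" covers. The sketch below assumes the preprocessing steps of the WHATWG URL parser (strip leading and trailing control characters and spaces, then remove ASCII tabs and newlines). The class name `UrlNormalizeSketch` and the null handling are hypothetical, and the real method may behave differently.

```java
/** Hypothetical illustration only; not a JMeter class. */
final class UrlNormalizeSketch {
    static String normalizeUrlValue(CharSequence url) {
        if (url == null) {
            return null; // Assumption: null in, null out.
        }
        // trim() drops leading/trailing characters up to and including U+0020.
        String trimmed = url.toString().trim();
        // Remove tabs, line feeds, and carriage returns anywhere in the value.
        return trimmed.replaceAll("[\\t\\n\\r]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalizeUrlValue(" http://exam\nple.com/a\tb "));
    }
}
```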
-