Description of the htmlcleaner tag

HtmlCleaner is an open-source HTML parser written in Java.

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. Any serious consumption of such documents requires first cleaning up the mess and bringing order to tags, attributes, and ordinary text. Given an HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to build the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

HtmlCleaner can be used from Java code, as a command-line tool, or as an Ant task. It is designed to be small, independent (no runtime dependencies except JRE 1.5+), fast, and flexible (its behavior is configurable through a number of parameters). Although the main motive was to prepare ordinary HTML for XML processing with XPath, XQuery, and XSLT, the structured data produced by HtmlCleaner may be consumed and handled in many other ways.
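Once markup is well-formed XML, standard XML tooling can query it. The following is a minimal sketch of that downstream step using only the JDK's `javax.xml.xpath` API. The class name `CleanedXmlXPathDemo`, the method `extractLinks`, and the input string are illustrative assumptions: the XML here is a hand-written stand-in for cleaned output, not actual HtmlCleaner output, and no HtmlCleaner calls are shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CleanedXmlXPathDemo {

    // Hypothetical helper: returns the href value of every <a> element
    // in a well-formed XML document, in document order.
    static List<String> extractLinks(String wellFormedXml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate(
                "//a/@href",
                new InputSource(new StringReader(wellFormedXml)),
                XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            result.add(hrefs.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a hand-written, well-formed sample standing in for cleaned HTML.
        String xml =
                "<html><head><title>Demo</title></head><body>"
                + "<p>See <a href=\"http://htmlcleaner.sourceforge.net/\">HtmlCleaner</a>"
                + " and <a href=\"/docs\">docs</a>.</p>"
                + "</body></html>";

        for (String href : extractLinks(xml)) {
            System.out.println(href);
        }
    }
}
```

The same XPath expression would fail on typical ill-formed HTML, because the XML parser rejects unclosed or misnested tags; that is the gap a cleaning step is meant to close.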

Features:

  • HtmlCleaner parses input HTML and generates a tree structure suitable for programmatic manipulation.
  • Serializers are responsible for outputting the resulting structure as XML, HTML, DOM, or JDOM.
  • The parsing phase relies on tag descriptions, which can be customized by the user.
  • HtmlCleaner's behaviour can be configured through a number of parameters.
  • HtmlCleaner is thread-safe, meaning that a single instance can clean multiple HTML sources at the same time.
  • HtmlCleaner can be used from Java code, from the command line, or as an Ant task.
  • HtmlCleaner requires JRE 1.5+.

Official Website: http://htmlcleaner.sourceforge.net/

Useful Links: