Open Source HTML Parsers in Java

10 projects

SAX-compliant Java parser for real-world (malformed) HTML; outputs well-formed XML/SAX stream. JAXP integration; command-line tool.

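As a rough illustration of what "SAX-compliant" buys you: any handler written against the standard `org.xml.sax` interfaces can consume the parser's event stream. The sketch below uses only JDK classes and feeds well-formed markup through the JDK's own XML parser as a stand-in. The class name `SaxLinkCollector` and the sample markup are made up for this example; with a SAX-compliant HTML parser, one would presumably pass that library's `XMLReader` implementation to `collect` instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the href of every <a> element seen in a SAX stream.
public class SaxLinkCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is used because the default JDK factory is not namespace-aware.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    // Works with any XMLReader, which is the point of SAX compliance.
    public static List<String> collect(XMLReader reader, String markup)
            throws IOException, SAXException {
        SaxLinkCollector handler = new SaxLinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK XML parser, which needs well-formed input.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String xhtml = "<html><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>";
        System.out.println(collect(reader, xhtml));
    }
}
```

The handler itself never learns which parser produced the events, so swapping the stand-in reader for an HTML-tolerant one should not require changes to it.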

Java HTML parser: turns dirty/ill-formed HTML into well-formed XML. Browser-like tag balancing; custom tag and rule sets; HTML5 support.

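To give a feel for what tag balancing means, here is a deliberately simplified, stack-based sketch using only the JDK. It is not the algorithm of any project listed here, and real browsers follow considerably more elaborate rules. The class name `TagBalancer`, the regular expression, and the short list of void elements are all assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy balancer: closes unclosed tags and drops stray end tags.
public class TagBalancer {
    // Matches a start or end tag; group 1 is "/" for end tags, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    // Elements that never take an end tag (illustrative subset only).
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input", "meta", "link");

    public static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // currently open elements, innermost first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());      // copy text between tags unchanged
            pos = m.end();
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                out.append(m.group());
                if (!VOID.contains(name)) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close everything opened after the matching start tag, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // An end tag with no matching start tag is silently dropped.
        }
        out.append(html.substring(pos));
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<p><b>bold<i>both</b> tail"));
    }
}
```

For the sample input in `main`, the sketch closes `<i>` before `</b>` and appends the missing `</p>` at the end. A production parser would additionally handle optional end tags, nesting rules per element, comments, and script content.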
JTidy (Inactive)

Java port of HTML Tidy: a syntax checker and pretty printer that cleans up malformed HTML. Provides a DOM interface to real-world HTML; maintained by volunteers.

NekoHTML (Inactive)

HTML scanner and tag balancer; exposes HTML via XML APIs. Fixes common HTML mistakes; uses Xerces2 XNI. DOM and SAX parsers.

HotSax (Inactive)

Small, fast SAX2 parser for HTML/XML/XHTML; handles malformed HTML. For web agents, scrapers, spiders. Non-validating.

HTML Parser (Inactive)

Fast real-time Java HTML parser: extraction and transformation; filters, visitors, custom tags; lexer and nested parser modes.


Java library to parse HTML into a stream of tag objects or a searchable tree. Same project as HTML Parser (htmlparser.sourceforge.


Java bridge to Mozilla's HTML parser: raw or dirty HTML in, Java Document out. JNI to Mozilla classes.


Java library for parsing and modifying HTML; preserves unrecognised markup. Form analysis and extraction; available under a choice of LGPL, EPL, or Apache licenses.


Pure Java HTML DOM parser (HTML 4.01): fast, with tag balancing and support for optional end tags. Part of the binhgiang project (VietSpider extractor tools).
