jsoup Java HTML Parser release 1.22.1
jsoup 1.22.1 is out now, adding support for the re2j regular expression engine for regex-based CSS selectors, a configurable maximum parser depth, and numerous bug fixes and improvements.
jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Added support for using the re2j regular expression engine for regex-based CSS selectors (e.g. [attr~=regex], :matches(regex)), which ensures linear-time performance for regex evaluation. This allows safer handling of arbitrary user-supplied query regexes. To enable, add the com.google.re2j dependency to your classpath, e.g.:
<dependency>
<groupId>com.google.re2j</groupId>
<artifactId>re2j</artifactId>
<version>1.8</version>
</dependency>
(If you already have that dependency in your classpath, but you want to keep using the Java regex engine, you can disable re2j via System.setProperty("jsoup.useRe2j", "false").) You can confirm that the re2j engine has been enabled correctly by calling Regex.usingRe2j(). #2407
- Added an instance method Parser#unescape(String, boolean) that unescapes HTML entities using the parser's configuration (e.g. to support error tracking), complementing the existing static utility Parser.unescapeEntities(String, boolean). #2396
- Added a configurable maximum parser depth (to limit the number of open elements on the stack) to both HTML and XML parsers. The HTML parser now defaults to a depth of 512 to match browser behavior and to protect against unbounded stack growth, while the XML parser keeps unlimited depth by default but can opt into a limit via Parser.setMaxDepth(). #2421
- Build: added CI coverage for JDK 25. #2403
- Build: added a CI fuzzer for contextual fragment parsing (in addition to existing full body HTML and XML fuzzers). oss-fuzz #14041
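To see how the new maximum parser depth treats your own input, it can help to generate markup that nests deeper than the HTML default of 512. The helper below is a minimal sketch using only the JDK; the class and method names (DeepNesting, nestedDivs) are hypothetical and not part of jsoup. The resulting string can be passed to Jsoup.parse(String) and the parsed tree inspected.

```java
public class DeepNesting {

    /** Builds {@code depth} nested div elements wrapped around the given text. */
    static String nestedDivs(int depth, String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("<div>");
        }
        sb.append(text);
        for (int i = 0; i < depth; i++) {
            sb.append("</div>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 600 levels is deeper than the HTML parser's new default limit of 512.
        String html = nestedDivs(600, "leaf");
        System.out.println("Generated " + html.length() + " characters of nested markup");
        // With jsoup on the classpath, parse it with: org.jsoup.Jsoup.parse(html)
    }
}
```

A plain loop is used rather than String.repeat so the sketch also compiles on Java 8.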
Changes
- Set a removal schedule of jsoup 1.24.1 for previously deprecated APIs.
Bug Fixes
- Previously cached child Elements of an Element were not correctly invalidated in Node#replaceWith(Node), which could lead to incorrect results when subsequently calling Element#children(). #2391
- Attribute selector values are now compared literally without trimming. Previously, jsoup trimmed whitespace from selector values and from element attribute values, which could cause mismatches with browser behavior (e.g. [attr=" foo "]). Now matches align with the CSS specification and browser engines. #2380
- When using the JDK HttpClient, any system default proxy (ProxySelector.getDefault()) was ignored. Now, the system proxy is used if a per-request proxy is not set. #2388, #2390
- A ValidationException could be thrown in the adoption agency algorithm with particularly broken input. Now logged as a parse error. #2393
- Null characters in the HTML body were not consistently removed, and in foreign content were not correctly replaced. #2395
- An IndexOutOfBoundsException could be thrown when parsing a body fragment with crafted input. Now logged as a parse error. #2397
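For the system proxy fix above, the default that jsoup now honors is the one returned by ProxySelector.getDefault(). As a minimal sketch using only the JDK (Java 9 or later, for ProxySelector.of), the following installs a JVM-wide default proxy and shows what the selector returns for a target URL. The class name, method name, and proxy.example.com:8080 are placeholders chosen for illustration.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class SystemProxySetup {

    /** Installs a JVM-wide default proxy and returns what the selector picks for the target URI. */
    static List<Proxy> installDefaultProxy(String host, int port, URI target) {
        // createUnresolved avoids a DNS lookup for the placeholder proxy host.
        InetSocketAddress address = InetSocketAddress.createUnresolved(host, port);
        ProxySelector.setDefault(ProxySelector.of(address));
        return ProxySelector.getDefault().select(target);
    }

    public static void main(String[] args) {
        List<Proxy> proxies =
                installDefaultProxy("proxy.example.com", 8080, URI.create("https://jsoup.org/"));
        // Requests made without a per-request proxy should now be routed via this default.
        System.out.println(proxies);
    }
}
```

Note that ProxySelector.setDefault affects every client in the JVM that consults the default selector, not just jsoup, so a per-request proxy may be the better choice when only some requests should be proxied.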
Internal Changes
- Deprecated internal helper org.jsoup.internal.Functions (for removal in v1.23.1). This was previously used to support older Android API levels without full java.util.function coverage; jsoup now requires core library desugaring, so this indirection is no longer necessary. #2412
My sincere thanks to everyone who contributed to this release! If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.