stormcrawler-3.5.0
Summary
Apache StormCrawler 3.5.0 decouples Selenium from the core module (#1604), improving modularity and reducing unnecessary dependencies in core. The release also introduces an advanced metadata filtering system (#1647) that supports compound logical expressions such as key=>val OR (key2=>val2 AND key3=>val3), addressing issue #711.
Additionally, multiple dependencies were upgraded, core tests improved, and deprecated code cleaned up, enhancing overall stability and maintainability.
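To illustrate how an expression such as key=>val OR (key2=>val2 AND key3=>val3) can be read, here is a minimal, hypothetical sketch of a recursive-descent evaluator. It is not the MetadataFilter code from #1647. The class name `MetadataExpressionSketch`, the `matches` method, the precedence of AND over OR, and the use of a plain `Map<String, List<String>>` in place of StormCrawler's Metadata class are all assumptions made for the example. Values containing whitespace or parentheses are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; NOT StormCrawler's MetadataFilter implementation. */
public class MetadataExpressionSketch {

    private final List<String> tokens = new ArrayList<>();
    private final Map<String, List<String>> metadata;
    private int pos = 0;

    private MetadataExpressionSketch(String expression, Map<String, List<String>> metadata) {
        this.metadata = metadata;
        // Pad parentheses with spaces so that a whitespace split yields clean tokens.
        String padded = expression.replace("(", " ( ").replace(")", " ) ").trim();
        for (String token : padded.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
    }

    /** Returns true if the metadata satisfies the expression (assumed grammar, see below). */
    public static boolean matches(String expression, Map<String, List<String>> metadata) {
        MetadataExpressionSketch parser = new MetadataExpressionSketch(expression, metadata);
        boolean result = parser.parseOr();
        if (parser.pos != parser.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + parser.tokens.get(parser.pos));
        }
        return result;
    }

    // or := and ("OR" and)*
    private boolean parseOr() {
        boolean value = parseAnd();
        while (peekIs("OR")) {
            pos++;
            // Parse the right-hand side first so the position advances even if value is already true.
            boolean rhs = parseAnd();
            value = value || rhs;
        }
        return value;
    }

    // and := atom ("AND" atom)*
    private boolean parseAnd() {
        boolean value = parseAtom();
        while (peekIs("AND")) {
            pos++;
            boolean rhs = parseAtom();
            value = value && rhs;
        }
        return value;
    }

    // atom := "(" or ")" | key=>val
    private boolean parseAtom() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("Unexpected end of expression");
        }
        String token = tokens.get(pos++);
        if (token.equals("(")) {
            boolean value = parseOr();
            if (!peekIs(")")) {
                throw new IllegalArgumentException("Missing closing parenthesis");
            }
            pos++;
            return value;
        }
        int sep = token.indexOf("=>");
        if (sep <= 0) {
            throw new IllegalArgumentException("Expected key=>val but got: " + token);
        }
        String key = token.substring(0, sep);
        String val = token.substring(sep + 2);
        // A key may carry several values; the atom matches if any of them equals val.
        List<String> values = metadata.get(key);
        return values != null && values.contains(val);
    }

    private boolean peekIs(String expected) {
        return pos < tokens.size() && tokens.get(pos).equals(expected);
    }

    public static void main(String[] args) {
        Map<String, List<String>> md = Map.of("key2", List.of("val2"), "key3", List.of("val3"));
        System.out.println(matches("key=>val OR (key2=>val2 AND key3=>val3)", md));
    }
}
```

With the sample metadata in `main`, the first alternative fails because `key` is absent, while the parenthesised group succeeds because both `key2` and `key3` carry the expected values, so the expression as a whole matches. For the actual configuration syntax, refer to the project documentation and pull request #1647.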
Breaking Changes
Users who upgrade and rely on Selenium must now add the new Maven module stormcrawler-selenium as a dependency in their pom.xml:
<dependency>
    <groupId>org.apache.stormcrawler</groupId>
    <artifactId>stormcrawler-selenium</artifactId>
    <version>3.5.0</version>
</dependency>
What's Changed
- Bump testcontainers.version from 1.21.2 to 1.21.3 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1584
- #1580 [Improvement] Convert Tags in GH actions into SHA by @Evergreenies in https://github.com/apache/stormcrawler/pull/1581
- Bump com.microsoft.playwright:playwright from 1.52.0 to 1.53.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1582
- Bump junit.version from 5.13.1 to 5.13.2 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1583
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1585
- Make MultiProxyManagerTest.testMultiProxyManagerConstructorFile() OS independent by @sigee in https://github.com/apache/stormcrawler/pull/1589
- Bump dependencies version (Commons CLI 1.9.0, OpenSearch 2.19.2) by @sigee in https://github.com/apache/stormcrawler/pull/1586
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.5.0 to 3.6.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1593
- Bump junit.version from 5.13.2 to 5.13.3 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1591
- Bump selenium.version from 4.33.0 to 4.34.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1594
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1595
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1596
- Bump tika.version from 3.2.0 to 3.2.1 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1599
- Bump com.github.ben-manes.caffeine:caffeine from 3.2.1 to 3.2.2 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1600
- #108 Replace custom HttpHeaders constants with the org.apache.http.HttpHeaders ones by @sigee in https://github.com/apache/stormcrawler/pull/1587
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1602
- Clean up core test assertions by @sigee in https://github.com/apache/stormcrawler/pull/1603
- Create consistent API for Metadata by @sigee in https://github.com/apache/stormcrawler/pull/1598
- Bump com.github.crawler-commons:crawler-commons from 1.4 to 1.5 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1590
- Bump org.netpreserve:jwarc from 0.31.1 to 0.32.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1606
- Bump aws.version from 1.12.787 to 1.12.788 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1605
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.6.0 to 3.6.1 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1607
- Bump okhttp.version from 4.12.0 to 5.1.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1601
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1610
- Fix java 5 language level issues by @sigee in https://github.com/apache/stormcrawler/pull/1609
- #1560 - Test coverage for Solr cloud by @mvolikas in https://github.com/apache/stormcrawler/pull/1608
- Bump org.apache.solr:solr-solrj from 9.8.1 to 9.9.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1616
- Bump com.microsoft.playwright:playwright from 1.53.0 to 1.54.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1615
- Bump opensearch.version from 2.19.2 to 2.19.3 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1613
- Bump junit.version from 5.13.3 to 5.13.4 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1614
- Bump commons-cli:commons-cli from 1.9.0 to 1.10.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1618
- Bump dev.langchain4j:langchain4j-open-ai from 1.1.0 to 1.2.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1620
- Bump actions/cache from 4.2.3 to 4.2.4 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1622
- Bump storm-client.version from 2.8.1 to 2.8.2 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1619
- Bump dev.langchain4j:langchain4j from 1.1.0 to 1.2.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1621
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1617
- Bump dev.langchain4j:langchain4j-open-ai from 1.2.0 to 1.3.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1623
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1624
- Bump tika.version from 3.2.1 to 3.2.2 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1625
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1626
- Bump actions/checkout from 4.2.2 to 5.0.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1627
- Fix NOP logger configuration in core/pom.xml by @HrishikeshUchake in https://github.com/apache/stormcrawler/pull/1628
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1629
- Bump selenium.version from 4.34.0 to 4.35.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1634
- chore: remove magic number from filterPathRepeat method by @TamimEhsan in https://github.com/apache/stormcrawler/pull/1631
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.11.2 to 3.11.3 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1633
- Bump org.mockito:mockito-core from 5.18.0 to 5.19.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1632
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1635
- Bump actions/setup-java from 4.7.1 to 5.0.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1636
- #1597 replace deprecated use of URL constructor by @TamimEhsan in https://github.com/apache/stormcrawler/pull/1630
- Bump org.jsoup:jsoup from 1.21.1 to 1.21.2 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1637
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1638
- Fix java7 issues by @sigee in https://github.com/apache/stormcrawler/pull/1639
- Bump dev.langchain4j:langchain4j-open-ai from 1.3.0 to 1.4.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1641
- Bump dev.langchain4j:langchain4j from 1.3.0 to 1.4.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1640
- Bump com.microsoft.playwright:playwright from 1.54.0 to 1.55.0 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1643
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1644
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1645
- #1604 - Externalise Selenium by @jnioche in https://github.com/apache/stormcrawler/pull/1646
- Improve MetadataFilter by @sigee in https://github.com/apache/stormcrawler/pull/1647
- Bump org.jetbrains:annotations from 26.0.2 to 26.0.2-1 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1654
- Bump aws.version from 1.12.788 to 1.12.791 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1653
- #1650 - Manage Commons Compress to avoid runtime error with Tika 3.2.2 by @rzo1 in https://github.com/apache/stormcrawler/pull/1652
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1655
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.5.3 to 3.5.4 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1657
- Bump tika.version from 3.2.2 to 3.2.3 by @dependabot[bot] in https://github.com/apache/stormcrawler/pull/1658
- Regenerated License file after dependency upgrades by @github-actions[bot] in https://github.com/apache/stormcrawler/pull/1659
New Contributors
- @Evergreenies made their first contribution in https://github.com/apache/stormcrawler/pull/1581
- @HrishikeshUchake made their first contribution in https://github.com/apache/stormcrawler/pull/1628
- @TamimEhsan made their first contribution in https://github.com/apache/stormcrawler/pull/1631
Full Changelog: https://github.com/apache/stormcrawler/compare/stormcrawler-3.4.0...stormcrawler-3.5.0