Open Source Crawlers in Java
12 projects
Java web data extraction and scraping framework. XML config; XPath, XQuery, regex; HTTP client, HTML/XML parsing, plugin system. CLI and web IDE; outputs WARC-style and structured data.
Ex-Crawler (Inactive)
Configurable Java crawler with distributed/volunteer features. MySQL, MSSQL or PostgreSQL storage; plugin interfaces, socket server, web front-end search engine; graphical client for distributed crawling.
Java Web Crawler (Inactive)
Simple Java web crawling utility; supports the robots exclusion standard. Former Sun developer article; reference implementation.