Open Source Crawlers in Java

12 projects

Open-source multi-threaded web crawler for Java. Simple interface; configurable depth, politeness, SSL, proxy, resumable crawling. Maven artifact edu.
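
As a rough illustration of what a "configurable depth" setting means in a crawler like this, the sketch below runs a breadth-first crawl over an in-memory link map and stops following links after a fixed number of hops. It is not this project's API: the class `DepthLimitedCrawl`, the method `crawl`, and the sample link map are all hypothetical, and no network access is involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: not taken from any crawler listed here.
class DepthLimitedCrawl {

    /** Returns pages in visit order, following links at most maxDepth hops from the seed. */
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();          // avoids revisiting a page
        Queue<String> frontier = new ArrayDeque<>(); // pages waiting to be fetched
        Queue<Integer> depths = new ArrayDeque<>();  // depth of each queued page

        frontier.add(seed);
        depths.add(0);
        seen.add(seed);

        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            int depth = depths.remove();
            visited.add(url);

            if (depth == maxDepth) {
                continue; // depth limit reached: do not follow this page's links
            }
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) { // add() returns false if already seen
                    frontier.add(next);
                    depths.add(depth + 1);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Assumed sample site: a links to b and c, b links to d, c links back to a.
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("a"));
        System.out.println(crawl(links, "a", 1)); // [a, b, c]
        System.out.println(crawl(links, "a", 2)); // [a, b, c, d]
    }
}
```

A real crawler would replace the link map with fetched and parsed pages and add a delay between requests to the same host (politeness).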

Internet Archive's open-source, extensible, web-scale archival crawler. Respects robots.txt; writes WARC; configurable and cluster-capable.

Java web data extraction and scraping framework. XML config; XPath, XQuery, regex; HTTP client, HTML/XML parsing, plugin system. CLI and web IDE; outputs WARC-style and structured data.

Bixo (Inactive)

Web mining toolkit as Cascading pipes on Hadoop. Custom pipe assemblies for fetch, parse, analyze, publish; uses Apache Tika. Build specialized web mining apps.

Arachnid (Inactive)

Java web spider framework. Simple HTML parser; subclass Arachnid and implement handleLink (and related handlers) to build a site-crawling spider. Visitor pattern, Ant build.

Ex-Crawler (Inactive)

Configurable Java crawler with distributed/volunteer features. MySQL, MSSQL or PostgreSQL storage; plugin interfaces, socket server, web front-end search engine; graphical client for distributed crawling.

JoBo (Inactive)

Java tool to mirror complete websites. Web spider with form filling and cookie-based session handling for automated login; flexible rules by URL, size, MIME type. GUI and CLI.

JSpider (Inactive)

Configurable, customizable web spider engine in pure Java. LGPL. Use for link checking, site structure analysis, sitemaps, downloading sites; extensible via plugins.

WebEater (Inactive)

Pure Java program for web site retrieval and offline viewing.

WebLech (Inactive)

Fully featured Java web site download/mirror tool. Multithreaded; configurable depth/breadth, URL filtering, checkpointing, basic auth and referer support; MIT licensed.

Simple Java web crawling utility; supports the robots exclusion standard. Former Sun developer article; reference implementation.
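
To make "supports the robots exclusion standard" concrete, here is a minimal sketch of how a crawler might check a URL path against a site's robots.txt before fetching it. It is not this utility's code: the class `RobotsRules` and its methods are hypothetical, and the parser is deliberately simplified. It only reads `Disallow` lines in groups that name `User-agent: *`, treats each rule as a plain path prefix, and ignores `Allow` lines, wildcards, and crawl delays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt check; not a complete implementation of the standard.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects Disallow path prefixes from groups that include "User-agent: *". */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean applies = false;        // does the current group cover all agents?
        boolean prevWasAgent = false;   // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!prevWasAgent) {
                    applies = false; // a new group starts here
                }
                applies |= value.equals("*");
                prevWasAgent = true;
            } else {
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
                prevWasAgent = false;
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed sample robots.txt content.
        String robotsTxt = "User-agent: *\nDisallow: /private/\n\nUser-agent: otherbot\nDisallow: /\n";
        RobotsRules rules = RobotsRules.parse(robotsTxt);
        System.out.println(rules.isAllowed("/private/report.html")); // false
        System.out.println(rules.isAllowed("/public/index.html"));   // true
    }
}
```

A crawler would typically fetch `/robots.txt` once per host, cache the parsed rules, and call a check like `isAllowed` before queuing each URL.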

WebSPHINX (Inactive)

Java class library and Crawler Workbench for building web crawlers. Multithreaded retrieval, page/link model, robot exclusion, pattern matching; CMU research project (Apache-style license).
