Open Source Crawlers in Java

12 projects

Open-source multi-threaded web crawler for Java. Simple interface; configurable depth, politeness, SSL, proxy, resumable crawling. Maven artifact edu.
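As an illustration of what a "politeness" setting usually means, here is a minimal sketch that enforces a minimum delay between requests to the same host. The class name PolitenessGate and its reserve method are hypothetical and are not taken from any crawler listed here; the clock value is passed in by the caller, so the logic can be checked without sleeping.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of any listed project): per-host request spacing. */
class PolitenessGate {
    private final long delayMillis;
    // host -> earliest time (ms) at which the next request to that host may start
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Returns how many milliseconds the caller should wait before fetching url,
     * and books the following slot for that host.
     */
    synchronized long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + delayMillis);
        return start - nowMillis;
    }
}
```

A caller would sleep for the returned number of milliseconds before issuing the request. Requests to different hosts do not delay each other.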

Heritrix

Internet Archive's open-source, extensible, web-scale archival crawler. Respects robots.txt; writes WARC; configurable and cluster-capable.

Web-Harvest

Java web data extraction and scraping framework. XML config; XPath, XQuery, regex; HTTP client, HTML/XML parsing, plugin system. CLI and web IDE; outputs WARC-style and structured data.

Bixo (inactive)

Web mining toolkit as Cascading pipes on Hadoop. Custom pipe assemblies for fetch, parse, analyze, publish; uses Apache Tika. Build specialized web mining apps.

Arachnid (inactive)

Java web spider framework. Simple HTML parser; subclass Arachnid and implement handleLink (and related handlers) to build a site-crawling spider. Visitor pattern, Ant build.
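The subclass-and-override style described above can be sketched as follows. This is not Arachnid's actual API: SketchSpider, LinkCollector, and the handleLink signature shown here are assumptions made for illustration, and an in-memory link map stands in for fetching and parsing pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical base class; NOT Arachnid's real API. */
abstract class SketchSpider {
    // page -> outgoing links; a stand-in for downloading and parsing HTML
    private final Map<String, List<String>> links;

    SketchSpider(Map<String, List<String>> links) {
        this.links = links;
    }

    /** Called once for each newly discovered page, with the page that linked to it. */
    protected abstract void handleLink(String fromPage, String toPage);

    /** Walks every page reachable from start and invokes the handler. */
    public final void traverse(String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            for (String target : links.getOrDefault(page, List.of())) {
                if (seen.add(target)) { // add() returns false for already-seen pages
                    handleLink(page, target);
                    queue.add(target);
                }
            }
        }
    }
}

/** Example subclass that simply records the links it is told about. */
class LinkCollector extends SketchSpider {
    final List<String> found = new ArrayList<>();

    LinkCollector(Map<String, List<String>> links) {
        super(links);
    }

    @Override
    protected void handleLink(String fromPage, String toPage) {
        found.add(fromPage + " -> " + toPage);
    }
}
```

The base class owns the traversal and the subclass only decides what to do with each link, which is the division of labor the description implies.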

Ex-Crawler (inactive)

Configurable Java crawler with distributed/volunteer features. MySQL, MSSQL or PostgreSQL storage; plugin interfaces, socket server, web front-end search engine; graphical client for distributed crawling.

JoBo (inactive)

Java tool to mirror complete websites. Web spider with form filling and cookie-based session handling for automated login; flexible rules by URL, size, MIME type. GUI and CLI.

JSpider (inactive)

Configurable, customizable web spider engine in pure Java. LGPL. Use for link checking, site structure analysis, sitemaps, downloading sites; extensible via plugins.

WebEater (inactive)

Pure Java program for web site retrieval and offline viewing.

WebLech (inactive)

Fully featured Java web site download/mirror tool. Multithreaded; configurable depth/breadth, URL filtering, checkpointing, basic auth and referer support; MIT licensed.
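To make "configurable depth" and "URL filtering" concrete, the following sketch runs a breadth-first crawl that stops after a fixed number of hops and skips URLs rejected by a filter. It does not use WebLech's code or configuration: BoundedCrawlSketch, its parameters, and the in-memory link graph (used here in place of real HTTP fetches) are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of a depth-limited, filtered breadth-first crawl. */
class BoundedCrawlSketch {
    /** Returns URLs in visit order, at most maxDepth hops away from start. */
    static List<String> crawl(String start, int maxDepth, Predicate<String> urlFilter,
                              Map<String, List<String>> linkGraph) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(start);
        frontier.add(start);
        // Process one depth level per iteration.
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                visited.add(url);
                for (String target : linkGraph.getOrDefault(url, List.of())) {
                    // Queue a link only if the filter accepts it and it is new.
                    if (urlFilter.test(target) && seen.add(target)) {
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}
```

With a filter such as url -> url.startsWith("http://a/") the crawl stays on a single site, which is the usual goal when mirroring.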

Java Web Crawler (inactive)

Simple Java web crawling utility that supports the robots exclusion standard. Originated as a Sun developer article with a reference implementation.
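For readers unfamiliar with the robots exclusion standard, here is a deliberately simplified sketch of how a crawler might honor it. SimpleRobotsRules is a hypothetical class, not code from the utility above. It reads only Disallow rules in groups addressed to "User-agent: *" and ignores Allow lines, wildcards, and agent-specific groups, all of which a production crawler would need to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt check (Disallow prefixes for "*" only). */
class SimpleRobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Parses robots.txt content, keeping Disallow rules from groups that name "*". */
    SimpleRobotsRules(String robotsTxt) {
        boolean applies = false;      // does the current group apply to every agent?
        boolean lastWasAgent = false; // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop trailing comment
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                applies = lastWasAgent ? (applies || star) : star;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    /** Returns false if the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would fetch /robots.txt once per host, build the rules object, and call isAllowed with the path of each URL before requesting it.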

WebSPHINX (inactive)

Java class library and Crawler Workbench for building web crawlers. Multithreaded retrieval, page/link model, robot exclusion, pattern matching; CMU research project (Apache-style license).
