WebCrawler API

Website to LLM Data API for Developers

Description

WebCrawler API offers a robust solution for developers needing to extract data from websites. This API simplifies the complex process of web crawling, allowing users to provide a website link and receive the content of every page in various formats, including Markdown, Text, and HTML. It is specifically designed to gather data suitable for training Large Language Models (LLMs) and for use in Retrieval-Augmented Generation (RAG) systems.
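To make the flow concrete, here is a minimal sketch of submitting a crawl job over HTTP. The base URL, endpoint path, and payload field names below are assumptions for illustration, not the documented WebCrawler API contract.

```python
# Illustrative crawl submission; endpoint and field names are assumed.
import json
import urllib.request

API_BASE = "https://api.webcrawlerapi.com"  # assumed base URL
VALID_FORMATS = {"markdown", "text", "html"}  # the three formats described above


def build_crawl_payload(url: str, fmt: str = "markdown") -> dict:
    """Validate the requested output format and build the request body."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return {"url": url, "format": fmt}


def start_crawl(api_key: str, url: str, fmt: str = "markdown") -> dict:
    """POST the crawl job and return the service's JSON response."""
    req = urllib.request.Request(
        f"{API_BASE}/v1/crawl",  # assumed path
        data=json.dumps(build_crawl_payload(url, fmt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The same request shape maps directly onto the official client libraries, which wrap authentication and serialization for you.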

The service addresses common web crawling challenges such as handling internal links, rendering JavaScript-heavy pages, navigating anti-bot measures like CAPTCHAs and IP blocks, managing large-scale storage, scaling crawler instances, and cleaning raw HTML into usable formats. Integration is straightforward, with client libraries available for popular languages like NodeJS, Python, PHP, and .NET, enabling developers to focus on utilizing the data rather than managing crawling infrastructure.
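Crawling an entire site can take a while, so a client would typically poll for completion rather than block on one request. The loop below assumes an asynchronous job endpoint and a `status` field; both are illustrative, not the documented API.

```python
# Hypothetical polling loop with capped exponential backoff.
import json
import time
import urllib.request

API_BASE = "https://api.webcrawlerapi.com"  # assumed base URL


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0, n: int = 8):
    """Yield capped exponential backoff delays: base, base*factor, ... up to cap."""
    delay = base
    for _ in range(n):
        yield min(delay, cap)
        delay *= factor


def wait_for_crawl(api_key: str, job_id: str) -> dict:
    """Poll the (assumed) job-status endpoint until the crawl finishes."""
    for delay in backoff_delays():
        req = urllib.request.Request(
            f"{API_BASE}/v1/crawl/{job_id}",  # assumed path
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            job = json.load(resp)
        if job.get("status") in ("done", "error"):  # assumed status values
            return job
        time.sleep(delay)
    raise TimeoutError(f"crawl job {job_id} did not finish in time")
```

Backoff keeps polling polite on long crawls; the client libraries presumably handle this waiting internally.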

Key Features

  • API Access: Provides a developer API for programmatic web crawling.
  • Multiple Output Formats: Extracts content as Markdown, plain Text, or raw HTML.
  • Handles JS Rendering: Processes JavaScript-heavy websites accurately.
  • Anti-Bot Evasion: Manages CAPTCHAs, IP blocks, and rate limits.
  • Link Management: Handles internal links, removes duplicates, and cleans URLs.
  • Scalable Infrastructure: Capable of handling numerous crawlers and large volumes of pages.
  • Automated Data Cleaning: Converts raw HTML into clean text or Markdown.
  • Usage-Based Pricing: Pay only for successful crawl requests, no subscriptions.
  • Unlimited Proxy Included: Proxy usage is included with the service, with no cap.
  • Developer Libraries: Offers client libraries for NodeJS, Python, PHP, .NET.
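The "Automated Data Cleaning" feature converts raw HTML into clean text or Markdown on the service side. For intuition, the core of such a pass can be sketched with the standard library: strip tags, drop script and style contents, keep visible text. This is an illustration of the technique, not WebCrawler API's actual pipeline.

```python
# Minimal HTML-to-text pass using only the standard library.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

A production cleaner also handles entities, whitespace normalization, and boilerplate removal, which is exactly the work the API takes off your hands.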

Use Cases

  • Training Large Language Models (LLMs) with website data.
  • Populating Retrieval-Augmented Generation (RAG) systems.
  • Automated data scraping for various applications.
  • Content extraction for analysis and research.
  • Building datasets from web sources.
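For the RAG use case above, crawled Markdown or text is usually split into overlapping chunks before embedding. The splitter below is a generic word-window sketch of that step, not part of WebCrawler API itself.

```python
# Generic overlapping word-window chunker for RAG ingestion.
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk so context is not cut mid-thought."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

Each chunk would then be embedded and indexed; the overlap trades a little index size for better retrieval at chunk boundaries.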

Frequently Asked Questions

Can I use crawled data in RAG or train my own AI model?

Yes, the service is designed to provide data suitable for RAG systems and for training your own AI models.

Do I need to pay a subscription to use WebcrawlerAPI?

No, WebcrawlerAPI uses a pay-as-you-go pricing model with no subscription fees. You only pay for successful requests.

What if I need help with integration?

Email support is available to assist with integration.
