ScrapeGraphAI
ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents.
Categories: Artificial Intelligence
Type: scrapeGraphAi/v1
Connections
Version: 1
API Key
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| key | Key | STRING |  | true |
| value | Value | STRING |  | true |
Connection Setup
- Log in to the dashboard at https://dashboard.scrapegraphai.com/login.
- Copy the API key and use it to create a connection in ByteChef.
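If you call the ScrapeGraphAI API directly rather than through ByteChef, the key is typically passed as a request header. The header name below (`SGAI-APIKEY`) is an assumption; confirm it against the ScrapeGraphAI dashboard documentation:

```python
def sgai_headers(api_key: str) -> dict:
    """Build request headers for a direct ScrapeGraphAI API call.

    Assumption: the API authenticates via an "SGAI-APIKEY" header;
    verify the exact header name in the official docs.
    """
    if not api_key:
        raise ValueError("API key is required")
    return {"SGAI-APIKEY": api_key, "Content-Type": "application/json"}
```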
Actions
Get SmartCrawler Status
Name: getCrawlStatus
Get the status and results of a previous smartcrawl request.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| task_id | Task Id | STRING | The ID of the crawl job task. | true |
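As a quick illustration (not part of ByteChef itself), a small helper can assemble this action's task definition from the table above; the function name and validation are hypothetical:

```python
def get_crawl_status_task(task_id: str) -> dict:
    """Build the task definition for the Get SmartCrawler Status action."""
    if not task_id:
        raise ValueError("task_id is required")
    return {
        "label": "Get SmartCrawler Status",
        "name": "getCrawlStatus",
        "parameters": {"task_id": task_id},
        "type": "scrapeGraphAi/v1/getCrawlStatus",
    }
```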
Example JSON Structure
{
"label" : "Get SmartCrawler Status",
"name" : "getCrawlStatus",
"parameters" : {
"task_id" : ""
},
"type" : "scrapeGraphAi/v1/getCrawlStatus"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| status | STRING | Overall status of the request. |
| result | OBJECT Properties: {STRING(status), {}(llm_result), [STRING](crawled_urls), [{STRING(url), STRING(markdown)}](pages)} | The crawl job result. |
Output Example
{
"status" : "",
"result" : {
"status" : "",
"llm_result" : { },
"crawled_urls" : [ "" ],
"pages" : [ {
"url" : "",
"markdown" : ""
} ]
}
}
Markdownify
Name: markdownify
Convert any webpage into clean, readable Markdown format.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| website_url | Website URL | STRING | Website URL. | true |
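A hypothetical helper (again, just a sketch, not a ByteChef API) that builds this action's task definition and guards against a malformed URL:

```python
def markdownify_task(website_url: str) -> dict:
    """Build the task definition for the Markdownify action."""
    if not website_url.startswith(("http://", "https://")):
        raise ValueError("website_url must be an absolute http(s) URL")
    return {
        "label": "Markdownify",
        "name": "markdownify",
        "parameters": {"website_url": website_url},
        "type": "scrapeGraphAi/v1/markdownify",
    }
```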
Example JSON Structure
{
"label" : "Markdownify",
"name" : "markdownify",
"parameters" : {
"website_url" : ""
},
"type" : "scrapeGraphAi/v1/markdownify"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| request_id | STRING | Unique identifier for the request. |
| status | STRING | Status of the request. One of: “queued”, “processing”, “completed”, “failed”. |
| website_url | STRING | The original website URL that was submitted. |
| result | STRING | The search results. |
| error | STRING | Error message if the request failed. Empty string if successful. |
Output Example
{
"request_id" : "",
"status" : "",
"website_url" : "",
"result" : "",
"error" : ""
}
Search Scraper
Name: searchScraper
Start an AI-powered web search request.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| user_prompt | User Prompt | STRING | The search query or question you want to ask. | true |
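An illustrative sketch (names are hypothetical) that builds the task definition and checks a returned status against the documented status values:

```python
# Status values documented for this action's output.
TERMINAL_STATUSES = {"completed", "failed"}

def search_scraper_task(user_prompt: str) -> dict:
    """Build the task definition for the Search Scraper action."""
    if not user_prompt:
        raise ValueError("user_prompt is required")
    return {
        "label": "Search Scraper",
        "name": "searchScraper",
        "parameters": {"user_prompt": user_prompt},
        "type": "scrapeGraphAi/v1/searchScraper",
    }

def is_done(output: dict) -> bool:
    """True once the request has left the queued/processing states."""
    return output.get("status") in TERMINAL_STATUSES
```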
Example JSON Structure
{
"label" : "Search Scraper",
"name" : "searchScraper",
"parameters" : {
"user_prompt" : ""
},
"type" : "scrapeGraphAi/v1/searchScraper"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| request_id | STRING | Unique identifier for the search request. |
| status | STRING | Status of the request. One of: “queued”, “processing”, “completed”, “failed”. |
| user_prompt | STRING | The original search query that was submitted. |
| result | OBJECT Properties: {} | The search results. |
| reference_urls | ARRAY Items: [STRING] | List of URLs that were used as references for the answer. |
| error | STRING | Error message if the request failed. Empty string if successful. |
Output Example
{
"request_id" : "",
"status" : "",
"user_prompt" : "",
"result" : { },
"reference_urls" : [ "" ],
"error" : ""
}
Smart Scraper
Name: smartScraper
Extract content from a webpage using AI by providing a natural language prompt and a URL.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| user_prompt | User Prompt | STRING | The search query or question you want to ask. | true |
| website_url | Website URL | STRING | Website URL. | true |
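Since this action takes two required parameters, a sketch of a builder (hypothetical, not part of ByteChef) that reports all missing ones at once:

```python
def smart_scraper_task(user_prompt: str, website_url: str) -> dict:
    """Build the task definition for the Smart Scraper action."""
    missing = [name for name, value in
               (("user_prompt", user_prompt), ("website_url", website_url))
               if not value]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return {
        "label": "Smart Scraper",
        "name": "smartScraper",
        "parameters": {"user_prompt": user_prompt, "website_url": website_url},
        "type": "scrapeGraphAi/v1/smartScraper",
    }
```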
Example JSON Structure
{
"label" : "Smart Scraper",
"name" : "smartScraper",
"parameters" : {
"user_prompt" : "",
"website_url" : ""
},
"type" : "scrapeGraphAi/v1/smartScraper"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| request_id | STRING | Unique identifier for the search request. |
| status | STRING | Status of the request. One of: “queued”, “processing”, “completed”, “failed”. |
| website_url | STRING | The original website URL that was submitted. |
| user_prompt | STRING | The original search query that was submitted. |
| result | OBJECT Properties: {} | The extracted content. |
| error | STRING | Error message if the request failed. Empty string if successful. |
Output Example
{
"request_id" : "",
"status" : "",
"website_url" : "",
"user_prompt" : "",
"result" : { },
"error" : ""
}
Start SmartCrawler
Name: startCrawl
Start a new web crawl request with AI extraction or markdown conversion.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| url | URL | STRING | The starting URL for the crawl. | true |
| prompt | Prompt | STRING | Instructions for data extraction. Required when extraction_mode is true. | false |
| extraction_mode | Extraction Mode | BOOLEAN Options: true, false | When false, enables markdown conversion mode (2 credits per page). Default is true. | false |
| cache_website | Cache Website | BOOLEAN Options: true, false | Whether to cache the website content. | false |
| depth | Depth | INTEGER | Maximum crawl depth. | false |
| max_pages | Max Pages | INTEGER | Maximum number of pages to crawl. | false |
| same_domain_only | Same Domain Only | BOOLEAN Options: true, false | Whether to crawl only the same domain. | false |
| batch_size | Batch Size | INTEGER | Number of pages to process in each batch. | false |
| schema | Schema | OBJECT Properties: {} | JSON Schema object for structured output. | false |
| rules | Rules | OBJECT Properties: {[STRING](exclude), [STRING](include_paths), [STRING](exclude_paths), BOOLEAN(same_domain)} | Crawl rules for filtering URLs. | false |
| sitemap | Sitemap | BOOLEAN Options: true, false | Use sitemap.xml for discovery. | false |
| render_heavy_js | Render Heavy JS | BOOLEAN Options: true, false | Enable heavy JavaScript rendering. | false |
| stealth | Stealth | BOOLEAN Options: true, false | Enable stealth mode to bypass bot protection using advanced anti-detection techniques. Adds +4 credits to the request cost. | false |
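The one conditional rule in this table (prompt is required when extraction_mode is true) is easy to get wrong; a hypothetical builder sketch that enforces it:

```python
def start_crawl_task(url: str, prompt=None, extraction_mode: bool = True,
                     **options) -> dict:
    """Build the task definition for the Start SmartCrawler action.

    prompt is required when extraction_mode is True (the default);
    remaining optional parameters (depth, max_pages, ...) pass through
    as keyword arguments.
    """
    if not url:
        raise ValueError("url is required")
    if extraction_mode and not prompt:
        raise ValueError("prompt is required when extraction_mode is true")
    parameters = {"url": url, "extraction_mode": extraction_mode, **options}
    if prompt is not None:
        parameters["prompt"] = prompt
    return {
        "label": "Start SmartCrawler",
        "name": "startCrawl",
        "parameters": parameters,
        "type": "scrapeGraphAi/v1/startCrawl",
    }
```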
Example JSON Structure
{
"label" : "Start SmartCrawler",
"name" : "startCrawl",
"parameters" : {
"url" : "",
"prompt" : "",
"extraction_mode" : false,
"cache_website" : false,
"depth" : 1,
"max_pages" : 1,
"same_domain_only" : false,
"batch_size" : 1,
"schema" : { },
"rules" : {
"exclude" : [ "" ],
"include_paths" : [ "" ],
"exclude_paths" : [ "" ],
"same_domain" : false
},
"sitemap" : false,
"render_heavy_js" : false,
"stealth" : false
},
"type" : "scrapeGraphAi/v1/startCrawl"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| task_id | STRING | Unique identifier for the crawl task. Use this task_id to retrieve the crawl result. |
Output Example
{
"task_id" : ""
}