ScrapeGraphAI
ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents.
Categories: Artificial Intelligence
Type: scrapeGraphAi/v1
Connections
Version: 1
API Key
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| key | Key | STRING |  | true |
| value | Value | STRING |  | true |
Connection Setup
- Log in to the dashboard at https://dashboard.scrapegraphai.com/login.
- Copy the API key. Use these credentials to create a connection in ByteChef.
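ByteChef sends the API key for you once the connection is configured. As a rough sketch of what that amounts to, the snippet below builds (but does not send) an authenticated request; the base URL and the `SGAI-APIKEY` header name are assumptions based on common ScrapeGraphAI client usage, not taken from this page.

```python
import urllib.request

API_KEY = "your-api-key"  # copied from the ScrapeGraphAI dashboard

def build_request(path: str, payload: bytes) -> urllib.request.Request:
    # Attach the key as a header; the request is only constructed here, not sent.
    return urllib.request.Request(
        "https://api.scrapegraphai.com" + path,  # hypothetical base URL
        data=payload,
        headers={"SGAI-APIKEY": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("/v1/markdownify", b'{"website_url": "https://example.com"}')
```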
Actions
Get SmartCrawler Status
Name: getCrawlStatus
Get the status and results of a previous smartcrawl request.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| task_id | Task Id | STRING | The ID of the crawl job task. | true |
Example JSON Structure
{
"label" : "Get SmartCrawler Status",
"name" : "getCrawlStatus",
"parameters" : {
"task_id" : ""
},
"type" : "scrapeGraphAi/v1/getCrawlStatus"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| status | STRING | Overall status of the request. |
| result | OBJECT | The crawl job result. Contains `status` (STRING), `llm_result` (OBJECT), `crawled_urls` (ARRAY of STRING), and `pages` (ARRAY of objects with `url` and `markdown` STRING fields). |
Output Example
{
"status" : "",
"result" : {
"status" : "",
"llm_result" : { },
"crawled_urls" : [ "" ],
"pages" : [ {
"url" : "",
"markdown" : ""
} ]
}
}
Find Task ID
The Task ID is returned in the output of the Start SmartCrawler action; see How to find Task ID under Additional Instructions below.
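Because a crawl runs asynchronously, a workflow typically polls this action until the job finishes. A minimal polling sketch, where `fetch_status` stands in for the Get SmartCrawler Status call (the terminal states are an assumption based on the status values documented for the other actions):

```python
import time

def wait_for_crawl(task_id, fetch_status, poll_seconds=5, max_polls=60):
    """Poll until the crawl reaches a terminal status, then return the response."""
    for _ in range(max_polls):
        response = fetch_status(task_id)
        if response.get("status") in ("completed", "failed"):
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl {task_id!r} did not finish in time")
```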
Markdownify
Name: markdownify
Convert any webpage into clean, readable Markdown format.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| website_url | Website URL | STRING | Website URL. | true |
Example JSON Structure
{
"label" : "Markdownify",
"name" : "markdownify",
"parameters" : {
"website_url" : ""
},
"type" : "scrapeGraphAi/v1/markdownify"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| request_id | STRING | Unique identifier for the request. |
| status | STRING | Status of the request. One of: “queued”, “processing”, “completed”, “failed”. |
| website_url | STRING | The original website URL that was submitted. |
| result | STRING | The webpage content converted to Markdown. |
| error | STRING | Error message if the request failed. Empty string if successful. |
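Given the fields above, consuming a Markdownify response reduces to branching on `status`. A small sketch (the choice to treat "queued" and "processing" as retryable is an assumption):

```python
def extract_markdown(response: dict) -> str:
    """Return the converted Markdown, or raise if the request did not succeed."""
    status = response["status"]
    if status == "completed":
        return response["result"]
    if status == "failed":
        raise RuntimeError("markdownify failed: " + response.get("error", ""))
    # "queued" or "processing": the caller should poll again later.
    raise RuntimeError(f"request {response['request_id']} is still {status}")
```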
Output Example
{
"request_id" : "",
"status" : "",
"website_url" : "",
"result" : "",
"error" : ""
}
Search Scraper
Name: searchScraper
Start an AI-powered web search request.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| user_prompt | User Prompt | STRING | The search query or question you want to ask. | true |
Example JSON Structure
{
"label" : "Search Scraper",
"name" : "searchScraper",
"parameters" : {
"user_prompt" : ""
},
"type" : "scrapeGraphAi/v1/searchScraper"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| request_id | STRING | Unique identifier for the search request. |
| status | STRING | Status of the request. One of: “queued”, “processing”, “completed”, “failed”. |
| user_prompt | STRING | The original search query that was submitted. |
| result | OBJECT | The search results. |
| reference_urls | ARRAY of STRING | List of URLs that were used as references for the answer. |
| error | STRING | Error message if the request failed. Empty string if successful. |
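A common follow-up step is to pair the `result` object with the `reference_urls` it was built from, e.g. when posting the answer to another system. A sketch, using only the field names from the table above:

```python
def summarize_search(response: dict) -> str:
    """Render a completed Search Scraper response with its source URLs."""
    sources = "\n".join("- " + url for url in response.get("reference_urls", []))
    return f"{response['result']}\n\nSources:\n{sources}"
```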
Output Example
{
"request_id" : "",
"status" : "",
"user_prompt" : "",
"result" : { },
"reference_urls" : [ "" ],
"error" : ""
}
Smart Scraper
Name: smartScraper
Extract content from a webpage using AI by providing a natural language prompt and a URL.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| user_prompt | User Prompt | STRING | The search query or question you want to ask. | true |
| website_url | Website URL | STRING | Website URL. | true |
Example JSON Structure
{
"label" : "Smart Scraper",
"name" : "smartScraper",
"parameters" : {
"user_prompt" : "",
"website_url" : ""
},
"type" : "scrapeGraphAi/v1/smartScraper"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| request_id | STRING | Unique identifier for the search request. |
| status | STRING | Status of the request. One of: “queued”, “processing”, “completed”, “failed”. |
| website_url | STRING | The original website URL that was submitted. |
| user_prompt | STRING | The original search query that was submitted. |
| result | OBJECT | The extracted data. |
| error | STRING | Error message if the request failed. Empty string if successful. |
Output Example
{
"request_id" : "",
"status" : "",
"website_url" : "",
"user_prompt" : "",
"result" : { },
"error" : ""
}
Start SmartCrawler
Name: startCrawl
Start a new web crawl request with AI extraction or markdown conversion.
Properties
| Name | Label | Type | Description | Required |
|---|---|---|---|---|
| url | URL | STRING | The starting URL for the crawl. | true |
| prompt | Prompt | STRING | Instructions for data extraction. Required when extraction_mode is true. | false |
| extraction_mode | Extraction Mode | BOOLEAN | When false, enables markdown conversion mode (2 credits per page). Default is true. | false |
| cache_website | Cache Website | BOOLEAN | Whether to cache the website content. | false |
| depth | Depth | INTEGER | Maximum crawl depth. | false |
| max_pages | Max Pages | INTEGER | Maximum number of pages to crawl. | false |
| same_domain_only | Same Domain Only | BOOLEAN | Whether to crawl only the same domain. | false |
| batch_size | Batch Size | INTEGER | Number of pages to process in each batch. | false |
| schema | Schema | OBJECT | JSON Schema object for structured output. | false |
| rules | Rules | OBJECT | Crawl rules for filtering URLs. Contains `exclude`, `include_paths`, and `exclude_paths` (each ARRAY of STRING) and `same_domain` (BOOLEAN). | false |
| sitemap | Sitemap | BOOLEAN | Use sitemap.xml for URL discovery. | false |
| render_heavy_js | Render Heavy JS | BOOLEAN | Enable heavy JavaScript rendering. | false |
| stealth | Stealth | BOOLEAN | Enable stealth mode to bypass bot protection using advanced anti-detection techniques. Adds 4 credits to the request cost. | false |
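The one cross-field rule in this table is that `prompt` is required when `extraction_mode` is true (the default). A sketch of assembling the parameters with that check enforced; the helper name is hypothetical and field names mirror the table:

```python
def build_crawl_params(url, prompt=None, extraction_mode=True, **options):
    """Assemble a Start SmartCrawler parameter dict, checking the prompt rule."""
    if extraction_mode and not prompt:
        raise ValueError("prompt is required when extraction_mode is true")
    params = {"url": url, "extraction_mode": extraction_mode, **options}
    if prompt is not None:
        params["prompt"] = prompt
    return params
```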
Example JSON Structure
{
"label" : "Start SmartCrawler",
"name" : "startCrawl",
"parameters" : {
"url" : "",
"prompt" : "",
"extraction_mode" : false,
"cache_website" : false,
"depth" : 1,
"max_pages" : 1,
"same_domain_only" : false,
"batch_size" : 1,
"schema" : { },
"rules" : {
"exclude" : [ "" ],
"include_paths" : [ "" ],
"exclude_paths" : [ "" ],
"same_domain" : false
},
"sitemap" : false,
"render_heavy_js" : false,
"stealth" : false
},
"type" : "scrapeGraphAi/v1/startCrawl"
}
Output
Type: OBJECT
Properties
| Name | Type | Description |
|---|---|---|
| task_id | STRING | Unique identifier for the crawl task. Use this task_id to retrieve the crawl result. |
Output Example
{
"task_id" : ""
}
What to do if your action is not listed here?
If this component doesn't have the action you need, you can use Custom Action to create your own. Custom Actions empower you to define HTTP requests tailored to your specific requirements, allowing for greater flexibility in integrating with external services or APIs.
To create a Custom Action, simply specify the desired HTTP method, path, and any necessary parameters. This way, you can extend the functionality of your component beyond the predefined actions, ensuring that you can meet all your integration needs effectively.
Additional Instructions
How to find Task ID
Task ID can be found in the output of the following actions:
- Start SmartCrawler