# 🌐 ScrapeGraph Python SDK
Official Python SDK for the ScrapeGraph API - Smart web scraping powered by AI.
## 📦 Installation

```bash
pip install scrapegraph-py
```
## 🚀 Features
- 🤖 AI-powered web scraping and search
- 🕷️ Smart crawling with both AI extraction and markdown conversion modes
- 💰 Cost-effective markdown conversion (80% savings vs AI mode)
- 🔄 Both sync and async clients
- 📊 Structured output with Pydantic schemas
- 🔍 Detailed logging
- ⚡ Automatic retries
- 🔐 Secure authentication
## 🎯 Quick Start

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")
```
> [!NOTE]
> You can set the `SGAI_API_KEY` environment variable and initialize the client without parameters: `client = Client()`
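For example, a minimal sketch of environment-based initialization (the `os.environ` line only stands in for exporting the variable in your shell or `.env` file):

```python
import os
from scrapegraph_py import Client

# For illustration only: in practice, export SGAI_API_KEY in your shell
# (e.g. `export SGAI_API_KEY="your-api-key-here"`) instead of setting it here.
os.environ.setdefault("SGAI_API_KEY", "your-api-key-here")

client = Client()  # picks up SGAI_API_KEY from the environment
```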
## 📚 Available Endpoints

### 🤖 SmartScraper
Extract structured data from any webpage or HTML content using AI.
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Using a URL
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)

# Or using HTML content
html_content = """
<html>
    <body>
        <h1>Company Name</h1>
        <p>We are a technology company focused on AI solutions.</p>
    </body>
</html>
"""

response = client.smartscraper(
    website_html=html_content,
    user_prompt="Extract the company description"
)

print(response)
```
#### Output Schema (Optional)
```python
from pydantic import BaseModel, Field
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

class WebsiteData(BaseModel):
    title: str = Field(description="The page title")
    description: str = Field(description="The meta description")

response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the title and description",
    output_schema=WebsiteData
)
```
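If you want typed access to the extracted fields afterwards, you can validate the raw result back into the schema. A minimal sketch, assuming the response is a dict whose `result` key holds the extracted data (check the actual response shape for your SDK version):

```python
# Assumption: response["result"] holds the extracted fields.
# model_validate is the Pydantic v2 API (use parse_obj on v1).
data = WebsiteData.model_validate(response["result"])
print(data.title, data.description)
```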
#### 🍪 Cookies Support
Use cookies for authentication and session management:
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Define cookies for authentication
cookies = {
    "session_id": "abc123def456",
    "auth_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
    "user_preferences": "dark_mode,usd"
}

response = client.smartscraper(
    website_url="https://example.com/dashboard",
    user_prompt="Extract user profile information",
    cookies=cookies
)
```
**Common Use Cases:**
- E-commerce sites: User authentication, shopping cart persistence
- Social media: Session management, user preferences
- Banking/Financial: Secure authentication, transaction history
- News sites: User preferences, subscription content
- API endpoints: Authentication tokens, API keys
#### 🔄 Advanced Features
**Infinite Scrolling:**

```python
response = client.smartscraper(
    website_url="https://example.com/feed",
    user_prompt="Extract all posts from the feed",
    cookies=cookies,
    number_of_scrolls=10  # Scroll 10 times to load more content
)
```
**Pagination:**

```python
response = client.smartscraper(
    website_url="https://example.com/products",
    user_prompt="Extract all product information",
    cookies=cookies,
    total_pages=5  # Scrape 5 pages
)
```
**Combined with Cookies:**

```python
response = client.smartscraper(
    website_url="https://example.com/dashboard",
    user_prompt="Extract user data from all pages",
    cookies=cookies,
    number_of_scrolls=5,
    total_pages=3
)
```
### 🔍 SearchScraper
Perform AI-powered web searches with structured results and reference URLs.
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?"
)

print(f"Answer: {response['result']}")
print(f"Sources: {response['reference_urls']}")
```
#### Output Schema (Optional)
```python
from pydantic import BaseModel, Field
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

class PythonVersionInfo(BaseModel):
    version: str = Field(description="The latest Python version number")
    release_date: str = Field(description="When this version was released")
    major_features: list[str] = Field(description="List of main features")

response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?",
    output_schema=PythonVersionInfo
)
```
### 📝 Markdownify

Convert any webpage into clean, formatted Markdown.
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.markdownify(
    website_url="https://example.com"
)

print(response)
```
### 🕷️ Crawler
Intelligently crawl and extract data from multiple pages with support for both AI extraction and markdown conversion modes.
#### AI Extraction Mode (Default)
Extract structured data from multiple pages using AI:
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Define the data schema for extraction
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "founders": {
            "type": "array",
            "items": {"type": "string"}
        },
        "description": {"type": "string"}
    }
}

response = client.crawl(
    url="https://scrapegraphai.com",
    prompt="extract the company information and founders",
    data_schema=schema,
    depth=2,
    max_pages=5,
    same_domain_only=True
)

# Poll for results (crawl is asynchronous)
crawl_id = response.get("crawl_id")
result = client.get_crawl(crawl_id)
```
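Because the crawl runs asynchronously, a single `get_crawl` call may come back before the crawl has finished, so in practice you poll until it reaches a terminal state. A minimal polling sketch; the `status` field and its `"success"`/`"failed"` values are assumptions here, so check the actual payload returned by your SDK version (the same pattern applies to markdown conversion mode below):

```python
import time

crawl_id = response.get("crawl_id")
result = client.get_crawl(crawl_id)

# Keep polling until the crawl reaches a terminal state.
# The exact status values are an assumption; verify against the real payload.
while result.get("status") not in ("success", "failed"):
    time.sleep(5)  # back off between polls
    result = client.get_crawl(crawl_id)
```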
#### Markdown Conversion Mode (Cost-Effective)
Convert pages to clean markdown without AI processing (80% cheaper):
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.crawl(
    url="https://scrapegraphai.com",
    extraction_mode=False,  # Markdown conversion mode
    depth=2,
    max_pages=5,
    same_domain_only=True,
    sitemap=True  # Use sitemap for better page discovery
)

# Poll for results
crawl_id = response.get("crawl_id")
result = client.get_crawl(crawl_id)

# Access markdown content
for page in result["result"]["pages"]:
    print(f"URL: {page['url']}")
    print(f"Markdown: {page['markdown']}")
    print(f"Metadata: {page['metadata']}")
```
#### 🔧 Crawl Parameters
- `url` (required): Starting URL for the crawl
- `extraction_mode` (default: `True`): `True` = AI extraction mode (requires `prompt` and `data_schema`); `False` = markdown conversion mode (no AI, 80% cheaper)
- `prompt` (required for AI mode): AI prompt to guide data extraction
- `data_schema` (required for AI mode): JSON schema defining the extracted data structure
- `depth` (default: 2): Maximum crawl depth (1-10)
- `max_pages` (default: 2): Maximum pages to crawl (1-100)
- `same_domain_only` (default: `True`): Only crawl pages from the same domain
- `sitemap` (default: `False`): Use the sitemap for better page discovery
- `cache_website` (default: `True`): Cache website content
- `batch_size` (optional): Batch size for processing pages (1-10)
**Cost Comparison:**
- AI Extraction Mode: ~10 credits per page
- Markdown Conversion Mode: ~2 credits per page (80% savings!)
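As a quick sanity check on those figures, here is the arithmetic for a 20-page crawl (the per-page costs are the approximate values listed above):

```python
# Back-of-the-envelope cost estimate using the approximate figures above.
AI_CREDITS_PER_PAGE = 10
MARKDOWN_CREDITS_PER_PAGE = 2
pages = 20

print(f"AI extraction: ~{pages * AI_CREDITS_PER_PAGE} credits")        # ~200 credits
print(f"Markdown mode: ~{pages * MARKDOWN_CREDITS_PER_PAGE} credits")  # ~40 credits
```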
## ⚡ Async Support
All endpoints support async operations:
```python
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient() as client:
        response = await client.smartscraper(
            website_url="https://example.com",
            user_prompt="Extract the main content"
        )
        print(response)

asyncio.run(main())
```
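Since each call is a coroutine, you can also fan several requests out concurrently. A sketch using `asyncio.gather` with placeholder URLs (assuming the async client handles concurrent requests, as asyncio clients generally do):

```python
import asyncio
from scrapegraph_py import AsyncClient

async def scrape_all(urls):
    async with AsyncClient() as client:
        # Launch all requests concurrently and wait for every result.
        return await asyncio.gather(
            *(client.smartscraper(
                website_url=url,
                user_prompt="Extract the main content"
            ) for url in urls)
        )

# Placeholder URLs for illustration
responses = asyncio.run(scrape_all(["https://example.com", "https://example.org"]))
for r in responses:
    print(r)
```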
## 📖 Documentation

For detailed documentation, visit [docs.scrapegraphai.com](https://docs.scrapegraphai.com).
## 🛠️ Development
For information about setting up the development environment and contributing to the project, see our Contributing Guide.
## 💬 Support & Feedback
- 📧 Email: [email protected]
- 💻 GitHub Issues: Create an issue
- 🌟 Feature Requests: Request a feature
- ⭐ API Feedback: You can also submit feedback programmatically using the feedback endpoint:

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

client.submit_feedback(
    request_id="your-request-id",
    rating=5,
    feedback_text="Great results!"
)
```
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by ScrapeGraph AI