scrapegraphai.docloaders package¶
Submodules¶
scrapegraphai.docloaders.browser_base module¶
Browserbase integration module
- scrapegraphai.docloaders.browser_base.browser_base_fetch(api_key: str, project_id: str, link: List[str], text_content: bool = True, async_mode: bool = False) List[str] ¶
BrowserBase Fetch
This module provides an interface to the BrowserBase API.
The browser_base_fetch function takes the following arguments:
- api_key: The API key provided by BrowserBase.
- project_id: The ID of the BrowserBase project to fetch data from.
- link: A list of URLs to fetch data from.
- text_content: A boolean flag to return only the text content (True) or the full HTML (False). Defaults to True.
- async_mode: A boolean flag that determines whether the function runs asynchronously (True) or synchronously (False). Defaults to False.
It initializes a Browserbase object with the given API key and project ID, then uses this object to load the specified links and returns the result of the loading operation.
Example usage:
```
from scrapegraphai.docloaders.browser_base import browser_base_fetch

result = browser_base_fetch(
    api_key="your_api_key",
    project_id="your_project_id",
    link=["https://example.com"],
)
print(result)
```
Replace "your_api_key" and "your_project_id" with your actual BrowserBase API key and project ID.
- Parameters:
api_key (str) – The API key provided by BrowserBase.
project_id (str) – The ID of the project on BrowserBase where you want to fetch data from.
link (List[str]) – The URLs to fetch data from.
text_content (bool) – Whether to return only the text content (True) or the full HTML (False).
async_mode (bool) – Whether to run the function asynchronously (True) or synchronously (False).
- Returns:
The fetched content for each link.
- Return type:
List[str]
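For example, a hedged sketch of fetching several links in one call, based on the signature above; it is assumed here (not confirmed by the docs) that async_mode=True runs the fetches concurrently rather than one after another:
```
from scrapegraphai.docloaders.browser_base import browser_base_fetch

# Sketch based on the documented signature; placeholders stand in for
# real BrowserBase credentials.
pages = browser_base_fetch(
    api_key="your_api_key",
    project_id="your_project_id",
    link=["https://example.com", "https://example.org"],
    text_content=True,
    async_mode=True,
)
for page in pages:
    print(page[:200])  # preview the first 200 characters of each result
```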
scrapegraphai.docloaders.chromium module¶
- class scrapegraphai.docloaders.chromium.ChromiumLoader(urls: List[str], *, backend: str = 'playwright', headless: bool = True, proxy: Optional[Proxy] = None, load_state: str = 'domcontentloaded', requires_js_support: bool = False, **kwargs: Any)¶
Bases: BaseLoader
Scrapes HTML pages from URLs using a (headless) instance of the Chromium web driver with proxy protection.
- backend¶
The web driver backend library; defaults to ‘playwright’.
- browser_config¶
A dictionary containing additional browser kwargs.
- headless¶
Whether to run browser in headless mode.
- proxy¶
A dictionary containing proxy settings; None disables protection.
- urls¶
A list of URLs to scrape content from.
- requires_js_support¶
Flag to determine if JS rendering is required.
- RETRY_LIMIT = 3¶
- TIMEOUT = 10¶
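A minimal construction sketch based on the documented signature; the keyword values shown are the documented defaults, spelled out for illustration:
```
from scrapegraphai.docloaders.chromium import ChromiumLoader

loader = ChromiumLoader(
    ["https://example.com"],
    backend="playwright",           # default web driver backend
    headless=True,                  # run Chromium without a visible window
    load_state="domcontentloaded",  # wait for the DOM before scraping
)
```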
- async alazy_load() AsyncIterator[Document] ¶
Asynchronously load text content from the provided URLs.
This method leverages asyncio to initiate the scraping of all provided URLs simultaneously. It improves performance by utilizing concurrent asynchronous requests. Each Document is yielded as soon as its content is available, encapsulating the scraped content.
- Yields:
Document – A Document object containing the scraped content, along with its source URL as metadata.
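A hedged sketch of consuming the async iterator; the "source" metadata key follows the usual LangChain Document convention and is an assumption here:
```
import asyncio

from scrapegraphai.docloaders.chromium import ChromiumLoader


async def main():
    loader = ChromiumLoader(["https://example.com", "https://example.org"])
    async for doc in loader.alazy_load():
        # "source" is assumed to hold the originating URL
        print(doc.metadata.get("source"), len(doc.page_content))


asyncio.run(main())
```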
- async ascrape_playwright(url: str) str ¶
Asynchronously scrape the content of a given URL using Playwright’s async API.
- Parameters:
url (str) – The URL to scrape.
- Returns:
The scraped HTML content or an error message if an exception occurs.
- Return type:
str
- async ascrape_undetected_chromedriver(url: str) str ¶
Asynchronously scrape the content of a given URL using undetected-chromedriver with Selenium.
- Parameters:
url (str) – The URL to scrape.
- Returns:
The scraped HTML content or an error message if an exception occurs.
- Return type:
str
- async ascrape_with_js_support(url: str) str ¶
Asynchronously scrape the content of a given URL by rendering JavaScript using Playwright.
- Parameters:
url (str) – The URL to scrape.
- Returns:
The fully rendered HTML content after JavaScript execution, or an error message if an exception occurs.
- Return type:
str
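The three ascrape_* coroutines share the same signature, so one hedged sketch covers them; swap in ascrape_undetected_chromedriver or ascrape_with_js_support as needed:
```
import asyncio

from scrapegraphai.docloaders.chromium import ChromiumLoader

loader = ChromiumLoader(["https://example.com"])
# Scrape a single URL directly; per the docs this returns the HTML
# content, or an error message string if an exception occurs.
html = asyncio.run(loader.ascrape_playwright("https://example.com"))
print(html[:200])
```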
- lazy_load() Iterator[Document] ¶
Lazily load text content from the provided URLs.
This method yields Documents one at a time as they’re scraped, instead of waiting to scrape all URLs before returning.
- Yields:
Document – The scraped content encapsulated within a Document object.
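A short synchronous sketch of the same pattern:
```
from scrapegraphai.docloaders.chromium import ChromiumLoader

for doc in ChromiumLoader(["https://example.com"]).lazy_load():
    # each Document is yielded as soon as it is scraped
    print(doc.page_content[:200])
```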
scrapegraphai.docloaders.scrape_do module¶
Scrape_do module
- scrapegraphai.docloaders.scrape_do.scrape_do_fetch(token, target_url, use_proxy=False, geoCode=None, super_proxy=False)¶
Fetches content from the given target URL using the Scrape.do web scraping service.
- Parameters:
token (str) – The API token for Scrape.do service.
target_url (str) – The URL of the web page to fetch.
use_proxy (bool) – Whether to use Scrape.do proxy mode. Default is False.
geoCode (str, optional) – The country code for geolocation-based proxies. Default is None.
super_proxy (bool) – If True, use Residential & Mobile Proxy Networks. Default is False.
- Returns:
The raw response from the target URL.
- Return type:
str
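A hedged usage sketch based on the documented parameters; the token is a placeholder, and the two-letter country-code format for geoCode is an assumption:
```
from scrapegraphai.docloaders.scrape_do import scrape_do_fetch

# Plain fetch, then a proxy-mode fetch pinned to a country.
raw = scrape_do_fetch("your_scrape_do_token", "https://example.com")
proxied = scrape_do_fetch(
    "your_scrape_do_token",
    "https://example.com",
    use_proxy=True,
    geoCode="us",       # assumed country-code format for geolocation
    super_proxy=False,  # True would use Residential & Mobile networks
)
print(raw[:200])
```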
Module contents¶
This module handles document loading functionalities for the ScrapeGraphAI application.