scrapegraphai.nodes package¶
Submodules¶
scrapegraphai.nodes.base_node module¶
This module defines the base node class for the ScrapeGraphAI application.
- class scrapegraphai.nodes.base_node.BaseNode(node_name: str, node_type: str, input: str, output: List[str], min_input_len: int = 1, node_config: Optional[dict] = None)¶
Bases:
ABC
An abstract base class for nodes in a graph-based workflow, designed to perform specific actions when executed.
- node_name¶
The unique identifier name for the node.
- Type:
str
- input¶
Boolean expression defining the input keys needed from the state.
- Type:
str
- output¶
List of output keys to be updated in the state.
- Type:
List[str]
- min_input_len¶
Minimum required number of input keys.
- Type:
int
- node_config¶
Additional configuration for the node.
- Type:
Optional[dict]
- logger¶
The centralized root logger.
- Type:
logging.Logger
- Parameters:
node_name (str) – Name for identifying the node.
node_type (str) – Type of the node; must be ‘node’ or ‘conditional_node’.
input (str) – Expression defining the input keys needed from the state.
output (List[str]) – List of output keys to be updated in the state.
min_input_len (int, optional) – Minimum required number of input keys; defaults to 1.
node_config (Optional[dict], optional) – Additional configuration for the node; defaults to None.
- Raises:
ValueError – If node_type is not one of the allowed types.
Example
>>> from scrapegraphai.nodes.base_node import BaseNode
>>> class MyNode(BaseNode):
...     def execute(self, state):
...         # Implementation of node logic here
...         return state
...
>>> my_node = MyNode("ExampleNode", "node", "input_spec", ["output_spec"])
>>> updated_state = my_node.execute({'key': 'value'})
>>> updated_state
{'key': 'value'}
- abstract execute(state: dict) dict ¶
Execute the node’s logic based on the current state and update it accordingly.
- Parameters:
state (dict) – The current state of the graph.
- Returns:
The updated state after executing the node’s logic.
- Return type:
dict
- get_input_keys(state: dict) List[str] ¶
Determines the necessary state keys based on the input specification.
- Parameters:
state (dict) – The current state of the graph used to parse input keys.
- Returns:
A list of input keys required for node operation.
- Return type:
List[str]
- Raises:
ValueError – If an error occurs while parsing the input keys.
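To make the idea of an input expression concrete, here is a minimal sketch of how such an expression could be resolved against a state dict. It is not the library's parser: the function name `resolve_input_keys`, the `|` and `&` operators, and the absence of parenthesis support are all assumptions made for illustration.

```python
from typing import List


def resolve_input_keys(state: dict, expression: str) -> List[str]:
    """Hypothetical stand-in for an input-key resolver.

    Assumes `|` separates alternatives and `&` joins keys that must all be
    present; parentheses are not supported in this sketch.
    """
    for alternative in expression.split("|"):
        keys = [key.strip() for key in alternative.split("&") if key.strip()]
        # Take the first alternative whose keys are all present in the state.
        if keys and all(key in state for key in keys):
            return keys
    raise ValueError(f"No state keys matched the expression: {expression!r}")


state = {"user_prompt": "List the titles", "doc": "<html></html>"}
print(resolve_input_keys(state, "user_prompt & parsed_doc | user_prompt & doc"))
# prints ['user_prompt', 'doc']
```

The first alternative fails because `parsed_doc` is missing from the state, so the second one is returned. When no alternative matches, the sketch raises ValueError, mirroring the documented behaviour.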
- update_config(params: dict, overwrite: bool = False)¶
Updates the node_config dictionary, as well as any attributes with the same key.
- Parameters:
params (dict) – The dictionary to update node_config with.
overwrite (bool) – Flag indicating whether the values of node_config should be overwritten if their value is not None.
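The overwrite rule can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's code: `ConfigurableNode` is hypothetical, and the rule that an existing non-None value is kept unless `overwrite=True` is inferred from the parameter description above.

```python
from typing import Optional


class ConfigurableNode:
    """Hypothetical stand-in for a node; not the library's BaseNode."""

    def __init__(self, node_config: Optional[dict] = None):
        self.node_config = node_config or {}
        # Example attribute that shares its name with a config key.
        self.verbose = self.node_config.get("verbose")

    def update_config(self, params: dict, overwrite: bool = False) -> None:
        for key, value in params.items():
            # Assumed rule: a value that is already set (not None) is kept
            # unless overwrite=True.
            already_set = self.node_config.get(key) is not None
            if already_set and not overwrite:
                continue
            self.node_config[key] = value
            # Mirror the value onto an attribute with the same key, if any.
            if hasattr(self, key):
                setattr(self, key, value)


node = ConfigurableNode({"verbose": True, "headless": None})
node.update_config({"verbose": False, "headless": True})
print(node.node_config)  # verbose is kept, headless is filled in
node.update_config({"verbose": False}, overwrite=True)
print(node.node_config)  # verbose is now replaced
```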
scrapegraphai.nodes.concat_answers_node module¶
scrapegraphai.nodes.conditional_node module¶
scrapegraphai.nodes.description_node module¶
scrapegraphai.nodes.fetch_node module¶
FetchNode Module
- class scrapegraphai.nodes.fetch_node.FetchNode(input: str, output: List[str], node_config: Optional[dict] = None, node_name: str = 'Fetch')¶
Bases:
BaseNode
A node responsible for fetching the HTML content of a specified URL and updating the graph’s state with this content. It uses ChromiumLoader to fetch the content from a web page asynchronously (with proxy protection).
This node acts as a starting point in many scraping workflows, preparing the state with the necessary HTML content for further processing by subsequent nodes in the graph.
- headless¶
A flag indicating whether the browser should run in headless mode.
- Type:
bool
- verbose¶
A flag indicating whether to print verbose output during execution.
- Type:
bool
- Parameters:
input (str) – Boolean expression defining the input keys needed from the state.
output (List[str]) – List of output keys to be updated in the state.
node_config (Optional[dict]) – Additional configuration for the node.
node_name (str) – The unique identifier name for the node, defaulting to “Fetch”.
- execute(state)¶
Executes the node’s logic to fetch HTML content from a specified URL and update the state with this content.
- handle_directory(state, input_type, source)¶
Handles the directory by compressing the source document and updating the state.
- Parameters:
state (dict) – The current state of the graph.
input_type (str) – The type of input being processed.
source (str) – The source document to be compressed.
- Returns:
The updated state with the compressed document.
- Return type:
dict
- handle_file(state, input_type, source)¶
Loads the content of a file based on its input type.
- Parameters:
state (dict) – The current state of the graph.
input_type (str) – The type of the input file (e.g., “pdf”, “csv”, “json”, “xml”, “md”).
source (str) – The path to the source file.
- Returns:
The updated state with the compressed document.
- Return type:
dict
The function supports the following input types:
- “pdf”: Uses PyPDFLoader to load the content of a PDF file.
- “csv”: Reads the content of a CSV file using pandas and converts it to a string.
- “json”: Loads the content of a JSON file.
- “xml”: Reads the content of an XML file as a string.
- “md”: Reads the content of a Markdown file as a string.
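The per-type dispatch described above can be sketched with the standard library alone. The helper name `read_source` is hypothetical, and the “pdf” and “csv” branches are deliberately left out because they depend on PyPDFLoader and pandas; treat this as an illustration of the pattern, not the node's implementation.

```python
import json
import tempfile
from pathlib import Path


def read_source(path: str, input_type: str) -> str:
    """Hypothetical helper: return a file's content as a string by type."""
    if input_type == "json":
        # Parse and re-serialise so malformed JSON fails early.
        with open(path, encoding="utf-8") as handle:
            return json.dumps(json.load(handle))
    if input_type in ("xml", "md"):
        # XML and Markdown are read as plain text.
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported input type in this sketch: {input_type!r}")


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_text("# Title\n", encoding="utf-8")
    print(read_source(str(note), "md"))
```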
- handle_local_source(state, source)¶
Handles the local source by fetching HTML content, optionally converting it to Markdown, and updating the state.
- Parameters:
state (dict) – The current state of the graph.
source (str) – The HTML content from the local source.
- Returns:
The updated state with the processed content.
- Return type:
dict
- Raises:
ValueError – If the source is empty or contains only whitespace.
- handle_web_source(state, source)¶
Handles the web source by fetching HTML content from a URL, optionally converting it to Markdown, and updating the state.
- Parameters:
state (dict) – The current state of the graph.
source (str) – The URL of the web source to fetch HTML content from.
- Returns:
The updated state with the processed content.
- Return type:
dict
- Raises:
ValueError – If the fetched HTML content is empty or contains only whitespace.
- is_valid_url(source: str) bool ¶
Validates if the source string is a valid URL using regex.
- Parameters:
source (str) – The URL string to validate.
- Raises:
ValueError – If the URL is invalid.
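As an illustration of regex-based URL validation that raises on failure, consider the sketch below. The function name `check_url` and the pattern are assumptions chosen for this example; the pattern the library actually uses may differ.

```python
import re

# Hypothetical pattern: an http or https scheme, then one character that is
# neither whitespace nor "/", then any run of non-whitespace characters.
URL_PATTERN = re.compile(r"^https?://[^\s/]\S*$", re.IGNORECASE)


def check_url(source: str) -> bool:
    """Return True for a matching URL, raise ValueError otherwise."""
    if not URL_PATTERN.match(source):
        raise ValueError(f"Invalid URL: {source!r}")
    return True


print(check_url("https://example.com/page"))  # prints True
```

Raising instead of returning False lets a caller stop the fetch step early with a clear error message.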
- load_file_content(source, input_type)¶
Loads the content of a file based on its input type.
- Parameters:
source (str) – The path to the source file.
input_type (str) – The type of the input file (e.g., “pdf”, “csv”, “json”, “xml”, “md”).
- Returns:
A list containing a Document object with the loaded content and metadata.
- Return type:
list