scrapegraphai.nodes package

Submodules

scrapegraphai.nodes.base_node module

This module defines the base node class for the ScrapeGraphAI application.

class scrapegraphai.nodes.base_node.BaseNode(node_name: str, node_type: str, input: str, output: List[str], min_input_len: int = 1, node_config: Optional[dict] = None)

Bases: ABC

An abstract base class for nodes in a graph-based workflow, designed to perform specific actions when executed.

node_name

The unique identifier name for the node.

Type:

str

input

Boolean expression defining the input keys needed from the state.

Type:

str

output

List of output keys to be updated in the state.

Type:

List[str]

min_input_len

Minimum required number of input keys.

Type:

int

node_config

Additional configuration for the node.

Type:

Optional[dict]

logger

The centralized root logger.

Type:

logging.Logger

Parameters:
  • node_name (str) – Name for identifying the node.

  • node_type (str) – Type of the node; must be ‘node’ or ‘conditional_node’.

  • input (str) – Expression defining the input keys needed from the state.

  • output (List[str]) – List of output keys to be updated in the state.

  • min_input_len (int, optional) – Minimum required number of input keys; defaults to 1.

  • node_config (Optional[dict], optional) – Additional configuration for the node; defaults to None.

Raises:

ValueError – If node_type is not one of the allowed types.

Example

>>> class MyNode(BaseNode):
...     def execute(self, state):
...         # Implementation of node logic here
...         return state
...
>>> my_node = MyNode("ExampleNode", "node", "input_spec", ["output_spec"])
>>> updated_state = my_node.execute({'key': 'value'})
>>> updated_state
{'key': 'value'}

abstract execute(state: dict) → dict

Execute the node’s logic based on the current state and update it accordingly.

Parameters:

state (dict) – The current state of the graph.

Returns:

The updated state after executing the node’s logic.

Return type:

dict

get_input_keys(state: dict) → List[str]

Determines the necessary state keys based on the input specification.

Parameters:

state (dict) – The current state of the graph used to parse input keys.

Returns:

A list of input keys required for node operation.

Return type:

List[str]

Raises:

ValueError – If an error occurs while parsing the input keys.
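
As a rough illustration of how an input expression could be matched against the state, here is a minimal sketch. It assumes expressions combine keys with `&` (all keys required) and `|` (alternatives) and contain no parentheses. `resolve_input_keys` is a hypothetical helper written for this example, not the library's parser.

```python
from typing import List


def resolve_input_keys(expression: str, state: dict) -> List[str]:
    """Return the state keys that satisfy a simple input expression.

    Hypothetical helper: supports '&' and '|' only, no parentheses.
    '|' separates alternatives; '&' joins keys that must all be present.
    """
    for alternative in expression.split("|"):
        keys = [key.strip() for key in alternative.split("&")]
        # Accept the first alternative whose keys are all present in the state.
        if all(keys) and all(key in state for key in keys):
            return keys
    raise ValueError(f"No state keys match the expression: {expression!r}")


state = {"user_prompt": "List the titles", "doc": "<html>...</html>"}
print(resolve_input_keys("url | doc", state))          # ['doc']
print(resolve_input_keys("user_prompt & doc", state))  # ['user_prompt', 'doc']
```

In this sketch the first satisfied alternative wins, and a ValueError is raised when none matches, mirroring the Raises entry above.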

update_config(params: dict, overwrite: bool = False)

Updates the node_config dictionary as well as any attributes with the same key.

Parameters:
  • params (dict) – The dictionary to update node_config with.

  • overwrite (bool) – Flag indicating whether the values of node_config should be overwritten if their value is not None.
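
One way to read the overwrite rule is sketched below. This is an interpretation of the description above, not the library's code: `ConfigurableNode` is a hypothetical stand-in class, and the rule "keep an existing non-None value unless overwrite is set" is an assumption.

```python
from typing import Optional


class ConfigurableNode:
    """Hypothetical stand-in for a node; not the library's BaseNode."""

    def __init__(self, node_config: Optional[dict] = None):
        self.node_config = node_config or {}
        self.verbose = self.node_config.get("verbose")

    def update_config(self, params: dict, overwrite: bool = False) -> None:
        for key, value in params.items():
            # Assumed rule: keep an existing non-None value unless overwrite is set.
            if self.node_config.get(key) is not None and not overwrite:
                continue
            self.node_config[key] = value
            # Mirror the value onto an attribute with the same key, if one exists.
            if hasattr(self, key):
                setattr(self, key, value)


node = ConfigurableNode({"verbose": False})
node.update_config({"verbose": True, "headless": True})
print(node.node_config)  # {'verbose': False, 'headless': True}
node.update_config({"verbose": True}, overwrite=True)
print(node.verbose)      # True
```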

scrapegraphai.nodes.concat_answers_node module

scrapegraphai.nodes.conditional_node module

scrapegraphai.nodes.description_node module

scrapegraphai.nodes.fetch_node module

FetchNode Module

class scrapegraphai.nodes.fetch_node.FetchNode(input: str, output: List[str], node_config: Optional[dict] = None, node_name: str = 'Fetch')

Bases: BaseNode

A node responsible for fetching the HTML content of a specified URL and updating the graph’s state with this content. It uses ChromiumLoader to fetch the content from a web page asynchronously (with proxy protection).

This node acts as a starting point in many scraping workflows, preparing the state with the necessary HTML content for further processing by subsequent nodes in the graph.

headless

A flag indicating whether the browser should run in headless mode.

Type:

bool

verbose

A flag indicating whether to print verbose output during execution.

Type:

bool

Parameters:
  • input (str) – Boolean expression defining the input keys needed from the state.

  • output (List[str]) – List of output keys to be updated in the state.

  • node_config (Optional[dict]) – Additional configuration for the node.

  • node_name (str) – The unique identifier name for the node, defaulting to “Fetch”.

execute(state)

Executes the node’s logic to fetch HTML content from a specified URL and update the state with this content.

handle_directory(state, input_type, source)

Handles the directory by compressing the source document and updating the state.

Parameters:
  • state (dict) – The current state of the graph.

  • input_type (str) – The type of input being processed.

  • source (str) – The source document to be compressed.

Returns:

The updated state with the compressed document.

Return type:

dict

handle_file(state, input_type, source)

Loads the content of a file based on its input type.

Parameters:
  • state (dict) – The current state of the graph.

  • input_type (str) – The type of the input file (e.g., “pdf”, “csv”, “json”, “xml”, “md”).

  • source (str) – The path to the source file.

Returns:

The updated state with the compressed document.

Return type:

dict

The function supports the following input types:
  • “pdf” – Uses PyPDFLoader to load the content of a PDF file.

  • “csv” – Reads the content of a CSV file using pandas and converts it to a string.

  • “json” – Loads the content of a JSON file.

  • “xml” – Reads the content of an XML file as a string.

  • “md” – Reads the content of a Markdown file as a string.
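
The dispatch on input type described above can be sketched with the standard library alone. `read_source` is a hypothetical helper for illustration; it returns a plain string and deliberately leaves out “pdf” and “csv”, which the documentation says rely on PyPDFLoader and pandas.

```python
import json
import tempfile
from pathlib import Path


def read_source(source: str, input_type: str) -> str:
    """Hypothetical helper: return the content of a file as a string.

    Covers only the types readable with the standard library.
    """
    path = Path(source)
    if input_type == "json":
        # Parse and re-serialize so malformed JSON fails early.
        with path.open(encoding="utf-8") as handle:
            return json.dumps(json.load(handle))
    if input_type in ("xml", "md"):
        return path.read_text(encoding="utf-8")
    raise ValueError(f"Unsupported input type: {input_type!r}")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.md"
    sample.write_text("# Title\n", encoding="utf-8")
    print(read_source(str(sample), "md"))
```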

handle_local_source(state, source)

Handles the local source by fetching HTML content, optionally converting it to Markdown, and updating the state.

Parameters:
  • state (dict) – The current state of the graph.

  • source (str) – The HTML content from the local source.

Returns:

The updated state with the processed content.

Return type:

dict

Raises:

ValueError – If the source is empty or contains only whitespace.

handle_web_source(state, source)

Handles the web source by fetching HTML content from a URL, optionally converting it to Markdown, and updating the state.

Parameters:
  • state (dict) – The current state of the graph.

  • source (str) – The URL of the web source to fetch HTML content from.

Returns:

The updated state with the processed content.

Return type:

dict

Raises:

ValueError – If the fetched HTML content is empty or contains only whitespace.

is_valid_url(source: str) → bool

Validates if the source string is a valid URL using regex.

Parameters:

source (str) – The URL string to validate.

Raises:

ValueError – If the URL is invalid.
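
A regex-based check of this kind might look like the sketch below. The pattern and the name `check_url` are assumptions made for illustration; the pattern the library actually uses is not shown in this reference and may accept a different set of URLs.

```python
import re

# Assumed pattern: http(s) scheme, a host, optional port, optional path.
URL_PATTERN = re.compile(r"^https?://[A-Za-z0-9.-]+(?::\d+)?(?:/\S*)?$")


def check_url(source: str) -> bool:
    """Hypothetical helper: return True for a valid URL, raise otherwise."""
    if not URL_PATTERN.match(source):
        raise ValueError(f"Invalid URL format: {source!r}")
    return True


print(check_url("https://example.com/page?x=1"))  # True
try:
    check_url("not a url")
except ValueError as error:
    print(error)
```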

load_file_content(source, input_type)

Loads the content of a file based on its input type.

Parameters:
  • source (str) – The path to the source file.

  • input_type (str) – The type of the input file (e.g., “pdf”, “csv”, “json”, “xml”, “md”).

Returns:

A list containing a Document object with the loaded content and metadata.

Return type:

list

scrapegraphai.nodes.fetch_node_level_k module

scrapegraphai.nodes.fetch_screen_node module

scrapegraphai.nodes.generate_answer_csv_node module

scrapegraphai.nodes.generate_answer_from_image_node module

scrapegraphai.nodes.generate_answer_node module

scrapegraphai.nodes.generate_answer_node_k_level module

scrapegraphai.nodes.generate_answer_omni_node module

scrapegraphai.nodes.generate_code_node module

scrapegraphai.nodes.generate_scraper_node module

scrapegraphai.nodes.get_probable_tags_node module

GetProbableTagsNode Module

class scrapegraphai.nodes.get_probable_tags_node.GetProbableTagsNode(input: str, output: List[str], node_config: dict, node_name: str = 'GetProbableTags')

Bases: BaseNode

A node that utilizes a language model to identify probable HTML tags within a document that are likely to contain the information relevant to a user’s query. This node generates a prompt describing the task, submits it to the language model, and processes the output to produce a list of probable tags.

llm_model

An instance of the language model client used for tag predictions.

Parameters:
  • input (str) – Boolean expression defining the input keys needed from the state.

  • output (List[str]) – List of output keys to be updated in the state.

  • node_config (dict) – Additional configuration for the node, including the language model.

  • node_name (str) – The unique identifier name for the node, defaulting to “GetProbableTags”.

execute(state: dict) → dict

Generates a list of probable HTML tags based on the user’s input and updates the state with this list. The method constructs a prompt for the language model, submits it, and parses the output to identify probable tags.

Parameters:

state (dict) – The current state of the graph. The input keys will be used to fetch the correct data types from the state.

Returns:

The updated state with the output key containing a list of probable HTML tags.

Return type:

dict

Raises:

KeyError – If input keys are not found in the state, indicating that the necessary information for generating tag predictions is missing.
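
The prompt-submit-parse flow described for execute can be sketched as follows. Everything here is illustrative: `probable_tags`, `fake_llm`, the prompt wording, the state keys, and the comma-separated output format are assumptions, and the fake model stands in for a real language model client.

```python
from typing import Callable, List


def probable_tags(user_prompt: str, url: str, llm: Callable[[str], str]) -> List[str]:
    """Hypothetical helper: ask a model for tags and parse a comma-separated reply."""
    prompt = (
        "List the HTML tags most likely to contain the answer to the question, "
        "as a comma-separated list.\n"
        f"Question: {user_prompt}\n"
        f"Page: {url}\n"
    )
    raw = llm(prompt)
    # Split on commas and drop empty or whitespace-only entries.
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed answer."""
    return "h1, article , .price"


state = {"user_prompt": "What is the price?", "url": "https://example.com"}
state["tags"] = probable_tags(state["user_prompt"], state["url"], fake_llm)
print(state["tags"])  # ['h1', 'article', '.price']
```

Indexing the state directly, as in the last lines, raises KeyError when a required key is missing, which matches the Raises entry above.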

scrapegraphai.nodes.graph_iterator_node module

scrapegraphai.nodes.html_analyzer_node module

scrapegraphai.nodes.image_to_text_node module

scrapegraphai.nodes.merge_answers_node module

scrapegraphai.nodes.merge_generated_scripts_node module

scrapegraphai.nodes.parse_node module

scrapegraphai.nodes.parse_node_depth_k_node module

scrapegraphai.nodes.prompt_refiner_node module

scrapegraphai.nodes.rag_node module

scrapegraphai.nodes.reasoning_node module

scrapegraphai.nodes.robots_node module

scrapegraphai.nodes.search_internet_node module

scrapegraphai.nodes.search_node_with_context module

scrapegraphai.nodes.text_to_speech_node module

Module contents