scrapegraphai.utils package

Submodules

scrapegraphai.utils.cleanup_code module

This utility function extracts the code from a given string.

scrapegraphai.utils.cleanup_code.extract_code(code: str) → str

Extracts the code from the given string.

scrapegraphai.utils.cleanup_html module

Module for cleaning up and minifying HTML content.

scrapegraphai.utils.cleanup_html.cleanup_html(html_content: str, base_url: str) → str

Processes HTML content by removing unnecessary tags, minifying the HTML, and extracting the title and body content.

Parameters:
  • html_content (str) – The HTML content to be processed.

  • base_url (str) – The base URL of the page, used when resolving links found in the content.

Returns:

A string combining the parsed title and the minified body content. If no body content is found, it indicates so.

Return type:

str

Example

>>> html_content = "<html><head><title>Example</title></head><body><p>Hello World!</p></body></html>"
>>> cleanup_html(html_content, "https://example.com")
'Title: Example, Body: <body><p>Hello World!</p></body>'

This function is particularly useful for preparing HTML content for environments where bandwidth usage needs to be minimized.

scrapegraphai.utils.cleanup_html.minify()

scrapegraphai.utils.cleanup_html.minify_html(html)

Minifies the given HTML content.

scrapegraphai.utils.cleanup_html.reduce_html(html, reduction)

Reduces the size of the HTML content based on the specified level of reduction.

Parameters:
  • html (str) – The HTML content to reduce.

  • reduction (int) – The level of reduction to apply to the HTML content. 0: minification only; 1: minification and removal of unnecessary tags and attributes; 2: minification, removal of unnecessary tags and attributes, simplification of text content, and removal of the head tag.

Returns:

The reduced HTML content based on the specified reduction level.

Return type:

str
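
Example

An illustrative sketch of the three reduction levels; the exact output depends on the minifier, so no literal result is shown:

>>> html = "<html><head><title>T</title><style>p{}</style></head><body><p>  Hello  </p></body></html>"
>>> minified = reduce_html(html, 0)   # minification only
>>> stripped = reduce_html(html, 1)   # also removes unnecessary tags and attributes
>>> smallest = reduce_html(html, 2)   # also simplifies text and drops the head tag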

scrapegraphai.utils.code_error_analysis module

This module contains the functions that generate prompts for various types of code error analysis.

Functions:
  • syntax_focused_analysis: Focuses on syntax-related errors in the generated code.

  • execution_focused_analysis: Focuses on execution-related errors, including generated code and HTML analysis.

  • validation_focused_analysis: Focuses on validation-related errors, considering JSON schema and execution result.

  • semantic_focused_analysis: Focuses on semantic differences in generated code based on a comparison result.

scrapegraphai.utils.code_error_analysis.execution_focused_analysis(state: dict, llm_model) → str

Analyzes the execution errors in the generated code and HTML code.

Parameters:
  • state (dict) – Contains the ‘generated_code’, ‘errors’, ‘html_code’, and ‘html_analysis’.

  • llm_model – The language model used for generating the analysis.

Returns:

The result of the execution error analysis.

Return type:

str

scrapegraphai.utils.code_error_analysis.semantic_focused_analysis(state: dict, comparison_result: Dict[str, Any], llm_model) → str

Analyzes the semantic differences in the generated code based on a comparison result.

Parameters:
  • state (dict) – Contains the ‘generated_code’.

  • comparison_result (Dict[str, Any]) – Contains the ‘differences’ and ‘explanation’ of the comparison.

  • llm_model – The language model used for generating the analysis.

Returns:

The result of the semantic error analysis.

Return type:

str

scrapegraphai.utils.code_error_analysis.syntax_focused_analysis(state: dict, llm_model) → str

Analyzes the syntax errors in the generated code.

Parameters:
  • state (dict) – Contains the ‘generated_code’ and ‘errors’ related to syntax.

  • llm_model – The language model used for generating the analysis.

Returns:

The result of the syntax error analysis.

Return type:

str
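
Example

A minimal usage sketch; the model class and the exact shape of the ‘errors’ entry are assumptions for illustration, not guaranteed by this API:

>>> from langchain_openai import ChatOpenAI  # assumed model class
>>> llm_model = ChatOpenAI(model="gpt-4o-mini")
>>> state = {
...     "generated_code": "def parse(html)\n    return html",  # missing colon
...     "errors": {"syntax": "SyntaxError: invalid syntax (line 1)"},
... }
>>> analysis = syntax_focused_analysis(state, llm_model)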

scrapegraphai.utils.code_error_analysis.validation_focused_analysis(state: dict, llm_model) → str

Analyzes the validation errors in the generated code based on a JSON schema.

Parameters:
  • state (dict) – Contains the ‘generated_code’, ‘errors’, ‘json_schema’, and ‘execution_result’.

  • llm_model – The language model used for generating the analysis.

Returns:

The result of the validation error analysis.

Return type:

str

scrapegraphai.utils.code_error_correction module

This module contains the functions for code generation to correct different types of errors.

Functions:
  • syntax_focused_code_generation: Generates corrected code based on syntax error analysis.

  • execution_focused_code_generation: Generates corrected code based on execution error analysis.

  • validation_focused_code_generation: Generates corrected code based on validation error analysis, considering JSON schema.

  • semantic_focused_code_generation: Generates corrected code based on semantic error analysis, comparing generated and reference results.

scrapegraphai.utils.code_error_correction.execution_focused_code_generation(state: dict, analysis: str, llm_model) → str

Generates corrected code based on execution error analysis.

Parameters:
  • state (dict) – Contains the ‘generated_code’.

  • analysis (str) – The analysis of the execution errors.

  • llm_model – The language model used for generating the corrected code.

Returns:

The corrected code.

Return type:

str

scrapegraphai.utils.code_error_correction.semantic_focused_code_generation(state: dict, analysis: str, llm_model) → str

Generates corrected code based on semantic error analysis.

Parameters:
  • state (dict) – Contains the ‘generated_code’, ‘execution_result’, and ‘reference_answer’.

  • analysis (str) – The analysis of the semantic differences.

  • llm_model – The language model used for generating the corrected code.

Returns:

The corrected code.

Return type:

str

scrapegraphai.utils.code_error_correction.syntax_focused_code_generation(state: dict, analysis: str, llm_model) → str

Generates corrected code based on syntax error analysis.

Parameters:
  • state (dict) – Contains the ‘generated_code’.

  • analysis (str) – The analysis of the syntax errors.

  • llm_model – The language model used for generating the corrected code.

Returns:

The corrected code.

Return type:

str
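
Example

Continuing the sketch under syntax_focused_analysis above, the analysis text can be fed back to produce corrected code (same assumed model instance):

>>> fixed_code = syntax_focused_code_generation(
...     {"generated_code": state["generated_code"]}, analysis, llm_model
... )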

scrapegraphai.utils.code_error_correction.validation_focused_code_generation(state: dict, analysis: str, llm_model) → str

Generates corrected code based on validation error analysis.

Parameters:
  • state (dict) – Contains the ‘generated_code’ and ‘json_schema’.

  • analysis (str) – The analysis of the validation errors.

  • llm_model – The language model used for generating the corrected code.

Returns:

The corrected code.

Return type:

str

scrapegraphai.utils.convert_to_md module

convert_to_md module

scrapegraphai.utils.convert_to_md.convert_to_md(html: str, url: Optional[str] = None) → str

Convert HTML to Markdown.

This function uses the html2text library to convert the provided HTML content to Markdown format. The function returns the converted Markdown content as a string.

Parameters:
  • html (str) – The HTML content to be converted.

  • url (Optional[str]) – An optional base URL for the HTML content.

Returns:

The equivalent Markdown content.

Return type:

str

Example

>>> convert_to_md("<html><body><p>This is a paragraph.</p><h1>This is a heading.</h1></body></html>")
'This is a paragraph.\n\n# This is a heading.'

Note: All the styles and links are ignored during the conversion.

scrapegraphai.utils.copy module

copy module

exception scrapegraphai.utils.copy.DeepCopyError

Bases: Exception

Custom exception raised when an object cannot be deep-copied.

scrapegraphai.utils.copy.is_boto3_client(obj)

Checks whether the given object is a boto3 client.

scrapegraphai.utils.copy.safe_deepcopy(obj: Any) → Any

Safely create a deep copy of an object, handling special cases.

Parameters:

obj – Object to copy

Returns:

Deep copy of the object

Raises:

DeepCopyError – If object cannot be deep copied
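
Example

A short sketch of typical use; the lock below is just an example of an object that may not be deep-copyable, and whether it raises depends on how the special case is handled:

>>> safe_deepcopy({"config": {"retries": 3}})
{'config': {'retries': 3}}
>>> import threading
>>> try:
...     safe_deepcopy({"client": threading.Lock()})
... except DeepCopyError:
...     pass  # objects that cannot be deep-copied may raise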

scrapegraphai.utils.custom_callback module

Custom callback for LLM token usage statistics.

This module has been taken and modified from the OpenAI callback manager in langchain-community. https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/callbacks/openai_info.py

class scrapegraphai.utils.custom_callback.CustomCallbackHandler(llm_model_name: str)

Bases: BaseCallbackHandler

Callback Handler that tracks LLMs info.

property always_verbose: bool

Whether to call verbose callbacks even if verbose is False.

completion_tokens: int = 0

on_llm_end(response: LLMResult, **kwargs: Any) → None

Collect token usage.

on_llm_new_token(token: str, **kwargs: Any) → None

Print out the token.

on_llm_start(serialized: Dict[str, Any], prompts: List[str], **kwargs: Any) → None

Print out the prompts.

prompt_tokens: int = 0

successful_requests: int = 0

total_cost: float = 0.0

total_tokens: int = 0

scrapegraphai.utils.custom_callback.get_custom_callback(llm_model_name: str)

Function to get custom callback for LLM token usage statistics.

scrapegraphai.utils.custom_callback.get_token_cost_for_model(model_name: str, num_tokens: int, is_completion: bool = False) → float

Get the cost in USD for a given model and number of tokens.

Parameters:
  • model_name – Name of the model

  • num_tokens – Number of tokens.

  • is_completion – Whether the model is used for completion or not. Defaults to False.

Returns:

Cost in USD.
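
Example

An illustrative call, using the per-1k input price of 'open-mistral-7b' (0.00025 USD) from MODEL_COST_PER_1K_TOKENS_INPUT under scrapegraphai.utils.model_costs below; the arithmetic assumes a simple num_tokens / 1000 * price computation:

>>> get_token_cost_for_model("open-mistral-7b", 2000)
0.0005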

scrapegraphai.utils.data_export module

This module provides functions to export data to various file formats.

scrapegraphai.utils.data_export.export_to_csv(data: List[Dict[str, Any]], filename: str) → None

Export data to a CSV file.

Parameters:
  • data – List of dictionaries containing the data to export

  • filename – Name of the file to save the CSV data

scrapegraphai.utils.data_export.export_to_json(data: List[Dict[str, Any]], filename: str) → None

Export data to a JSON file.

Parameters:
  • data – List of dictionaries containing the data to export

  • filename – Name of the file to save the JSON data

scrapegraphai.utils.data_export.export_to_xml(data: List[Dict[str, Any]], filename: str, root_element: str = 'data') → None

Export data to an XML file.

Parameters:
  • data – List of dictionaries containing the data to export

  • filename – Name of the file to save the XML data

  • root_element – Name of the root element in the XML structure
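
Example

A quick usage sketch for the three exporters; the filenames and data are arbitrary:

>>> rows = [
...     {"title": "Example", "price": "9.99"},
...     {"title": "Other", "price": "4.50"},
... ]
>>> export_to_json(rows, "products.json")
>>> export_to_csv(rows, "products.csv")
>>> export_to_xml(rows, "products.xml", root_element="products")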

scrapegraphai.utils.dict_content_compare module

This module contains utility functions for comparing the content of two dictionaries.

Functions:
  • normalize_dict: Recursively normalizes the values in a dictionary, converting strings to lowercase and stripping whitespace.

  • normalize_list: Recursively normalizes the values in a list, converting strings to lowercase and stripping whitespace.

  • are_content_equal: Compares two dictionaries for semantic equality after normalization.

scrapegraphai.utils.dict_content_compare.are_content_equal(generated_result: Dict[str, Any], reference_result: Dict[str, Any]) → bool

Compares two dictionaries for semantic equality after normalization.

Parameters:
  • generated_result (Dict[str, Any]) – The generated result dictionary.

  • reference_result (Dict[str, Any]) – The reference result dictionary.

Returns:

True if the normalized dictionaries are equal, False otherwise.

Return type:

bool
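
Example

Normalization lowercases strings and strips whitespace, so these compare as equal:

>>> are_content_equal({"name": "  Alice "}, {"name": "alice"})
True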

scrapegraphai.utils.dict_content_compare.normalize_dict(d: Dict[str, Any]) → Dict[str, Any]

Recursively normalizes the values in a dictionary.

Parameters:

d (Dict[str, Any]) – The dictionary to normalize.

Returns:

A normalized dictionary with strings converted to lowercase and stripped of whitespace.

Return type:

Dict[str, Any]

scrapegraphai.utils.dict_content_compare.normalize_list(lst: List[Any]) → List[Any]

Recursively normalizes the values in a list.

Parameters:

lst (List[Any]) – The list to normalize.

Returns:

A normalized list with strings converted to lowercase and stripped of whitespace.

Return type:

List[Any]

scrapegraphai.utils.llm_callback_manager module

This module provides a custom callback manager for LLM models.

Classes:
  • CustomLLMCallbackManager: Manages exclusive access to callbacks for different types of LLM models.

class scrapegraphai.utils.llm_callback_manager.CustomLLMCallbackManager

Bases: object

CustomLLMCallbackManager class provides a mechanism to acquire a callback for LLM models in an exclusive, thread-safe manner.

Attributes:
  • _lock (threading.Lock): Ensures that only one callback can be acquired at a time.

Methods:
  • exclusive_get_callback: A context manager that yields the appropriate callback based on the LLM model and its name, ensuring exclusive access to the callback.

exclusive_get_callback(llm_model, llm_model_name)

Provides an exclusive callback for the LLM model in a thread-safe manner.

Parameters:
  • llm_model – The LLM model instance (e.g., ChatOpenAI, AzureChatOpenAI, ChatBedrock).

  • llm_model_name (str) – The name of the LLM model, used for model-specific callbacks.

Yields:

The appropriate callback for the LLM model, or None if the lock is unavailable.
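
Example

A usage sketch, assuming a LangChain ChatOpenAI instance; the work done inside the with block is illustrative:

>>> from langchain_openai import ChatOpenAI  # assumed model class
>>> llm_model = ChatOpenAI(model="gpt-4o-mini")
>>> manager = CustomLLMCallbackManager()
>>> with manager.exclusive_get_callback(llm_model, "gpt-4o-mini") as callback:
...     # run LLM work here; the callback, if not None, accumulates token usage
...     if callback is not None:
...         print(callback.total_tokens)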

scrapegraphai.utils.logging module

A centralized logging system for any library. This module provides functions to manage logging for a library: getting and setting the verbosity level, adding and removing handlers, and controlling propagation. It also includes a function to set formatting for all handlers bound to the root logger. Source code inspired by: https://gist.github.com/DiTo97/9a0377f24236b66134eb96da1ec1693f

scrapegraphai.utils.logging.get_logger(name: Optional[str] = None) → Logger

Get a logger with the specified name.

If no name is provided, the root logger for the library is returned.

Parameters:
  • name (Optional[str]) – The name of the logger.

  • None (If) –

  • returned. (the root logger for the library is) –

Returns:

The logger with the specified name.

Return type:

logging.Logger

scrapegraphai.utils.logging.get_verbosity() → int

Get the current verbosity level of the root logger for the library.

Returns:

The current verbosity level of the root logger for the library.

Return type:

int

scrapegraphai.utils.logging.setDEFAULT_HANDLER() → None

Add the default handler to the root logger for the library.

scrapegraphai.utils.logging.set_formatting() → None

Set formatting for all handlers bound to the root logger for the library.

The formatting is set to: “[levelname|filename:lineno] time >> message”

scrapegraphai.utils.logging.set_handler(handler: Handler) → None

Add a handler to the root logger for the library.

Parameters:

handler (logging.Handler) – The handler to add.

scrapegraphai.utils.logging.set_propagation() → None

Enable propagation of the root logger for the library.

scrapegraphai.utils.logging.set_verbosity(verbosity: int) → None

Set the verbosity level of the root logger for the library.

Parameters:

verbosity (int) – The verbosity level to set.

scrapegraphai.utils.logging.set_verbosity_debug() → None

Set the verbosity level of the root logger for the library to DEBUG.

scrapegraphai.utils.logging.set_verbosity_error() → None

Set the verbosity level of the root logger for the library to ERROR.

scrapegraphai.utils.logging.set_verbosity_fatal() → None

Set the verbosity level of the root logger for the library to FATAL.

scrapegraphai.utils.logging.set_verbosity_info() → None

Set the verbosity level of the root logger for the library to INFO.

scrapegraphai.utils.logging.set_verbosity_warning() → None

Set the verbosity level of the root logger for the library to WARNING.

scrapegraphai.utils.logging.unsetDEFAULT_HANDLER() → None

Remove the default handler from the root logger for the library.

Remove the default handler from the root logger for the library.

scrapegraphai.utils.logging.unset_formatting() → None

Remove formatting for all handlers bound to the root logger for the library.

scrapegraphai.utils.logging.unset_handler(handler: Handler) → None

Remove a handler from the root logger for the library.

Parameters:

handler (logging.Handler) – The handler to remove.

scrapegraphai.utils.logging.unset_propagation() → None

Disable propagation of the root logger for the library.

scrapegraphai.utils.logging.warning_once(self, *args, **kwargs)

Emit a warning log with the same message only once.

This function is added as a method to the logging.Logger class. It emits a warning log with the same message only once, even if it is called multiple times with the same message.

Parameters:
  • *args – The arguments to pass to the logging.Logger.warning method.

  • **kwargs – The keyword arguments to pass to the logging.Logger.warning method.
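
Example

A typical setup sketch combining the module's handler, verbosity, and formatting helpers:

>>> from scrapegraphai.utils.logging import (
...     setDEFAULT_HANDLER, set_formatting, set_verbosity_debug,
... )
>>> setDEFAULT_HANDLER()
>>> set_verbosity_debug()
>>> set_formatting()  # "[levelname|filename:lineno] time >> message"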

scrapegraphai.utils.model_costs module

Cost in USD per 1k input tokens.

scrapegraphai.utils.model_costs.MODEL_COST_PER_1K_TOKENS_INPUT = {'a121.ju-mid-v1': 0.0125, 'a121.ju-ultra-v1': 0.0188, 'ai21.jamba-instruct-v1:0': 0.0005, 'amazon.titan-text-express-v1': 0.0002, 'amazon.titan-text-lite-v1': 0.00015, 'amazon.titan-text-premier-v1:0': 0.0005, 'codestral': 0.0002, 'codestral-2405': 0.0002, 'cohere.command-light-text-v14': 0.0003, 'cohere.command-r-plus-v1:0': 0.003, 'cohere.command-r-v1:0': 0.0005, 'cohere.command-text-v14': 0.0015, 'meta.llama2-13b-chat-v1': 0.00075, 'meta.llama2-70b-chat-v1': 0.00195, 'meta.llama3-1-405b-instruct-v1:0': 0.00532, 'meta.llama3-1-70b-instruct-v1:0': 0.00099, 'meta.llama3-1-8b-instruct-v1:0': 0.00022, 'meta.llama3-70b-instruct-v1:0': 0.00265, 'meta.llama3-8b-instruct-v1:0': 0.0003, 'mistral-large': 0.002, 'mistral-large-2407': 0.002, 'mistral-medium-latest': 0.00275, 'mistral-small': 0.0002, 'mistral-small-2409': 0.0002, 'mistral-small-latest': 0.001, 'mistral.mistral-7b-instruct-v0:2': 0.00015, 'mistral.mistral-large-2402-v1:0': 0.004, 'mistral.mistral-large-2407-v1:0': 0.002, 'mistral.mistral-small-2402-v1:0': 0.001, 'mistral.mixtral-7x8b-instruct-v0:1': 0.00045, 'open-mistral-7b': 0.00025, 'open-mistral-nemo': 0.00015, 'open-mistral-nemo-2407': 0.00015, 'open-mixtral-8x22b': 0.002, 'open-mixtral-8x7b': 0.0007, 'pixtral-12b': 0.00015, 'pixtral-12b-2409': 0.00015}

Cost in USD per 1k output tokens.

scrapegraphai.utils.model_costs.MODEL_COST_PER_1K_TOKENS_OUTPUT

scrapegraphai.utils.output_parser module

scrapegraphai.utils.parse_state_keys module

parse_state_keys module

scrapegraphai.utils.parse_state_keys.parse_expression(expression, state: dict) → list

Parses a complex boolean expression involving state keys.

Parameters:
  • expression (str) – The boolean expression to parse.

  • state (dict) – Dictionary of state keys used to evaluate the expression.

Raises:
ValueError – If the expression is empty, has adjacent state keys without operators, invalid operator usage, unbalanced parentheses, or if no state keys match the expression.

Returns:

A list of state keys that match the boolean expression, ensuring each key appears only once.

Return type:

list

Example

>>> parse_expression("user_input & (relevant_chunks | parsed_document | document)",
                    {"user_input": None, "document": None,
                    "parsed_document": None, "relevant_chunks": None})
['user_input', 'relevant_chunks', 'parsed_document', 'document']

This function evaluates the expression to determine the logical inclusion of state keys based on provided boolean logic. It checks for syntax errors such as unbalanced parentheses, incorrect adjacency of operators, and empty expressions.

scrapegraphai.utils.prettify_exec_info module

Prettify the execution information of the graph.

scrapegraphai.utils.prettify_exec_info.prettify_exec_info(complete_result: list[dict]) → DataFrame

Transforms the execution information of a graph into a DataFrame for enhanced visualization.

Parameters:

complete_result (list[dict]) – The complete execution information of the graph.

Returns:

A DataFrame that organizes the execution information for better readability and analysis.

Return type:

pd.DataFrame

Example

>>> prettify_exec_info([{'node': 'A', 'status': 'success'},
  {'node': 'B', 'status': 'failure'}])
DataFrame with columns 'node' and 'status' showing execution results for each node.

scrapegraphai.utils.proxy_rotation module

Module for rotating proxies

class scrapegraphai.utils.proxy_rotation.Proxy

Bases: dict

Proxy server information.

bypass: str

criteria: ProxyBrokerCriteria

password: str

server: str

username: str

class scrapegraphai.utils.proxy_rotation.ProxyBrokerCriteria

Bases: TypedDict

Proxy broker criteria.

anonymous: bool

countryset: Set[str]

search_outside_if_empty: bool

secure: bool

timeout: float

class scrapegraphai.utils.proxy_rotation.ProxySettings

Bases: TypedDict

Proxy settings.

bypass: str

password: str

server: str

username: str

scrapegraphai.utils.proxy_rotation.is_ipv4_address(address: str) → bool

Checks whether a proxy address is a valid IPv4 address.

scrapegraphai.utils.proxy_rotation.parse_or_search_proxy(proxy: Proxy) → ProxySettings

Parses a proxy configuration or searches for a new one matching the specified broker criteria.

Parameters:

proxy – The proxy configuration to parse or search for.

Returns:

A ‘playwright’ compliant proxy configuration.

Notes

  • If the proxy server is an IP address, it is assumed to be a proxy server address.

  • If the proxy server is ‘broker’, a proxy server is searched for based on the provided broker criteria.

Example

>>> proxy = {
...     "server": "broker",
...     "criteria": {
...         "anonymous": True,
...         "countryset": {"GB", "US"},
...         "secure": True,
...         "timeout": 5.0,
...         "search_outside_if_empty": False
...     }
... }
>>> parse_or_search_proxy(proxy)
{
    "server": "<proxy-server-matching-criteria>",
}

Example

>>> proxy = {
...     "server": "192.168.1.1:8080",
...     "username": "<username>",
...     "password": "<password>"
... }
>>> parse_or_search_proxy(proxy)
{
    "server": "192.168.1.1:8080",
    "username": "<username>",
    "password": "<password>"
}

scrapegraphai.utils.proxy_rotation.search_proxy_servers(anonymous: bool = True, countryset: Optional[Set[str]] = None, secure: bool = False, timeout: float = 5.0, max_shape: int = 5, search_outside_if_empty: bool = True) → List[str]

Searches for proxy servers that match the specified broker criteria.

Parameters:
  • anonymous – whether proxy servers should have minimum level-1 anonymity.

  • countryset – admissible proxy servers locations.

  • secure – whether proxy servers should support HTTPS; defaults to False (HTTP).

  • timeout – The maximum timeout for proxy responses; defaults to 5.0 seconds.

  • max_shape – The maximum number of proxy servers to return; defaults to 5.

  • search_outside_if_empty – whether countryset should be extended if empty.

Returns:

A list of proxy server URLs matching the criteria.

Example

>>> search_proxy_servers(
...     anonymous=True,
...     countryset={"GB", "US"},
...     secure=True,
...     timeout=1.0,
...     max_shape=2
... )
[
    "http://103.10.63.135:8080",
    "http://113.20.31.250:8080",
]

scrapegraphai.utils.research_web module

scrapegraphai.utils.save_audio_from_bytes module

This utility function saves the byte response as an audio file.

scrapegraphai.utils.save_audio_from_bytes.save_audio_from_bytes(byte_response: bytes, output_path: Union[str, Path]) → None

Saves the byte response as an audio file to the specified path.

Parameters:
  • byte_response (bytes) – The byte array containing audio data.

  • output_path (Union[str, Path]) – The destination file path where the audio file will be saved.

Example

>>> save_audio_from_bytes(b'audio data', 'path/to/audio.mp3')

This function writes the byte array containing audio data to a file, saving it as an audio file.

scrapegraphai.utils.save_code_to_file module

save_code_to_file module

scrapegraphai.utils.save_code_to_file.save_code_to_file(code: str, filename: str) → None

Saves the generated code to a Python file.

Parameters:
  • code (str) – The generated code to be saved.

  • filename (str) – name of the output file
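
Example

A one-line usage sketch (the filename is arbitrary):

>>> save_code_to_file("print('hello world')", "generated_script.py")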

scrapegraphai.utils.schema_trasform module

This utility function transforms the Pydantic schema into a more comprehensible schema.

scrapegraphai.utils.schema_trasform.transform_schema(pydantic_schema)

Transform the pydantic schema into a more comprehensible JSON schema.

Parameters:

pydantic_schema (dict) – The pydantic schema.

Returns:

The transformed JSON schema.

Return type:

dict
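
Example

A usage sketch with a Pydantic v2 model; model_json_schema() is the standard Pydantic accessor, and the exact shape of the transformed schema is not shown here:

>>> from pydantic import BaseModel
>>> class Product(BaseModel):
...     title: str
...     price: float
>>> simple_schema = transform_schema(Product.model_json_schema())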

scrapegraphai.utils.split_text_into_chunks module

split_text_into_chunks module

scrapegraphai.utils.split_text_into_chunks.split_text_into_chunks(text: str, chunk_size: int, model: BaseChatModel, use_semchunk=True) → List[str]

Splits the text into chunks based on the number of tokens.

Parameters:
  • text (str) – The text to split.

  • chunk_size (int) – The maximum number of tokens per chunk.

  • model (BaseChatModel) – The model whose tokenizer is used to count tokens.

  • use_semchunk (bool) – Whether to chunk with the semchunk library; defaults to True.

Returns:

A list of text chunks.

Return type:

List[str]
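
Example

A usage sketch, assuming a LangChain-compatible chat model (used only to count tokens):

>>> from langchain_openai import ChatOpenAI  # assumed model class
>>> long_text = "word " * 5000
>>> chunks = split_text_into_chunks(long_text, chunk_size=1000, model=ChatOpenAI(model="gpt-4o-mini"))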

scrapegraphai.utils.sys_dynamic_import module

High-level module for dynamic importing of Python modules at runtime.

Source code inspired by: https://gist.github.com/DiTo97/46f4b733396b8d7a8f1d4d22db902cfc

scrapegraphai.utils.sys_dynamic_import.dynamic_import(modname: str, message: str = '') → None

Imports a Python module at runtime.

Parameters:
  • modname – The module name in the scope

  • message – The display message in case of error

Raises:

ImportError – If the module cannot be imported at runtime

scrapegraphai.utils.sys_dynamic_import.srcfile_import(modpath: str, modname: str) → types.ModuleType

Imports a Python module from its source file.

Parameters:
  • modpath – The srcfile absolute path

  • modname – The module name in the scope

Returns:

The imported module

Raises:

ImportError – If the module cannot be imported from the srcfile
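
Example

A usage sketch; the path and module name are illustrative:

>>> module = srcfile_import("/abs/path/to/plugin.py", "plugin")
>>> module.__name__
'plugin'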

scrapegraphai.utils.tokenizer module

Module for counting tokens and splitting text into chunks

scrapegraphai.utils.tokenizer.num_tokens_calculus(string: str, llm_model: BaseChatModel) → int

Returns the number of tokens in a text string.

Module contents

__init__.py file for the utils folder.