scrapegraphai.utils package¶
Subpackages¶
Submodules¶
scrapegraphai.utils.cleanup_code module¶
This utility function extracts the code from a given string.
- scrapegraphai.utils.cleanup_code.extract_code(code: str) str ¶
Extracts a code block from a given string.
scrapegraphai.utils.cleanup_html module¶
Module for cleaning up and minifying HTML content.
- scrapegraphai.utils.cleanup_html.cleanup_html(html_content: str, base_url: str) str ¶
Processes HTML content by removing unnecessary tags, minifying the HTML, and extracting the title and body content.
- Parameters:
html_content (str) – The HTML content to be processed.
base_url (str) – The base URL, used to resolve relative links in the document.
- Returns:
A string combining the parsed title and the minified body content. If no body content is found, the returned string indicates this.
- Return type:
str
Example
>>> html_content = "<html><head><title>Example</title></head><body><p>Hello World!</p></body></html>"
>>> cleanup_html(html_content, "https://example.com")
'Title: Example, Body: <body><p>Hello World!</p></body>'
This function is particularly useful for preparing HTML content for environments where bandwidth usage needs to be minimized.
- scrapegraphai.utils.cleanup_html.minify()¶
- scrapegraphai.utils.cleanup_html.minify_html(html)¶
Minifies the given HTML string.
- scrapegraphai.utils.cleanup_html.reduce_html(html, reduction)¶
Reduces the size of the HTML content based on the specified level of reduction.
- Parameters:
html (str) – The HTML content to reduce.
reduction (int) – The level of reduction to apply to the HTML content. 0: minification only; 1: minification plus removal of unnecessary tags and attributes; 2: minification, removal of unnecessary tags and attributes, simplification of text content, and removal of the head tag.
- Returns:
The reduced HTML content based on the specified reduction level.
- Return type:
str
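Example (an illustrative sketch; the exact output depends on the input markup):
>>> from scrapegraphai.utils.cleanup_html import reduce_html
>>> html = "<html><head><title>T</title></head><body><p style='color:red'>Hello</p></body></html>"
>>> minified = reduce_html(html, 0)   # level 0: minification only
>>> bare = reduce_html(html, 2)       # level 2: also strips attributes, simplifies text, drops <head>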
scrapegraphai.utils.code_error_analysis module¶
This module contains the functions that generate prompts for various types of code error analysis.
Functions:
- syntax_focused_analysis: Focuses on syntax-related errors in the generated code.
- execution_focused_analysis: Focuses on execution-related errors, including generated code and HTML analysis.
- validation_focused_analysis: Focuses on validation-related errors, considering the JSON schema and execution result.
- semantic_focused_analysis: Focuses on semantic differences in generated code based on a comparison result.
- scrapegraphai.utils.code_error_analysis.execution_focused_analysis(state: dict, llm_model) str ¶
Analyzes the execution errors in the generated code and HTML code.
- Parameters:
state (dict) – Contains the ‘generated_code’, ‘errors’, ‘html_code’, and ‘html_analysis’.
llm_model – The language model used for generating the analysis.
- Returns:
The result of the execution error analysis.
- Return type:
str
- scrapegraphai.utils.code_error_analysis.semantic_focused_analysis(state: dict, comparison_result: Dict[str, Any], llm_model) str ¶
Analyzes the semantic differences in the generated code based on a comparison result.
- Parameters:
state (dict) – Contains the ‘generated_code’.
comparison_result (Dict[str, Any]) – Contains the 'differences' and 'explanation' of the comparison.
llm_model – The language model used for generating the analysis.
- Returns:
The result of the semantic error analysis.
- Return type:
str
- scrapegraphai.utils.code_error_analysis.syntax_focused_analysis(state: dict, llm_model) str ¶
Analyzes the syntax errors in the generated code.
- Parameters:
state (dict) – Contains the ‘generated_code’ and ‘errors’ related to syntax.
llm_model – The language model used for generating the analysis.
- Returns:
The result of the syntax error analysis.
- Return type:
str
- scrapegraphai.utils.code_error_analysis.validation_focused_analysis(state: dict, llm_model) str ¶
Analyzes the validation errors in the generated code based on a JSON schema.
- Parameters:
state (dict) – Contains the 'generated_code', 'errors', 'json_schema', and 'execution_result'.
llm_model – The language model used for generating the analysis.
- Returns:
The result of the validation error analysis.
- Return type:
str
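Example (a hedged sketch; ChatOpenAI stands in for any LangChain-compatible chat model, and the exact shape of the 'errors' entry is an assumption):
>>> from langchain_openai import ChatOpenAI
>>> from scrapegraphai.utils.code_error_analysis import syntax_focused_analysis
>>> llm_model = ChatOpenAI(model="gpt-4o-mini")
>>> state = {
...     "generated_code": "def parse(:\n    return data",
...     "errors": {"syntax": "SyntaxError: invalid syntax (line 1)"},
... }
>>> analysis = syntax_focused_analysis(state, llm_model)  # the analysis returned as a string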
scrapegraphai.utils.code_error_correction module¶
This module contains the functions for code generation to correct different types of errors.
Functions:
- syntax_focused_code_generation: Generates corrected code based on syntax error analysis.
- execution_focused_code_generation: Generates corrected code based on execution error analysis.
- validation_focused_code_generation: Generates corrected code based on validation error analysis, considering the JSON schema.
- semantic_focused_code_generation: Generates corrected code based on semantic error analysis, comparing generated and reference results.
- scrapegraphai.utils.code_error_correction.execution_focused_code_generation(state: dict, analysis: str, llm_model) str ¶
Generates corrected code based on execution error analysis.
- Parameters:
state (dict) – Contains the ‘generated_code’.
analysis (str) – The analysis of the execution errors.
llm_model – The language model used for generating the corrected code.
- Returns:
The corrected code.
- Return type:
str
- scrapegraphai.utils.code_error_correction.semantic_focused_code_generation(state: dict, analysis: str, llm_model) str ¶
Generates corrected code based on semantic error analysis.
- Parameters:
state (dict) – Contains the ‘generated_code’, ‘execution_result’, and ‘reference_answer’.
analysis (str) – The analysis of the semantic differences.
llm_model – The language model used for generating the corrected code.
- Returns:
The corrected code.
- Return type:
str
- scrapegraphai.utils.code_error_correction.syntax_focused_code_generation(state: dict, analysis: str, llm_model) str ¶
Generates corrected code based on syntax error analysis.
- Parameters:
state (dict) – Contains the ‘generated_code’.
analysis (str) – The analysis of the syntax errors.
llm_model – The language model used for generating the corrected code.
- Returns:
The corrected code.
- Return type:
str
- scrapegraphai.utils.code_error_correction.validation_focused_code_generation(state: dict, analysis: str, llm_model) str ¶
Generates corrected code based on validation error analysis.
- Parameters:
state (dict) – Contains the ‘generated_code’ and ‘json_schema’.
analysis (str) – The analysis of the validation errors.
llm_model – The language model used for generating the corrected code.
- Returns:
The corrected code.
- Return type:
str
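The analysis and correction helpers are designed to be chained. A minimal sketch, reusing the state, analysis, and llm_model names from the sketch in the previous module:
>>> from scrapegraphai.utils.code_error_correction import syntax_focused_code_generation
>>> corrected = syntax_focused_code_generation(state, analysis, llm_model)  # corrected code as a string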
scrapegraphai.utils.convert_to_md module¶
convert_to_md module
- scrapegraphai.utils.convert_to_md.convert_to_md(html: str, url: Optional[str] = None) str ¶
Convert HTML to Markdown.
This function uses the html2text library to convert the provided HTML content to Markdown format and returns the converted Markdown content as a string.
- Parameters:
html (str) – The HTML content to be converted.
url (Optional[str]) – An optional reference URL for the page being converted.
- Returns:
The equivalent Markdown content.
- Return type:
str
Example
>>> convert_to_md("<html><body><p>This is a paragraph.</p><h1>This is a heading.</h1></body></html>")
'This is a paragraph.
# This is a heading.'
Note: All styles and links are ignored during the conversion.
scrapegraphai.utils.copy module¶
copy module
- exception scrapegraphai.utils.copy.DeepCopyError¶
Bases:
Exception
Custom exception raised when an object cannot be deep-copied.
- scrapegraphai.utils.copy.is_boto3_client(obj)¶
Determines whether the given object is a boto3 client.
- scrapegraphai.utils.copy.safe_deepcopy(obj: Any) Any ¶
Safely create a deep copy of an object, handling special cases.
- Parameters:
obj – Object to copy
- Returns:
Deep copy of the object
- Raises:
DeepCopyError – If object cannot be deep copied
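Example (a minimal defensive-usage sketch; the config dictionary is illustrative):
>>> from scrapegraphai.utils.copy import DeepCopyError, safe_deepcopy
>>> config = {"prompt": "extract prices", "retries": 3}
>>> try:
...     config_copy = safe_deepcopy(config)
... except DeepCopyError:
...     config_copy = config  # fall back to sharing the original reference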
scrapegraphai.utils.custom_callback module¶
Custom callback for LLM token usage statistics.
This module has been taken and modified from the OpenAI callback manager in langchain-community. https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/callbacks/openai_info.py
- class scrapegraphai.utils.custom_callback.CustomCallbackHandler(llm_model_name: str)¶
Bases:
BaseCallbackHandler
Callback Handler that tracks LLMs info.
- property always_verbose: bool¶
Whether to call verbose callbacks even if verbose is False.
- completion_tokens: int = 0¶
- on_llm_end(response: LLMResult, **kwargs: Any) None ¶
Collect token usage.
- on_llm_new_token(token: str, **kwargs: Any) None ¶
Print out the token.
- on_llm_start(serialized: Dict[str, Any], prompts: List[str], **kwargs: Any) None ¶
Print out the prompts.
- prompt_tokens: int = 0¶
- successful_requests: int = 0¶
- total_cost: float = 0.0¶
- total_tokens: int = 0¶
- scrapegraphai.utils.custom_callback.get_custom_callback(llm_model_name: str)¶
Function to get custom callback for LLM token usage statistics.
- scrapegraphai.utils.custom_callback.get_token_cost_for_model(model_name: str, num_tokens: int, is_completion: bool = False) float ¶
Get the cost in USD for a given model and number of tokens.
- Parameters:
model_name – Name of the model
num_tokens – Number of tokens.
is_completion – Whether the model is used for completion or not. Defaults to False.
- Returns:
Cost in USD.
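Example (a hedged sketch; it assumes get_custom_callback behaves like the LangChain callback context manager this module is adapted from, and llm_model is any configured chat model):
>>> from scrapegraphai.utils.custom_callback import get_custom_callback, get_token_cost_for_model
>>> with get_custom_callback("open-mistral-7b") as cb:
...     response = llm_model.invoke("Summarize this page")
>>> cb.total_tokens, cb.total_cost  # aggregated usage and cost in USD
>>> get_token_cost_for_model("open-mistral-7b", 1500)  # standalone lookup: cost of 1500 input tokens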
scrapegraphai.utils.data_export module¶
data_export module
This module provides functions to export data to various file formats.
- scrapegraphai.utils.data_export.export_to_csv(data: List[Dict[str, Any]], filename: str) None ¶
Export data to a CSV file.
- Parameters:
data – List of dictionaries containing the data to export
filename – Name of the file to save the CSV data
- scrapegraphai.utils.data_export.export_to_json(data: List[Dict[str, Any]], filename: str) None ¶
Export data to a JSON file.
- Parameters:
data – List of dictionaries containing the data to export
filename – Name of the file to save the JSON data
- scrapegraphai.utils.data_export.export_to_xml(data: List[Dict[str, Any]], filename: str, root_element: str = 'data') None ¶
Export data to an XML file.
- Parameters:
data – List of dictionaries containing the data to export
filename – Name of the file to save the XML data
root_element – Name of the root element in the XML structure
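Example (a minimal sketch exporting the same records to each supported format; file names are illustrative):
>>> from scrapegraphai.utils.data_export import export_to_csv, export_to_json, export_to_xml
>>> records = [{"title": "Example", "price": "9.99"}, {"title": "Sample", "price": "4.50"}]
>>> export_to_csv(records, "products.csv")
>>> export_to_json(records, "products.json")
>>> export_to_xml(records, "products.xml", root_element="products")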
scrapegraphai.utils.dict_content_compare module¶
This module contains utility functions for comparing the content of two dictionaries.
Functions:
- normalize_dict: Recursively normalizes the values in a dictionary, converting strings to lowercase and stripping whitespace.
- normalize_list: Recursively normalizes the values in a list, converting strings to lowercase and stripping whitespace.
- are_content_equal: Compares two dictionaries for semantic equality after normalization.
- scrapegraphai.utils.dict_content_compare.are_content_equal(generated_result: Dict[str, Any], reference_result: Dict[str, Any]) bool ¶
Compares two dictionaries for semantic equality after normalization.
- Parameters:
generated_result (Dict[str, Any]) – The generated result dictionary.
reference_result (Dict[str, Any]) – The reference result dictionary.
- Returns:
True if the normalized dictionaries are equal, False otherwise.
- Return type:
bool
- scrapegraphai.utils.dict_content_compare.normalize_dict(d: Dict[str, Any]) Dict[str, Any] ¶
Recursively normalizes the values in a dictionary.
- Parameters:
d (Dict[str, Any]) – The dictionary to normalize.
- Returns:
A normalized dictionary with strings converted to lowercase and stripped of whitespace.
- Return type:
Dict[str, Any]
- scrapegraphai.utils.dict_content_compare.normalize_list(lst: List[Any]) List[Any] ¶
Recursively normalizes the values in a list.
- Parameters:
lst (List[Any]) – The list to normalize.
- Returns:
A normalized list with strings converted to lowercase and stripped of whitespace.
- Return type:
List[Any]
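Example (a minimal sketch; normalization makes the comparison insensitive to case and surrounding whitespace):
>>> from scrapegraphai.utils.dict_content_compare import are_content_equal
>>> generated = {"title": "  Example Page  ", "tags": ["News", "Tech"]}
>>> reference = {"title": "example page", "tags": ["news", "tech"]}
>>> are_content_equal(generated, reference)
True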
scrapegraphai.utils.llm_callback_manager module¶
This module provides a custom callback manager for LLM models.
Classes: - CustomLLMCallbackManager: Manages exclusive access to callbacks for different types of LLM models.
- class scrapegraphai.utils.llm_callback_manager.CustomLLMCallbackManager¶
Bases:
object
CustomLLMCallbackManager class provides a mechanism to acquire a callback for LLM models in an exclusive, thread-safe manner.
Attributes:
- _lock (threading.Lock): Ensures that only one callback can be acquired at a time.
Methods:
- exclusive_get_callback: A context manager that yields the appropriate callback based on the LLM model and its name, ensuring exclusive access to the callback.
- exclusive_get_callback(llm_model, llm_model_name)¶
Provides an exclusive callback for the LLM model in a thread-safe manner.
- Parameters:
llm_model – The LLM model instance (e.g., ChatOpenAI, AzureChatOpenAI, ChatBedrock).
llm_model_name (str) – The name of the LLM model, used for model-specific callbacks.
- Yields:
The appropriate callback for the LLM model, or None if the lock is unavailable.
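Example (a hedged sketch; llm_model stands in for any supported chat model instance):
>>> from scrapegraphai.utils.llm_callback_manager import CustomLLMCallbackManager
>>> manager = CustomLLMCallbackManager()
>>> with manager.exclusive_get_callback(llm_model, "gpt-4o-mini") as cb:
...     if cb is not None:  # None means the lock was unavailable
...         response = llm_model.invoke("Extract the page title")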
scrapegraphai.utils.logging module¶
A centralized logging system for any library.
This module provides functions to manage logging for a library: getting and setting the verbosity level, adding and removing handlers, and controlling propagation. It also includes a function to set formatting for all handlers bound to the root logger.
Source code inspired by: https://gist.github.com/DiTo97/9a0377f24236b66134eb96da1ec1693f
- scrapegraphai.utils.logging.get_logger(name: Optional[str] = None) Logger ¶
Get a logger with the specified name.
If no name is provided, the root logger for the library is returned.
- Parameters:
name (Optional[str]) – The name of the logger. If None, the root logger for the library is returned.
- Returns:
The logger with the specified name.
- Return type:
logging.Logger
- scrapegraphai.utils.logging.get_verbosity() int ¶
Get the current verbosity level of the root logger for the library.
- Returns:
The current verbosity level of the root logger for the library.
- Return type:
int
- scrapegraphai.utils.logging.setDEFAULT_HANDLER() None ¶
Add the default handler to the root logger for the library.
- scrapegraphai.utils.logging.set_formatting() None ¶
Set formatting for all handlers bound to the root logger for the library.
The formatting is set to: “[levelname|filename:lineno] time >> message”
- scrapegraphai.utils.logging.set_handler(handler: Handler) None ¶
Add a handler to the root logger for the library.
- Parameters:
handler (logging.Handler) – The handler to add.
- scrapegraphai.utils.logging.set_propagation() None ¶
Enable propagation of the root logger for the library.
- scrapegraphai.utils.logging.set_verbosity(verbosity: int) None ¶
Set the verbosity level of the root logger for the library.
- Parameters:
verbosity (int) – The verbosity level to set.
- scrapegraphai.utils.logging.set_verbosity_debug() None ¶
Set the verbosity level of the root logger for the library to DEBUG.
- scrapegraphai.utils.logging.set_verbosity_error() None ¶
Set the verbosity level of the root logger for the library to ERROR.
- scrapegraphai.utils.logging.set_verbosity_fatal() None ¶
Set the verbosity level of the root logger for the library to FATAL.
- scrapegraphai.utils.logging.set_verbosity_info() None ¶
Set the verbosity level of the root logger for the library to INFO.
- scrapegraphai.utils.logging.set_verbosity_warning() None ¶
Set the verbosity level of the root logger for the library to WARNING.
- scrapegraphai.utils.logging.unsetDEFAULT_HANDLER() None ¶
Remove the default handler from the root logger for the library.
- scrapegraphai.utils.logging.unset_formatting() None ¶
Remove formatting for all handlers bound to the root logger for the library.
- scrapegraphai.utils.logging.unset_handler(handler: Handler) None ¶
Remove a handler from the root logger for the library.
- Parameters:
handler (logging.Handler) – The handler to remove.
- scrapegraphai.utils.logging.unset_propagation() None ¶
Disable propagation of the root logger for the library.
- scrapegraphai.utils.logging.warning_once(self, *args, **kwargs)¶
Emit a warning log with the same message only once.
This function is added as a method to the logging.Logger class. It emits a warning log with the same message only once, even if it is called multiple times with the same message.
- Parameters:
*args – The arguments to pass to the logging.Logger.warning method.
**kwargs – The keyword arguments to pass to the logging.Logger.warning method.
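Example (a minimal configuration sketch using the helpers above):
>>> from scrapegraphai.utils import logging as sg_logging
>>> sg_logging.set_verbosity_debug()  # root library logger at DEBUG
>>> sg_logging.set_formatting()       # "[levelname|filename:lineno] time >> message"
>>> logger = sg_logging.get_logger(__name__)
>>> logger.warning_once("option 'x' is deprecated")  # emitted only once per message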
scrapegraphai.utils.model_costs module¶
Cost per 1k input tokens, in USD.
- scrapegraphai.utils.model_costs.MODEL_COST_PER_1K_TOKENS_INPUT = {'a121.ju-mid-v1': 0.0125, 'a121.ju-ultra-v1': 0.0188, 'ai21.jamba-instruct-v1:0': 0.0005, 'amazon.titan-text-express-v1': 0.0002, 'amazon.titan-text-lite-v1': 0.00015, 'amazon.titan-text-premier-v1:0': 0.0005, 'codestral': 0.0002, 'codestral-2405': 0.0002, 'cohere.command-light-text-v14': 0.0003, 'cohere.command-r-plus-v1:0': 0.003, 'cohere.command-r-v1:0': 0.0005, 'cohere.command-text-v14': 0.0015, 'meta.llama2-13b-chat-v1': 0.00075, 'meta.llama2-70b-chat-v1': 0.00195, 'meta.llama3-1-405b-instruct-v1:0': 0.00532, 'meta.llama3-1-70b-instruct-v1:0': 0.00099, 'meta.llama3-1-8b-instruct-v1:0': 0.00022, 'meta.llama3-70b-instruct-v1:0': 0.00265, 'meta.llama3-8b-instruct-v1:0': 0.0003, 'mistral-large': 0.002, 'mistral-large-2407': 0.002, 'mistral-medium-latest': 0.00275, 'mistral-small': 0.0002, 'mistral-small-2409': 0.0002, 'mistral-small-latest': 0.001, 'mistral.mistral-7b-instruct-v0:2': 0.00015, 'mistral.mistral-large-2402-v1:0': 0.004, 'mistral.mistral-large-2407-v1:0': 0.002, 'mistral.mistral-small-2402-v1:0': 0.001, 'mistral.mixtral-7x8b-instruct-v0:1': 0.00045, 'open-mistral-7b': 0.00025, 'open-mistral-nemo': 0.00015, 'open-mistral-nemo-2407': 0.00015, 'open-mixtral-8x22b': 0.002, 'open-mixtral-8x7b': 0.0007, 'pixtral-12b': 0.00015, 'pixtral-12b-2409': 0.00015}¶
Cost per 1k output tokens, in USD.
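Example (a minimal sketch using the input-cost table above; token counts are illustrative):
>>> from scrapegraphai.utils.model_costs import MODEL_COST_PER_1K_TOKENS_INPUT
>>> rate = MODEL_COST_PER_1K_TOKENS_INPUT["open-mistral-7b"]  # 0.00025 USD per 1k input tokens
>>> input_cost = 2500 / 1000 * rate  # 0.000625 USD for 2,500 input tokens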
scrapegraphai.utils.output_parser module¶
scrapegraphai.utils.parse_state_keys module¶
parse_state_keys module
- scrapegraphai.utils.parse_state_keys.parse_expression(expression, state: dict) list ¶
Parses a complex boolean expression involving state keys.
- Parameters:
expression (str) – The boolean expression to parse.
state (dict) – Dictionary of state keys used to evaluate the expression.
- Raises:
ValueError – If the expression is empty, has adjacent state keys without operators, uses operators invalidly, has unbalanced parentheses, or if no state keys match the expression.
- Returns:
A list of state keys that match the boolean expression, ensuring each key appears only once.
- Return type:
list
Example
>>> parse_expression(
...     "user_input & (relevant_chunks | parsed_document | document)",
...     {"user_input": None, "document": None, "parsed_document": None, "relevant_chunks": None}
... )
['user_input', 'relevant_chunks', 'parsed_document', 'document']
This function evaluates the expression to determine the logical inclusion of state keys based on provided boolean logic. It checks for syntax errors such as unbalanced parentheses, incorrect adjacency of operators, and empty expressions.
scrapegraphai.utils.prettify_exec_info module¶
Prettify the execution information of the graph.
- scrapegraphai.utils.prettify_exec_info.prettify_exec_info(complete_result: list[dict]) DataFrame ¶
Transforms the execution information of a graph into a DataFrame for enhanced visualization.
- Parameters:
complete_result (list[dict]) – The complete execution information of the graph.
- Returns:
A DataFrame that organizes the execution information for better readability and analysis.
- Return type:
pd.DataFrame
Example
>>> prettify_exec_info([{'node': 'A', 'status': 'success'}, {'node': 'B', 'status': 'failure'}])
A DataFrame with columns 'node' and 'status' showing the execution result for each node.
scrapegraphai.utils.proxy_rotation module¶
Module for rotating proxies
- class scrapegraphai.utils.proxy_rotation.Proxy¶
Bases:
dict
Proxy server information.
- bypass: str¶
- criteria: ProxyBrokerCriteria¶
- password: str¶
- server: str¶
- username: str¶
- class scrapegraphai.utils.proxy_rotation.ProxyBrokerCriteria¶
Bases:
TypedDict
Proxy broker criteria.
- anonymous: bool¶
- countryset: Set[str]¶
- search_outside_if_empty: bool¶
- secure: bool¶
- timeout: float¶
- class scrapegraphai.utils.proxy_rotation.ProxySettings¶
Bases:
TypedDict
Proxy settings.
- bypass: str¶
- password: str¶
- server: str¶
- username: str¶
- scrapegraphai.utils.proxy_rotation.is_ipv4_address(address: str) bool ¶
Checks whether a proxy address is a valid IPv4 address.
- scrapegraphai.utils.proxy_rotation.parse_or_search_proxy(proxy: Proxy) ProxySettings ¶
Parses a proxy configuration or searches for a new one matching the specified broker criteria.
- Parameters:
proxy – The proxy configuration to parse or search for.
- Returns:
A ‘playwright’ compliant proxy configuration.
Notes
- If the proxy server is an IP address, it is assumed to be a proxy server address.
- If the proxy server is 'broker', a proxy server is searched for based on the provided broker criteria.
Example
>>> proxy = { ... "server": "broker", ... "criteria": { ... "anonymous": True, ... "countryset": {"GB", "US"}, ... "secure": True, ... "timeout": 5.0 ... "search_outside_if_empty": False ... } ... }
>>> parse_or_search_proxy(proxy) { "server": "<proxy-server-matching-criteria>", }
Example
>>> proxy = { ... "server": "192.168.1.1:8080", ... "username": "<username>", ... "password": "<password>" ... }
>>> parse_or_search_proxy(proxy) { "server": "192.168.1.1:8080", "username": "<username>", "password": "<password>" }
- scrapegraphai.utils.proxy_rotation.search_proxy_servers(anonymous: bool = True, countryset: Optional[Set[str]] = None, secure: bool = False, timeout: float = 5.0, max_shape: int = 5, search_outside_if_empty: bool = True) List[str] ¶
Searches for proxy servers that match the specified broker criteria.
- Parameters:
anonymous – whether proxy servers should have minimum level-1 anonymity.
countryset – admissible proxy servers locations.
secure – whether proxy servers should support HTTPS; defaults to False (HTTP only).
timeout – The maximum timeout for proxy responses; defaults to 5.0 seconds.
max_shape – The maximum number of proxy servers to return; defaults to 5.
search_outside_if_empty – whether countryset should be extended if empty.
- Returns:
A list of proxy server URLs matching the criteria.
Example
>>> search_proxy_servers(
...     anonymous=True,
...     countryset={"GB", "US"},
...     secure=True,
...     timeout=1.0,
...     max_shape=2
... )
['http://103.10.63.135:8080', 'http://113.20.31.250:8080']
scrapegraphai.utils.research_web module¶
scrapegraphai.utils.save_audio_from_bytes module¶
This utility function saves the byte response as an audio file.
- scrapegraphai.utils.save_audio_from_bytes.save_audio_from_bytes(byte_response: bytes, output_path: Union[str, Path]) None ¶
Saves the byte response as an audio file to the specified path.
- Parameters:
byte_response (bytes) – The byte array containing audio data.
output_path (Union[str, Path]) – The destination file path where the audio file will be saved.
Example
>>> save_audio_from_bytes(b'audio data', 'path/to/audio.mp3')
This function writes the byte array containing audio data to a file, saving it as an audio file.
scrapegraphai.utils.save_code_to_file module¶
save_code_to_file module
- scrapegraphai.utils.save_code_to_file.save_code_to_file(code: str, filename: str) None ¶
Saves the generated code to a Python file.
- Parameters:
code (str) – The generated code to be saved.
filename (str) – The name of the output file.
scrapegraphai.utils.schema_trasform module¶
This utility function transforms a Pydantic schema into a more comprehensible schema.
- scrapegraphai.utils.schema_trasform.transform_schema(pydantic_schema)¶
Transform the pydantic schema into a more comprehensible JSON schema.
- Parameters:
pydantic_schema (dict) – The pydantic schema.
- Returns:
The transformed JSON schema.
- Return type:
dict
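Example (a hedged sketch; it assumes the input is the dict produced by Pydantic's schema export, and the exact simplified output shape is left to the function):
>>> from pydantic import BaseModel
>>> from scrapegraphai.utils.schema_trasform import transform_schema
>>> class Product(BaseModel):
...     name: str
...     price: float
>>> simplified = transform_schema(Product.model_json_schema())  # Pydantic v2; use Product.schema() on v1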
scrapegraphai.utils.split_text_into_chunks module¶
split_text_into_chunks module
- scrapegraphai.utils.split_text_into_chunks.split_text_into_chunks(text: str, chunk_size: int, model: BaseChatModel, use_semchunk=True) List[str] ¶
Splits the text into chunks based on the number of tokens.
- Parameters:
text (str) – The text to split.
chunk_size (int) – The maximum number of tokens per chunk.
model (BaseChatModel) – The language model whose tokenizer is used to count tokens.
use_semchunk (bool) – Whether to use semantic chunking when splitting; defaults to True.
- Returns:
A list of text chunks.
- Return type:
List[str]
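Example (a minimal sketch; ChatOpenAI stands in for any BaseChatModel and the chunk size is illustrative):
>>> from langchain_openai import ChatOpenAI
>>> from scrapegraphai.utils.split_text_into_chunks import split_text_into_chunks
>>> model = ChatOpenAI(model="gpt-4o-mini")
>>> chunks = split_text_into_chunks("long scraped text ... " * 1000, chunk_size=512, model=model)
>>> all(len(c) > 0 for c in chunks)  # a list of non-empty string chunks
True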
scrapegraphai.utils.sys_dynamic_import module¶
High-level module for dynamic importing of Python modules at runtime.
Source code inspired by https://gist.github.com/DiTo97/46f4b733396b8d7a8f1d4d22db902cfc
- scrapegraphai.utils.sys_dynamic_import.dynamic_import(modname: str, message: str = '') None ¶
Imports a Python module at runtime.
- Parameters:
modname – The module name in the scope
message – The display message in case of error
- Raises:
ImportError – If the module cannot be imported at runtime
- scrapegraphai.utils.sys_dynamic_import.srcfile_import(modpath: str, modname: str) types.ModuleType ¶
Imports a Python module from its source file.
- Parameters:
modpath – The absolute path of the module's source file.
modname – The module name in the scope
- Returns:
The imported module
- Raises:
ImportError – If the module cannot be imported from the srcfile
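Example (a minimal sketch; the module name and source path are illustrative):
>>> from scrapegraphai.utils.sys_dynamic_import import dynamic_import, srcfile_import
>>> dynamic_import("playwright", message="playwright is required for browser-based fetching")
>>> plugin = srcfile_import("/abs/path/to/plugin.py", "plugin")  # both raise ImportError on failure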
scrapegraphai.utils.tokenizer module¶
Module for counting tokens and splitting text into chunks
- scrapegraphai.utils.tokenizer.num_tokens_calculus(string: str, llm_model: BaseChatModel) int ¶
Returns the number of tokens in a text string.
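Example (a minimal sketch; ChatOpenAI stands in for any BaseChatModel):
>>> from langchain_openai import ChatOpenAI
>>> from scrapegraphai.utils.tokenizer import num_tokens_calculus
>>> num_tokens_calculus("Hello scraping world", ChatOpenAI(model="gpt-4o-mini"))  # returns an int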
Module contents¶
__init__.py file for utils folder