Example: Project Alpha - Web Page Link Scraping
This example demonstrates the web_page_link_scraping category within the ingestion stage of the dept_project_alpha department. It showcases fetching HTML from a web page, extracting links, identifying the canonical URL, and creating a summary manifest (local JSON file) and a Prefect UI artifact.
This category illustrates how to handle unstructured web data and extract meaningful information like canonical links for deduplication or tracking purposes.
Key Features Demonstrated
- Fetching HTML content from a web page using `httpx` via core logic functions.
- Parsing HTML using `BeautifulSoup` to extract links.
- Identifying the canonical URL (`<link rel="canonical">`) from the HTML.
- Hashing the canonical URL (using `hashlib`) for potential deduplication or as a unique identifier for the page's primary representation.
- Saving a local JSON manifest file with scrape metadata (original URL, canonical URL, its hash, extracted links).
- Creating a Prefect Markdown artifact to display the scraped information in the UI.
- Usage of department-specific Prefect task wrappers calling functions from `airnub_prefect_starter/core/`.
File Structure
The relevant files for this example include:
- Category Flow:
  - `flows/dept_project_alpha/ingestion/web_page_link_scraping/ingest_web_page_link_scraping_flow_dept_project_alpha.py`
- Task Wrappers (Department-Specific):
  - `flows/dept_project_alpha/ingestion/web_page_link_scraping/tasks/fetch_html_content_of_web_page_task_dept_project_alpha.py`
  - `flows/dept_project_alpha/ingestion/web_page_link_scraping/tasks/extract_links_from_html_task_dept_project_alpha.py`
  - (A new task wrapper, e.g.) `create_scraped_page_manifest_and_artifact_task_dept_project_alpha.py`
- Category Configuration (YAML for Prefect Variable):
  - `configs/variables/dept_project_alpha/ingestion/web_page_link_scraping/ingest_web_page_link_scraping_config_dept_project_alpha.yaml`
- Core Logic (in `airnub_prefect_starter` package):
  - `airnub_prefect_starter/core/web_utils.py` (containing `core_fetch_html_content` and `core_parse_links_and_canonical_from_html`)
  - `airnub_prefect_starter/common/utils.py` (e.g., `generate_sha256_hash_from_string`)
  - `airnub_prefect_starter/core/artifact_creators.py` (containing `save_scraped_page_manifest_locally` and `core_create_scraped_page_artifact`)
  - `airnub_prefect_starter/core/manifest_models.py` (e.g., `DemoScrapedPageManifestEntry` Pydantic model)
Configuration Example
The ingest_web_page_link_scraping_config_dept_project_alpha.yaml file defines parameters for this category. This YAML content becomes part of the dept_project_alpha_ingestion_web_page_link_scraping_config Prefect Variable.
An example configuration:
# configs/variables/dept_project_alpha/ingestion/web_page_link_scraping/ingest_web_page_link_scraping_config_dept_project_alpha.yaml
data_source_name: "project_alpha_web_pages_demo"
# Base path inside worker for storing local JSON manifests for scraped pages
local_manifests_storage_base: "/app/local_demo_artifacts/web_scrape_manifests"
pages_to_scrape:
  - name: "Prefect Blog Main Page"
    url: "https://www.prefect.io/blog/" # Example: A page likely to have a canonical URL and links
  - name: "Python Org About Page"
    url: "https://www.python.org/about/"
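Because the category flow receives this configuration as a plain dictionary, it can help to validate its shape before looping over pages. The following is a minimal sketch under assumptions: `PageToScrape`, `ScrapeConfig`, and `parse_scrape_config` are hypothetical names used for illustration and are not part of `airnub_prefect_starter`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PageToScrape:
    name: str
    url: str


@dataclass(frozen=True)
class ScrapeConfig:
    data_source_name: str
    local_manifests_storage_base: str
    pages_to_scrape: tuple[PageToScrape, ...]


def parse_scrape_config(raw: dict[str, Any]) -> ScrapeConfig:
    """Validate the category config dict and convert it to typed objects (hypothetical helper)."""
    pages = []
    for entry in raw.get("pages_to_scrape", []):
        url = entry["url"]
        # Only HTTP(S) pages make sense for this category.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"Unsupported URL for page {entry.get('name')!r}: {url}")
        pages.append(PageToScrape(name=entry["name"], url=url))
    return ScrapeConfig(
        data_source_name=raw["data_source_name"],
        local_manifests_storage_base=raw["local_manifests_storage_base"],
        pages_to_scrape=tuple(pages),
    )


# The same values as the YAML example above, as the flow would receive them.
example_config = {
    "data_source_name": "project_alpha_web_pages_demo",
    "local_manifests_storage_base": "/app/local_demo_artifacts/web_scrape_manifests",
    "pages_to_scrape": [
        {"name": "Prefect Blog Main Page", "url": "https://www.prefect.io/blog/"},
        {"name": "Python Org About Page", "url": "https://www.python.org/about/"},
    ],
}
config = parse_scrape_config(example_config)
```

Failing early on a missing key or a non-HTTP URL keeps a configuration mistake from surfacing halfway through a run.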
Flow Logic Overview
The ingest_web_page_link_scraping_flow_dept_project_alpha.py (the category flow for "Web Page Link Scraping" within "Project Alpha's" ingestion stage) typically orchestrates the following steps for each configured page URL:
- Load Configuration:
  - The flow receives its specific configuration dictionary as a parameter from its parent stage flow (`ingestion_flow_dept_project_alpha.py`). This configuration originates from the `dept_project_alpha_ingestion_web_page_link_scraping_config` Prefect Variable (which was populated from `configs/variables/dept_project_alpha/ingestion/web_page_link_scraping/ingest_web_page_link_scraping_config_dept_project_alpha.yaml`).
  - This config typically includes:
    - `data_source_name`: An identifier for this specific data source (e.g., `project_alpha_web_pages_demo`).
    - `local_manifests_storage_base`: The base path within the worker container where local JSON manifests for scraped pages will be stored (e.g., `/app/local_demo_artifacts/web_scrape_manifests`). This path is often derived from `config_ds.DATA_DIR` within the core logic.
    - `pages_to_scrape`: A list of dictionaries, each specifying a page's `name` (for logging/display) and `url` to be scraped.
- Iterate and Process Web Pages: The flow loops through each page definition in the `pages_to_scrape` list. For each page `url`:
  - Fetch HTML: It calls the department-specific task wrapper `fetch_html_content_of_web_page_task_dept_project_alpha`.
    - This task wrapper, located in `flows/dept_project_alpha/ingestion/web_page_link_scraping/tasks/`, calls the core logic function `core_fetch_html_content` (from `airnub_prefect_starter/core/web_utils.py`) to retrieve the raw HTML content of the page.
  - Extract Links & Canonical URL: If HTML content is successfully fetched, it calls the department-specific task wrapper `extract_links_from_html_task_dept_project_alpha`.
    - This task wrapper calls the core logic function `core_parse_links_and_canonical_from_html` (also from `airnub_prefect_starter/core/web_utils.py`).
    - This core function parses the HTML (using `BeautifulSoup`), identifies the canonical URL (if present), hashes the canonical URL string (using `generate_sha256_hash_from_string` from `common/utils.py`), and extracts all valid HTTP/HTTPS links from `<a>` tags.
    - It returns a dictionary (`scrape_result`) containing the original URL, the found canonical URL, the hash of the canonical URL, and the list of extracted links.
  - Create Local Manifest and Prefect UI Artifact: The flow then typically calls another department-specific task (e.g., `create_scraped_page_manifest_and_artifact_task_dept_project_alpha`).
    - This task wrapper would orchestrate calls to:
      - `save_scraped_page_manifest_locally` (from `airnub_prefect_starter/core/artifact_creators.py` or `web_utils.py`): This function takes the `scrape_result` dictionary, populates a `DemoScrapedPageManifestEntry` Pydantic model (from `core/manifest_models.py`), serializes it to JSON, and saves it to a local file within the worker. The storage path for this JSON manifest is structured using the `data_source_name` and the `canonical_url_hash` (e.g., `<local_manifests_storage_base>/<data_source_name>/scraped_pages_manifests/<canonical_url_hash>/<original_url_slugified>.manifest.json`).
      - `core_create_scraped_page_artifact` (from `airnub_prefect_starter/core/artifact_creators.py`): This function uses the `scrape_result` dictionary and the path to the saved local JSON manifest to generate a Prefect Markdown artifact in the UI.
- Core Logic Functions (highlights):
  - `airnub_prefect_starter/core/web_utils.py`:
    - `core_fetch_html_content`: Uses `httpx` to get the page's HTML content.
    - `core_parse_links_and_canonical_from_html`: Employs `BeautifulSoup` for parsing HTML, `urljoin` for resolving relative links, and calls a hashing utility for the canonical URL.
  - `airnub_prefect_starter/common/utils.py`:
    - `generate_sha256_hash_from_string`: Hashes the canonical URL string.
    - `sanitize_filename`: Used for creating safe filenames for local manifests.
  - `airnub_prefect_starter/core/artifact_creators.py`:
    - `save_scraped_page_manifest_locally`: Handles the creation and saving of the local JSON manifest file based on the `DemoScrapedPageManifestEntry` model.
    - `core_create_scraped_page_artifact`: Takes the structured scrape data (including the path to the local JSON manifest) and creates an informative Prefect Markdown artifact.
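The parse-and-hash step described above can be sketched roughly as follows. This is not the project's `core_parse_links_and_canonical_from_html`: to stay dependency-free it swaps `BeautifulSoup` for the standard library's `html.parser`, and the function name and the keys of the returned dictionary are assumptions chosen to mirror the `scrape_result` description.

```python
from __future__ import annotations

import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkAndCanonicalParser(HTMLParser):
    """Collects raw href values from <a> tags and the first <link rel="canonical">."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []
        self.canonical: str | None = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        if tag == "a":
            self.hrefs.append(href)
        elif tag == "link" and self.canonical is None:
            # rel may hold several space-separated tokens.
            rel_values = (attr_map.get("rel") or "").lower().split()
            if "canonical" in rel_values:
                self.canonical = href


def parse_links_and_canonical(html: str, page_url: str) -> dict:
    """Hypothetical stand-in for the core parsing function."""
    parser = _LinkAndCanonicalParser()
    parser.feed(html)
    parser.close()

    links = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)

    canonical = urljoin(page_url, parser.canonical) if parser.canonical else None
    canonical_hash = (
        hashlib.sha256(canonical.encode("utf-8")).hexdigest() if canonical else None
    )
    return {
        "original_url": page_url,
        "canonical_url": canonical,
        "canonical_url_hash": canonical_hash,
        "extracted_links": links,
    }


sample_html = (
    '<html><head><link rel="canonical" href="https://example.com/blog/"></head>'
    '<body><a href="/about">About</a><a href="https://example.org/x">X</a>'
    '<a href="mailto:hi@example.com">Mail</a></body></html>'
)
scrape_result = parse_links_and_canonical(sample_html, "https://example.com/blog/?utm=1")
```

In this sample the `mailto:` link is dropped, the relative `/about` link is resolved against the page URL, and the hash is computed over the canonical URL rather than the tracking-parameter URL that was requested, which is what makes the hash usable for deduplication.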
Expected Output
When this "Web Page Link Scraping" category flow for "Project Alpha" runs successfully for a configured web page:
- Logs: Detailed logs in the Prefect UI and worker console, showing the URL being scraped, the canonical URL identified (and its hash), and a summary of the number of links extracted.
- Local JSON Manifest File (in Worker):
  - A JSON file (e.g., `<original_url_slugified>.manifest.json`) will be created in a path within the worker container, structured by data source and canonical URL hash. For example: `/app/local_demo_artifacts/web_scrape_manifests/project_alpha_web_pages_demo/scraped_pages_manifests/<CANONICAL_URL_HASH>/`
  - This JSON file contains structured metadata about the scrape: original URL, canonical URL (and its hash), scrape timestamp, and the full list of extracted links, based on the `DemoScrapedPageManifestEntry` Pydantic model.
  - (If you have volume-mapped a host directory like `./data/project_alpha_demo_outputs/web_scrape_manifests/` to the worker's `/app/local_demo_artifacts/web_scrape_manifests/` via `docker-compose.yml`, you can browse these manifest files on your host.)
- Prefect UI Artifact:
  - A Markdown artifact will appear on the flow run's page in the Prefect UI for each scraped page.
  - This artifact will summarize the key details: original URL, canonical URL, canonical URL hash, a preview of the extracted links, and the worker path to its detailed local JSON manifest file.
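To make the manifest layout concrete, here is a minimal sketch of how such a file could be written. It is an illustration under assumptions, not the project's `save_scraped_page_manifest_locally`: it uses a plain dictionary instead of the `DemoScrapedPageManifestEntry` Pydantic model, `slugify_url` is a hypothetical stand-in for `sanitize_filename`, the `scraped_at_utc` key and the `no_canonical` fallback directory are invented for the example, and the hash shown is a placeholder.

```python
import json
import re
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def slugify_url(url: str) -> str:
    """Hypothetical stand-in for sanitize_filename: keep only filename-safe characters."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    return slug[:150] or "page"


def save_manifest(scrape_result: dict, storage_base: str, data_source_name: str) -> Path:
    """Write one JSON manifest under <base>/<data_source>/scraped_pages_manifests/<hash>/."""
    # Assumption: pages without a canonical URL share a fallback directory.
    hash_dir = scrape_result.get("canonical_url_hash") or "no_canonical"
    manifest_dir = (
        Path(storage_base) / data_source_name / "scraped_pages_manifests" / hash_dir
    )
    manifest_dir.mkdir(parents=True, exist_ok=True)

    file_name = f"{slugify_url(scrape_result['original_url'])}.manifest.json"
    manifest_path = manifest_dir / file_name
    payload = {**scrape_result, "scraped_at_utc": datetime.now(timezone.utc).isoformat()}
    manifest_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return manifest_path


# A temporary directory stands in for local_manifests_storage_base.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = save_manifest(
        {
            "original_url": "https://www.python.org/about/",
            "canonical_url": "https://www.python.org/about/",
            "canonical_url_hash": "a" * 64,  # placeholder, not a real digest
            "extracted_links": ["https://www.python.org/"],
        },
        tmp,
        "project_alpha_web_pages_demo",
    )
    saved = json.loads(manifest_path.read_text(encoding="utf-8"))
```

Grouping manifests by canonical URL hash means that several original URLs resolving to the same canonical page end up side by side in one directory.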
This example illustrates a basic web scraping pipeline, focusing on extracting links and canonical URL information for potential deduplication or tracking. It demonstrates how to manifest this information both locally as a structured JSON file and within the Prefect UI for enhanced observability.
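On the UI side, the artifact body is ordinary Markdown text. The sketch below shows one way to assemble such a summary. `build_scrape_summary_markdown`, its `max_links` parameter, and the dictionary keys are hypothetical, and the real `core_create_scraped_page_artifact` may format things differently.

```python
def build_scrape_summary_markdown(
    scrape_result: dict, manifest_path: str, max_links: int = 5
) -> str:
    """Build a Markdown summary with a short preview of the extracted links."""
    links = scrape_result.get("extracted_links", [])
    lines = [
        "## Scraped page summary",
        "",
        f"- Original URL: {scrape_result['original_url']}",
        f"- Canonical URL: {scrape_result.get('canonical_url') or 'not found'}",
        f"- Canonical URL hash: `{scrape_result.get('canonical_url_hash') or 'n/a'}`",
        f"- Local manifest (worker path): `{manifest_path}`",
        f"- Links extracted: {len(links)}",
        "",
        f"### First {min(max_links, len(links))} links",
    ]
    # Only a preview goes into the artifact; the full list lives in the manifest.
    lines.extend(f"- {link}" for link in links[:max_links])
    return "\n".join(lines)


markdown = build_scrape_summary_markdown(
    {
        "original_url": "https://example.com/blog/?utm=1",
        "canonical_url": "https://example.com/blog/",
        "canonical_url_hash": "a" * 64,  # placeholder, not a real digest
        "extracted_links": [
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/c",
        ],
    },
    manifest_path="/app/local_demo_artifacts/web_scrape_manifests/example.manifest.json",
    max_links=2,
)
```

Inside a Prefect task, a string like this would typically be handed to `create_markdown_artifact` from `prefect.artifacts`. Keeping the string builder separate from that call makes the formatting testable without a running Prefect API.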