
Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export


MarkTechPost · Sana Hassan · January 21, 2026

In this tutorial, we build a full Crawlee-for-Python workflow that covers environment setup, local website generation, static crawling, dynamic crawling, structured extraction, and downstream data processing. We begin by configuring a compatible Crawlee runtime with pinned Pydantic support, Playwright browser installation, persistent storage directories, and Colab-safe execution handling. We then generate a realistic local demo website containing product pages, documentation pages, blog content, internal links, robots.txt rules, JSON-LD metadata, and JavaScript-rendered catalog items. Using BeautifulSoupCrawler, we perform fast recursive HTML crawling and extract page titles, metadata, text previews, outgoing links, product attributes, documentation headings, code blocks, and blog tags. With ParselCrawler, we run precise CSS- and XPath-based extraction on product detail pages. With PlaywrightCrawler, we render JavaScript content in a headless Chromium browser, wait for dynamic DOM elements to appear, extract client-side data, and capture full-page screenshots.

Setting Up the Crawlee Python Runtime and Helpers


# os, re, sys, and csv are used later in the tutorial code.
import os
import re
import sys
import csv
import json
import time
import math
import shutil
import socket
import hashlib
import asyncio
import textwrap
import subprocess
import threading
from pathlib import Path
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler
from importlib.metadata import version, PackageNotFoundError

# Marker file that records a completed dependency installation.
SETUP_SENTINEL = "/content/.crawlee_python_tutorial_setup_done_v2"

def sh(command, check=True, quiet=False):
    """Run a shell command, echo the tail of its output, and report success."""
    print(f"\n$ {command}")
    result = subprocess.run(
        command,
        shell=True,
        text=True,  # decode output so it prints as text
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    if not quiet and result.stdout:
        print(result.stdout[-5000:])
    if check and result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}: {command}")
    return result.returncode == 0

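def package_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

def is_good_pydantic_version(v):
    """Accept only Pydantic 2.11.x, the series pinned for this tutorial."""
    if not v:
        return False
    m = re.match(r"^(\d+)\.(\d+)", v)
    if not m:
        return False
    major, minor = int(m.group(1)), int(m.group(2))
    return major == 2 and minor == 11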

current_crawlee = package_version("crawlee")
current_pydantic = package_version("pydantic")

needs_setup = (
    not os.path.exists(SETUP_SENTINEL)
    or current_crawlee is None
    or not is_good_pydantic_version(current_pydantic)
)

if needs_setup:
    print("PHASE 1: Installing compatible Crawlee + Pydantic + Playwright dependencies.")
    print("After this finishes, Colab will restart automatically. Then run this same cell again.")
    sh(f'{sys.executable} -m pip uninstall -y crawlee pydantic pydantic-core', check=False)
    sh(
        f'{sys.executable} -m pip install -q -U '
        f'"pydantic>=2.11,<2.12" '
        f'"crawlee[all]" '
        f'pandas matplotlib networkx nest_asyncio beautifulsoup4 parsel'
    )
    sh(f'{sys.executable} -m playwright install --with-deps chromium', check=False)
    Path(SETUP_SENTINEL).write_text("done", encoding="utf-8")
    print("\nInstalled versions:")
    sh(f'{sys.executable} -m pip show crawlee pydantic pydantic-core', check=False)
    try:
        # Inside Colab, kill the process so the runtime restarts with the new packages.
        import google.colab
        print("\nRestarting Colab runtime now. After it reconnects, run this same cell again.")
        os.kill(os.getpid(), 9)
    except Exception:
        raise SystemExit("Setup complete. Restart the runtime/kernel manually, then run this cell again.")

print("PHASE 2: Dependencies are ready. Running the Crawlee tutorial.")

import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import nest_asyncio

# Allow nested event loops so asyncio code can run inside a notebook.
nest_asyncio.apply()

TUTORIAL_ROOT = Path("/content/crawlee_python_advanced_tutorial")
SITE_DIR = TUTORIAL_ROOT / "demo_site"
OUTPUT_DIR = TUTORIAL_ROOT / "outputs"
STORAGE_DIR = TUTORIAL_ROOT / "crawlee_storage"
SCREENSHOT_DIR = OUTPUT_DIR / "screenshots"

# Start from a clean slate on every run.
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR]:
    if path.exists():
        shutil.rmtree(path)

for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR, SCREENSHOT_DIR]:
    path.mkdir(parents=True, exist_ok=True)

os.environ["CRAWLEE_STORAGE_DIR"] = str(STORAGE_DIR)
os.environ["CRAWLEE_LOG_LEVEL"] = "INFO"
os.environ["CRAWLEE_PURGE_ON_START"] = "true"

from crawlee import Glob, ConcurrencySettings
from crawlee.crawlers import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
    ParselCrawler,
    ParselCrawlingContext,
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)

try:
    import crawlee
    print("Crawlee version:", crawlee.__version__)
except Exception:
    print("Crawlee imported successfully.")

print("Pydantic version:", package_version("pydantic"))

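def safe_slug(value):
    """Lowercase a value and collapse non-alphanumeric runs into hyphens."""
    value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
    return value or "item"

def money_to_float(value):
    """Strip currency symbols and separators, then parse the number."""
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.]", "", str(value))
    return float(cleaned) if cleaned else None

def normalize_text(value, max_len=None):
    """Collapse whitespace and optionally truncate to max_len characters."""
    value = re.sub(r"\s+", " ", value or "").strip()
    return value[:max_len] if max_len else value

def write_file(path, content):
    """Write dedented text to path, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(textwrap.dedent(content).strip() + "\n", encoding="utf-8")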

We begin by preparing the complete Colab runtime for the Crawlee tutorial. We install compatible versions of Crawlee, Pydantic, Playwright, and the required analysis libraries, and handle the automatic restart required after setup. We then configure storage folders, environment variables, crawler imports, and helper functions to ensure the rest of the workflow runs smoothly.
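To make the version gate concrete, here is a minimal, standalone sketch of a check that accepts only one minor release series. The helper name `in_minor_series` is hypothetical and is not part of the tutorial code; it only illustrates the idea of comparing the leading `major.minor` pair of a version string.

```python
import re

def in_minor_series(version_string, major, minor):
    """Return True when version_string starts with '<major>.<minor>'."""
    if not version_string:
        return False
    match = re.match(r"^(\d+)\.(\d+)", version_string)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) == (major, minor)

# A 2.11.x release passes the gate; 2.12.x or a missing package does not.
print(in_minor_series("2.11.7", 2, 11))
print(in_minor_series("2.12.0", 2, 11))
print(in_minor_series(None, 2, 11))
```

A gate like this lets the setup cell decide whether to reinstall dependencies or move straight on to the crawl.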

Generating the Demo Website and Product Catalog


PRODUCTS = [
    {
        "sku": "CRW-101",
        "name": "Crawler Reliability Kit",
        "category": "automation",
        "price": 149.0,
        "rating": 4.8,
        "stock": 18,
        "features": ["retry policy", "queue replay", "structured logs"],
        "related": ["CRW-202", "CRW-303"],
    },
    {
        "sku": "CRW-202",
        "name": "Playwright Rendering Pack",
        "category": "browser",
        "price": 249.0,
        "rating": 4.7,
        "stock": 9,
        "features": ["headless chromium", "screenshots", "dynamic DOM extraction"],
        "related": ["CRW-101", "CRW-404"],
    },
    {
        "sku": "CRW-303",
        "name": "RAG Extraction Bundle",
        "category": "ai-data",
        "price": 199.0,
        "rating": 4.9,
        "stock": 13,
        "features": ["clean text chunks", "metadata capture", "JSONL export"],
        "related": ["CRW-101", "CRW-505"],
    },
    {
        "sku": "CRW-404",
        "name": "Anti-Fragile Session Toolkit",
        "category": "resilience",
        "price": 299.0,
        "rating": 4.6,
        "stock": 5,
        "features": ["session rotation", "state recovery", "graceful failures"],
        "related": ["CRW-202", "CRW-505"],
    },
    {
        "sku": "CRW-505",
        "name": "Data Export Control Plane",
        "category": "storage",
        "price": 179.0,
        "rating": 4.5,
        "stock": 21,
        "features": ["datasets", "key-value store", "CSV and JSON export"],
        "related": ["CRW-303", "CRW-404"],
    },
]

def layout(title, body, extra_head="", extra_script=""):

font-family: Inter, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;

background: #f7f7fb;

color: #1f2430;

background: #202638;

color: white;

padding: 28px 40px;

color: #dbe7ff;

margin-right: 18px;

text-decoration: none;

font-weight: 600;

max-width: 1050px;

margin: 0 auto;

padding: 32px;

display: grid;

grid-template-columns: repeat(auto-fit, minmax(230px, 1fr));

.card, article, .panel {

background: white;

border: 1px solid #e5e7ef;

border-radius: 16px;

padding: 20px;

box-shadow: 0 8px 25px rgba(20, 30, 60, 0.05);

font-size: 1.3rem;

font-weight: 800;

display: inline-block;

background: #edf2ff;

border: 1px solid #d6e0ff;

border-radius: 999px;

padding: 4px 10px;

margin: 3px;

font-size: 0.82rem;

.stock-low {

color: #b42318;

font-weight: 700;

.stock-ok {

color: #067647;

font-weight: 700;

code, pre {

background: #111827;

color: #d1fae5;

border-radius: 10px;

padding: 16px;

overflow-x: auto;

padding: 30px 40px;

color: #606779;

return f"""

{title}

{extra_head}

{title}

{body}

{extra_script}

def build_demo_site():

write_file(

SITE_DIR / "robots.txt",

User-agent: *

Disallow: /admin/

product_cards = []

for product in PRODUCTS:

product_cards.append(

{product['name']}

{product['category']} crawler module with rating {product['rating']}.

${product['price']:.2f}

Stock: {product['stock']}

write_file(

SITE_DIR / "index.html",

"Crawlee Demo Commerce + Docs Hub",

Why this site exists

This local website gives us predictable pages for testing Crawlee without scraping a third-party website.

We include static HTML pages, documentation pages, product detail pages, a blog article, robots.txt,

and a JavaScript-rendered page.

Featured crawler modules

{''.join(product_cards)}

Internal links for recursive crawling

Getting started guide

Advanced routing guide

Crawling at scale article

JavaScript-rendered catalog

Admin page blocked by robots and crawler filters

for product in PRODUCTS:

related_links = "\n".join(

for sku in product["related"]

feature_list = "\n".join(f"

{feature}

" for feature in product["features"])

json_ld = json.dumps(

"@context": "https://schema.org",

"@type": "Product",

"sku": product["sku"],

"name": product["name"],

"category": product["category"],

"offers": {

"@type": "Offer",

"price": product["price"],

"priceCurrency": "USD",

"aggregateRating": {

"@type": "AggregateRating",

"ratingValue": product["rating"],

write_file(

SITE_DIR / "products" / f"product-{safe_slug(product['sku'])}.html",

f"{product['name']} | Product Detail",

data-sku="{product['sku']}"

data-category="{product['category']}"

data-rating="{product['rating']}"

data-stock="{product['stock']}">

{product['name']}

SKU: {product['sku']}

Category: {product['category']}

${product['price']:.2f}

Rating: {product['rating']} / 5

Stock: {product['stock']}

Features

{feature_list}

Related modules

{related_links}

We create a realistic product catalog that becomes the structured data source for our demo website. We define reusable HTML layout logic, styling, navigation, and page templates to make the local website look and behave like a small commercial and documentation portal. We then generate the homepage and product detail pages, including prices, ratings, stock levels, product features, related links, and JSON-LD metadata.
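As a rough illustration of the JSON-LD step, the sketch below turns one catalog record into a schema.org `Product` payload and wraps it in a `script` tag of type `application/ld+json`. The helper names `product_json_ld` and `json_ld_script_tag` are hypothetical, and the exact markup the tutorial emits may differ.

```python
import json

def product_json_ld(product):
    """Build a schema.org Product dictionary from a catalog record."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "sku": product["sku"],
        "name": product["name"],
        "category": product["category"],
        "offers": {
            "@type": "Offer",
            "price": product["price"],
            "priceCurrency": "USD",
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": product["rating"],
        },
    }

def json_ld_script_tag(product):
    """Serialize the payload into a script tag that a crawler can parse later."""
    payload = json.dumps(product_json_ld(product), indent=2)
    return f'<script type="application/ld+json">\n{payload}\n</script>'

sample = {
    "sku": "CRW-101",
    "name": "Crawler Reliability Kit",
    "category": "automation",
    "price": 149.0,
    "rating": 4.8,
}
tag = json_ld_script_tag(sample)
print(tag)
```

Embedding the structured record next to the visible HTML gives the crawler a second, machine-readable source to cross-check scraped fields against.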

Adding Docs, Blog, Dynamic, and Admin Pages


write_file(

SITE_DIR / "docs" / "getting-started.html",

"Getting Started with Reliable Crawlers",

HTTP-first crawling strategy

We start with HTTP crawlers because they are lightweight and efficient.

Browser crawling is reserved for pages that need JavaScript rendering.

Core extraction fields

Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata.

crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)

Next: advanced routing

write_file(

SITE_DIR / "docs" / "advanced-routing.html",

"Advanced Routing and Storage",

Queue filtering

We filter links to keep the crawl focused on the same local domain and skip admin pages.

Storage design

Structured rows go to datasets. Binary screenshots and snapshots go to a key-value store.

await context.enqueue_links(include=[Glob("https://example.com/**")])

Read the scaling article

write_file(

SITE_DIR / "blog" / "crawling-at-scale.html",

"Crawling at Scale",

Scaling crawler jobs without losing reliability

Production crawlers need controlled concurrency, retry behavior, stable request queues,

structured exports, and monitoring-ready output.

For AI data workflows, we also normalize text, preserve source URLs, create chunks,

and record extraction provenance.

queues

datasets

rag

playwright

    dynamic_items = json.dumps(
        [
            {
                "sku": "JS-900",
                "name": "Dynamic Inventory Scanner",
                "price": 329.0,
                "stock": 4,
                "desc": "Rendered only after JavaScript executes.",
            },
            {
                "sku": "JS-901",
                "name": "Client-Side Review Miner",
                "price": 279.0,
                "stock": 11,
                "desc": "Created by browser-side DOM manipulation.",
            },
            {
                "sku": "JS-902",
                "name": "Async Catalog Watcher",
                "price": 389.0,
                "stock": 7,
                "desc": "Useful for testing PlaywrightCrawler extraction.",
            },
        ]
    )

dynamic_script = f"""

const dynamicItems = {dynamic_items};

function renderItems() {{

const root = document.querySelector("#dynamic-products");

root.innerHTML = "";

for (const item of dynamicItems) {{

const card = document.createElement("div");

card.className = "card js-card";

card.dataset.sku = item.sku;

card.dataset.price = item.price;

card.dataset.stock = item.stock;

card.innerHTML = `

${{item.name}}

${{item.desc}}

$${{item.price.toFixed(2)}}

Stock: ${{item.stock}}

root.appendChild(card);

document.querySelector("#render-status").textContent =

"Rendered " + dynamicItems.length + " JavaScript items.";

setTimeout(renderItems, 600);

write_file(

SITE_DIR / "dynamic.html",

"JavaScript Rendered Catalog",

Dynamic content test

A plain HTTP crawler can download this page, but it will not see the cards below until JavaScript runs.

PlaywrightCrawler opens a real browser and extracts the rendered DOM.

Waiting for JavaScript rendering...

extra_script=dynamic_script,

write_file(

SITE_DIR / "admin" / "hidden.html",

"Hidden Admin Page",

This page should be skipped

The crawler excludes this admin path to demonstrate control over the crawl scope.

build_demo_site()

print(f"Demo site generated at: {SITE_DIR}")

class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        # Suppress per-request logging so the crawl output stays readable.
        pass

def start_local_server(directory):
    # Ask the OS for a free port, then release it for the HTTP server to use.
    probe = socket.socket()
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()

    handler = partial(QuietHandler, directory=str(directory))
    httpd = ThreadingHTTPServer(("127.0.0.1", port), handler)

    # Serve in a daemon thread so the notebook cell keeps running.
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()

    base_url = f"http://127.0.0.1:{port}"
    time.sleep(0.5)
    return httpd, base_url

def extract_json_ld(soup):
    """Collect JSON-LD blocks, keeping the raw text when parsing fails."""
    blocks = []
    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        if not raw:
            continue
        try:
            blocks.append(json.loads(raw))
        except Exception:
            blocks.append({"raw": raw})
    return blocks

def write_json(path, rows):
    path = Path(path)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")

def write_csv(path, rows):
    path = Path(path)
    if not rows:
        path.write_text("", encoding="utf-8")
        return

    # Serialize nested lists and dicts as JSON strings so each row stays flat.
    flattened = []
    for row in rows:
        flat = {}
        for key, value in row.items():
            if isinstance(value, (list, dict)):
                flat[key] = json.dumps(value, ensure_ascii=False)
            else:
                flat[key] = value
        flattened.append(flat)

    fieldnames = sorted({key for row in flattened for key in row.keys()})
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(flattened)

We expand the demo website by adding documentation pages, a blog article, a JavaScript-rendered catalog page, and an admin page intended to be excluded from crawling. We use these pages to test different crawling scenarios, including static HTML extraction, documentation parsing, blog metadata extraction, dynamic browser rendering, and crawl filtering. We also start a local HTTP server and define utilities to extract JSON-LD content and export crawl results to JSON and CSV.
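To see what the demo site's robots.txt rule means in practice, the following standalone sketch evaluates the same two directives with the standard library's `urllib.robotparser`. This is not how Crawlee applies robots rules internally; it is only an independent check, and the port `8000` in the URLs is an assumed placeholder.

```python
from urllib.robotparser import RobotFileParser

# The same two directives the demo site writes to robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="*"):
    """Return True when the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

# Hypothetical local URLs; the real port is chosen at runtime.
print(allowed("http://127.0.0.1:8000/index.html"))
print(allowed("http://127.0.0.1:8000/admin/hidden.html"))
```

A check like this is handy for confirming that a URL is excluded by policy rather than by a bug in the link filters.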

Static Crawling with BeautifulSoupCrawler and ParselCrawler


async def run_beautifulsoup_crawl(base_url):
    print("\n=== 1) BeautifulSoupCrawler: fast recursive HTTP crawl ===")
    rows = []

    crawler = BeautifulSoupCrawler(
        parser="html.parser",
        max_requests_per_crawl=30,
        max_request_retries=1,
        respect_robots_txt_file=True,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=4,
            max_concurrency=6,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup
        url = context.request.url
        title = normalize_text(soup.title.get_text(" ", strip=True) if soup.title else "")

        meta_description = ""
        meta_tag = soup.find("meta", attrs={"name": "description"})
        if meta_tag:
            meta_description = normalize_text(meta_tag.get("content", ""))

        out_links = []
        for a in soup.select("a[href]"):
            href = a.get("href")
            label = normalize_text(a.get_text(" ", strip=True), 120)
            out_links.append({"href": href, "label": label})

        page_text = normalize_text(soup.get_text(" ", strip=True), 1000)

        # Classify the page from its URL path.
        if "/products/" in url:
            page_type = "product"
        elif "/docs/" in url:
            page_type = "documentation"
        elif "/blog/" in url:
            page_type = "blog"
        elif "/dynamic" in url:
            page_type = "dynamic-shell"
        else:
            page_type = "index"

        row = {
            "source": "beautifulsoup-http",
            "url": url,
            "title": title,
            "page_type": page_type,
            "meta_description": meta_description,
            "text_preview": page_text,
            "out_links": out_links,
            "json_ld": extract_json_ld(soup),
            "extracted_at_unix": time.time(),
        }

        if page_type == "product":
            article = soup.select_one("article.product")
            if article:
                price_node = soup.select_one(".price")
                row["product"] = {
                    "sku": article.get("data-sku"),
                    "category": article.get("data-category"),
                    "name": normalize_text(
                        soup.select_one(".product-title").get_text(" ", strip=True)
                        if soup.select_one(".product-title")
                        else ""
                    ),
                    "price": money_to_float(price_node.get("data-price") if price_node else None),
                    "rating": float(article.get("data-rating")) if article.get("data-rating") else None,
                    "stock": int(article.get("data-stock")) if article.get("data-stock") else None,
                    "features": [
                        normalize_text(li.get_text(" ", strip=True))
                        for li in soup.select(".features li")
                    ],
                }

        if page_type == "documentation":
            row["doc"] = {
                "headings": [
                    normalize_text(h.get_text(" ", strip=True))
                    for h in soup.select("h2, h3")
                ],
                "code_blocks": [
                    normalize_text(code.get_text(" ", strip=True))
                    for code in soup.select("pre code")
                ],
            }

        if page_type == "blog":
            row["blog"] = {
                "author": soup.select_one(".blog-post").get("data-author") if soup.select_one(".blog-post") else None,
                "reading_time": soup.select_one(".blog-post").get("data-reading-time") if soup.select_one(".blog-post") else None,
                "tags": [
                    normalize_text(tag.get_text(" ", strip=True))
                    for tag in soup.select(".tag")
                ],
            }

        rows.append(row)
        await context.push_data(row)

        # Stay on the local site; skip the admin area and the JavaScript page.
        await context.enqueue_links(
            include=[Glob(f"{base_url}/**")],
            exclude=[
                Glob(f"{base_url}/admin/**"),
                Glob(f"{base_url}/dynamic.html"),
            ],
        )

    await crawler.run([f"{base_url}/index.html"])

    write_json(OUTPUT_DIR / "beautifulsoup_crawl.json", rows)
    write_csv(OUTPUT_DIR / "beautifulsoup_crawl.csv", rows)
    print(f"BeautifulSoup rows extracted: {len(rows)}")
    return rows

async def run_parsel_precision_crawl(base_url):
    print("\n=== 2) ParselCrawler: precise CSS/XPath extraction from product pages ===")
    rows = []

    product_urls = [
        f"{base_url}/products/product-{safe_slug(product['sku'])}.html"
        for product in PRODUCTS
    ]

    crawler = ParselCrawler(
        max_requests_per_crawl=len(product_urls),
        max_request_retries=1,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=5,
            max_concurrency=8,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        selector = context.selector

        title = selector.css("title::text").get()
        sku = selector.css("article.product::attr(data-sku)").get()
        category = selector.css("article.product::attr(data-category)").get()
        rating = selector.css("article.product::attr(data-rating)").get()
        stock = selector.css("article.product::attr(data-stock)").get()
        name = selector.css(".product-title::text").get()
        price = selector.css(".price::attr(data-price)").get()
        features = [
            normalize_text(feature)
            for feature in selector.css(".features li::text").getall()
        ]

        row = {
            "source": "parsel-precision",
            "url": context.request.url,
            "title": normalize_text(title),
            "sku": sku,
            "name": normalize_text(name),
            "category": category,
            "price": money_to_float(price),
            "rating": float(rating) if rating else None,
            "stock": int(stock) if stock else None,
            "features": features,
            # The same title read through XPath, for comparison with the CSS result.
            "xpath_title": normalize_text(selector.xpath("//title/text()").get()),
        }

        rows.append(row)
        await context.push_data(row)

    await crawler.run(product_urls)

    write_json(OUTPUT_DIR / "parsel_products.json", rows)
    write_csv(OUTPUT_DIR / "parsel_products.csv", rows)
    print(f"Parsel product rows extracted: {len(rows)}")
    return rows

We implement the static crawling part of the workflow using BeautifulSoupCrawler and ParselCrawler. With BeautifulSoupCrawler, we recursively crawl the local website and extract page titles, metadata, text previews, outgoing links, product details, documentation headings, code blocks, and blog tags. With ParselCrawler, we perform more targeted CSS and XPath extraction from product pages to collect clean product-level fields, including SKU, category, price, rating, stock, and features.
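The product fields come from `data-*` attributes on the page markup. As a dependency-free illustration of that idea, the sketch below reads the same attributes with the standard library's `html.parser` instead of BeautifulSoup or Parsel. The sample HTML is an assumption modeled on the selectors used in the crawlers (`article.product`, `.price`), and the class `ProductAttributeParser` is hypothetical.

```python
from html.parser import HTMLParser

# Assumed markup, shaped after the selectors the crawlers query.
SAMPLE_HTML = """
<article class="product" data-sku="CRW-101" data-category="automation"
         data-rating="4.8" data-stock="18">
  <h1 class="product-title">Crawler Reliability Kit</h1>
  <p class="price" data-price="149.00">$149.00</p>
</article>
"""

class ProductAttributeParser(HTMLParser):
    """Collect data-* attributes from the product article and price node."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "article" and "product" in classes:
            for key in ("sku", "category", "rating", "stock"):
                self.fields[key] = attributes.get(f"data-{key}")
        if "price" in classes:
            self.fields["price"] = attributes.get("data-price")

def parse_product(html):
    """Return typed product fields, using None for anything missing."""
    parser = ProductAttributeParser()
    parser.feed(html)
    raw = parser.fields
    return {
        "sku": raw.get("sku"),
        "category": raw.get("category"),
        "price": float(raw["price"]) if raw.get("price") else None,
        "rating": float(raw["rating"]) if raw.get("rating") else None,
        "stock": int(raw["stock"]) if raw.get("stock") else None,
    }

product = parse_product(SAMPLE_HTML)
print(product)
```

Reading typed values from attributes rather than from display text avoids having to strip currency symbols or labels before converting to numbers.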

Dynamic Rendering with PlaywrightCrawler and Link Graphs


async def run_playwright_dynamic_crawl(base_url):
    print("\n=== 3) PlaywrightCrawler: browser-rendered JavaScript crawl ===")
    rows = []

    crawler = PlaywrightCrawler(
        max_requests_per_crawl=2,
        max_request_retries=1,
        headless=True,
        browser_type="chromium",
        browser_launch_options={
            "args": ["--no-sandbox", "--disable-dev-shm-usage"],
        },
        goto_options={
            "wait_until": "domcontentloaded",
        },
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=1,
            max_concurrency=2,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait until the client-side script has rendered the cards.
        await context.page.wait_for_selector(".js-card", timeout=10000)

        cards = await context.page.locator(".js-card").evaluate_all(
            """
            (cards) => cards.map((card) => {
                const h3 = card.querySelector("h3");
                const desc = card.querySelector(".desc");
                const price = card.querySelector(".price");
                return {
                    sku: card.dataset.sku,
                    name: h3 ? h3.textContent.trim() : null,
                    description: desc ? desc.textContent.trim() : null,
                    price_text: price ? price.textContent.trim() : null,
                    price: Number(card.dataset.price),
                    stock: Number(card.dataset.stock),
                    rendered_text: card.innerText.trim()
                };
            })
            """
        )

        screenshot_bytes = await context.page.screenshot(full_page=True)
        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        screenshot_path.write_bytes(screenshot_bytes)

        # Also store the screenshot in the key-value store when available.
        try:
            kvs = await context.get_key_value_store()
            await kvs.set_value(
                key="dynamic-catalog-full-page",
                value=screenshot_bytes,
                content_type="image/png",
            )
        except Exception as exc:
            print("Key-value store screenshot save skipped:", repr(exc))

        for card in cards:
            # Merge the rendered card fields into the row so later steps can read them.
            row = {
                "source": "playwright-rendered-js",
                "url": context.request.url,
                **card,
                "screenshot_path": str(screenshot_path),
                "extracted_at_unix": time.time(),
            }
            rows.append(row)

        await context.push_data(rows)

    try:
        await crawler.run([f"{base_url}/dynamic.html"])
    except Exception as exc:
        print("Playwright section failed gracefully.")
        print("Reason:", repr(exc))

    write_json(OUTPUT_DIR / "playwright_dynamic.json", rows)
    write_csv(OUTPUT_DIR / "playwright_dynamic.csv", rows)
    print(f"Playwright dynamic rows extracted: {len(rows)}")
    return rows

def flatten_products(rows):
    """Map rows from all three crawlers onto one product schema."""
    products = []
    for row in rows:
        if row.get("page_type") == "product" and isinstance(row.get("product"), dict):
            product = row["product"]
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": product.get("sku"),
                    "name": product.get("name"),
                    "category": product.get("category"),
                    "price": product.get("price"),
                    "rating": product.get("rating"),
                    "stock": product.get("stock"),
                    "features": "; ".join(product.get("features", [])),
                }
            )
        elif row.get("source") == "parsel-precision":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": row.get("category"),
                    "price": row.get("price"),
                    "rating": row.get("rating"),
                    "stock": row.get("stock"),
                    "features": "; ".join(row.get("features", [])),
                }
            )
        elif row.get("source") == "playwright-rendered-js":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": "dynamic-js",
                    "price": row.get("price") or money_to_float(row.get("price_text")),
                    "rating": None,
                    "stock": row.get("stock"),
                    "features": row.get("description"),
                }
            )
    return products

def absolute_url(base_url, href):
    """Turn an href into an absolute URL under base_url."""
    if not href:
        return None
    if href.startswith("http://") or href.startswith("https://"):
        return href
    if href.startswith("/"):
        return base_url + href
    return base_url + "/" + href

def build_link_graph(base_url, rows):
    """Build a directed graph of pages (nodes) and links (edges)."""
    graph = nx.DiGraph()
    for row in rows:
        src = row.get("url")
        if not src:
            continue
        graph.add_node(
            src,
            title=row.get("title", ""),
            page_type=row.get("page_type", ""),
        )
        for link in row.get("out_links", []) or []:
            dst = absolute_url(base_url, link.get("href"))
            if not dst:
                continue
            # Keep admin pages out of the graph, matching the crawl scope.
            if "/admin/" in dst:
                continue
            graph.add_node(dst)
            graph.add_edge(src, dst, label=link.get("label", ""))
    return graph

We handle dynamic content using PlaywrightCrawler, which opens the JavaScript-rendered page in a headless Chromium browser. We wait for client-side product cards to appear, extract their rendered fields, capture a full-page screenshot, and save the browser-based results for later analysis. We then define helper functions to normalize product records and build a directed link graph from the internal links discovered during crawling.
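One detail worth illustrating is how relative links become graph edges. The sketch below is an alternative take, not the tutorial's implementation: it resolves each href against the page it was found on with `urllib.parse.urljoin` and stores the graph as a plain dictionary of sets rather than a NetworkX `DiGraph`. The function names, the sample hrefs, and port `8000` are all assumptions.

```python
from urllib.parse import urljoin, urlparse

def resolve_links(page_url, hrefs):
    """Resolve hrefs against the page they appeared on, skipping non-page links."""
    resolved = []
    for href in hrefs:
        if not href or href.startswith(("#", "mailto:", "javascript:")):
            continue
        resolved.append(urljoin(page_url, href))
    return resolved

def build_adjacency(pages, skip_prefix="/admin/"):
    """Return {source_url: {target_url, ...}}, ignoring paths under skip_prefix."""
    graph = {}
    for page_url, hrefs in pages.items():
        targets = graph.setdefault(page_url, set())
        for target in resolve_links(page_url, hrefs):
            if urlparse(target).path.startswith(skip_prefix):
                continue
            targets.add(target)
            graph.setdefault(target, set())  # make sure every target is a node
    return graph

# Hypothetical crawl output: page URL -> hrefs found on that page.
pages = {
    "http://127.0.0.1:8000/index.html": [
        "docs/getting-started.html",
        "/admin/hidden.html",
    ],
    "http://127.0.0.1:8000/docs/getting-started.html": [
        "advanced-routing.html",
        "../index.html",
    ],
}
graph = build_adjacency(pages)
for source, targets in sorted(graph.items()):
    print(source, "->", sorted(targets))
```

Resolving against the source page matters for links inside subdirectories: `advanced-routing.html` found on a `/docs/` page should point to `/docs/advanced-routing.html`, not to the site root.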

Building AI-Ready Outputs and Running the Pipeline


def make_rag_chunks(rows, max_chars=700):
    """Pack sentences into chunks of at most max_chars, with source metadata."""
    chunks = []
    for row in rows:
        text = (
            row.get("text_preview")
            or row.get("rendered_text")
            or row.get("description")
        )
        text = normalize_text(text)
        if not text:
            continue

        sentences = re.split(r"(?<=[.!?])\s+", text)
        current = ""
        for sentence in sentences:
            if len(current) + len(sentence) + 1 <= max_chars:
                current = (current + " " + sentence).strip()
            else:
                # The chunk is full: emit it and start a new one.
                if current:
                    chunks.append(
                        {
                            "chunk_id": hashlib.sha1(
                                (row.get("url", "") + current).encode()
                            ).hexdigest()[:12],
                            "url": row.get("url"),
                            "source": row.get("source"),
                            "page_type": row.get("page_type"),
                            "title": row.get("title") or row.get("name"),
                            "text": current,
                        }
                    )
                current = sentence

        # Emit whatever remains after the last sentence.
        if current:
            chunks.append(
                {
                    "chunk_id": hashlib.sha1(
                        (row.get("url", "") + current).encode()
                    ).hexdigest()[:12],
                    "url": row.get("url"),
                    "source": row.get("source"),
                    "page_type": row.get("page_type"),
                    "title": row.get("title") or row.get("name"),
                    "text": current,
                }
            )
    return chunks

def analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows):
    all_rows = bs4_rows + parsel_rows + playwright_rows
    products = flatten_products(all_rows)

    crawl_df = pd.DataFrame(all_rows)
    product_df = pd.DataFrame(products)

    if not product_df.empty:
        product_df["price"] = pd.to_numeric(product_df["price"], errors="coerce")
        product_df["stock"] = pd.to_numeric(product_df["stock"], errors="coerce")
        product_df["rating"] = pd.to_numeric(product_df["rating"], errors="coerce")
        product_df["inventory_value"] = product_df["price"] * product_df["stock"]

    # Link graph from the recursive HTTP crawl.
    graph = build_link_graph(base_url, bs4_rows)
    graph_path = OUTPUT_DIR / "site_link_graph.graphml"
    if graph.number_of_nodes() > 0:
        nx.write_graphml(graph, graph_path)

    # RAG-ready chunks, one JSON object per line.
    chunks = make_rag_chunks(all_rows)
    rag_path = OUTPUT_DIR / "rag_chunks.jsonl"
    with rag_path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

    crawl_json_path = OUTPUT_DIR / "combined_crawl_results.json"
    crawl_json_path.write_text(
        json.dumps(all_rows, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )

    product_csv_path = OUTPUT_DIR / "normalized_product_catalog.csv"
    if not product_df.empty:
        product_df.to_csv(product_csv_path, index=False)

    price_plot_path = OUTPUT_DIR / "product_price_chart.png"
    if not product_df.empty and product_df["price"].notna().any():
        plot_df = product_df.dropna(subset=["price"]).copy()
        plot_df["label"] = plot_df["sku"].fillna("unknown") + "\n" + plot_df["source"].fillna("")
        # x and y are assumed from the label column built above and the axis titles below.
        ax = plot_df.plot(
            kind="bar",
            x="label",
            y="price",
            legend=False,
            figsize=(11, 5),
            title="Extracted Product Prices by Source",
        )
        ax.set_xlabel("Product / extraction source")
        ax.set_ylabel("Price")
        plt.xticks(rotation=35, ha="right")
        plt.tight_layout()
        plt.savefig(price_plot_path, dpi=160)

    graph_stats = {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "weakly_connected_components": (
            nx.number_weakly_connected_components(graph)
            if graph.number_of_nodes()
            else 0
        ),
    }

    if graph.number_of_nodes() > 0:
        in_degrees = dict(graph.in_degree())
        out_degrees = dict(graph.out_degree())
        graph_stats["top_in_degree"] = sorted(
            in_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )
        graph_stats["top_out_degree"] = sorted(
            out_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )

    summary = {
        "base_url": base_url,
        "rows_total": len(all_rows),
        "beautifulsoup_rows": len(bs4_rows),
        "parsel_rows": len(parsel_rows),
        "playwright_rows": len(playwright_rows),
        "products_total": len(product_df),
        "rag_chunks_total": len(chunks),
        "graph": graph_stats,
        "outputs": {
            "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
            "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
            "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
            "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
            "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
            "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
            "combined_json": str(crawl_json_path),
            "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
            "rag_jsonl": str(rag_path),
            "graphml": str(graph_path) if graph_path.exists() else None,
            "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
            "screenshots_dir": str(SCREENSHOT_DIR),
        },
    }

    summary_path = OUTPUT_DIR / "run_summary.md"
    summary_path.write_text(
        "# Crawlee Python Advanced Tutorial Run Summary\n\n"
        f"- Local demo site: `{base_url}`\n"
        f"- Total extracted rows: `{summary['rows_total']}`\n"
        f"- BeautifulSoup rows: `{summary['beautifulsoup_rows']}`\n"
        f"- Parsel rows: `{summary['parsel_rows']}`\n"
        f"- Playwright rows: `{summary['playwright_rows']}`\n"
        f"- Normalized products: `{summary['products_total']}`\n"
        f"- RAG chunks: `{summary['rag_chunks_total']}`\n"
        f"- Link graph nodes: `{graph_stats['nodes']}`\n"
        f"- Link graph edges: `{graph_stats['edges']}`\n\n"
        "## Output files\n\n"
        + "\n".join(f"- `{k}`: `{v}`" for k, v in summary["outputs"].items()),
        encoding="utf-8",
    )

    print("\n=== 4) Analysis summary ===")
    print(json.dumps(summary, indent=2, ensure_ascii=False))

    # Rich previews only work inside a notebook; fall back to plain output elsewhere.
    try:
        from IPython.display import display, Markdown, Image as IPImage

        display(Markdown("## Crawlee crawl preview"))
        if not crawl_df.empty:
            preview_cols = [
                col for col in ["source", "page_type", "title", "url"]
                if col in crawl_df.columns
            ]
            display(crawl_df[preview_cols].head(12))

        display(Markdown("## Normalized product catalog"))
        if not product_df.empty:
            display(product_df.head(20))

        if price_plot_path.exists():
            display(Markdown("## Product price chart"))
            display(IPImage(filename=str(price_plot_path)))

        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        if screenshot_path.exists():
            display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
            display(IPImage(filename=str(screenshot_path)))

        display(Markdown(f"## Output directory\n`{OUTPUT_DIR}`"))
    except Exception as exc:
        print("Notebook display skipped:", repr(exc))

    return summary

async def main():
    httpd, base_url = start_local_server(SITE_DIR)
    print(f"\nLocal demo website is running at: {base_url}/index.html")
    try:
        bs4_rows = await run_beautifulsoup_crawl(base_url)
        parsel_rows = await run_parsel_precision_crawl(base_url)
        playwright_rows = await run_playwright_dynamic_crawl(base_url)
        summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
        return summary
    finally:
        # Stop the demo server even if a crawl step fails.
        httpd.shutdown()
        print("\nLocal demo server shut down.")

loop = asyncio.get_event_loop()
summary = loop.run_until_complete(main())

print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
    if file_path.is_file():
        print(" -", file_path)

We process the extracted crawl data into analysis-ready and AI-ready outputs. We create RAG-style JSONL chunks, combine all crawl results, build a normalized product catalog, generate a GraphML link graph, and visualize product prices with Matplotlib. Finally, we run the full pipeline end-to-end, display previews in the notebook, save all generated artifacts, and print the final output file paths.
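The chunking idea can be tried in isolation. The following sketch packs sentences greedily into chunks under a character budget and serializes each chunk as one JSON line with a short, content-derived ID. The names `chunk_text` and `to_jsonl` and the sample URL are hypothetical, and this is a simplified variant rather than the tutorial's function: a single sentence longer than the budget is kept whole instead of being split.

```python
import hashlib
import json
import re

def chunk_text(text, max_chars=80):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        # Accept the sentence if it fits, or if the chunk is still empty.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

def to_jsonl(url, text, max_chars=80):
    """Serialize chunks as JSON Lines with an ID derived from URL and text."""
    lines = []
    for chunk in chunk_text(text, max_chars):
        chunk_id = hashlib.sha1((url + chunk).encode("utf-8")).hexdigest()[:12]
        lines.append(json.dumps({"chunk_id": chunk_id, "url": url, "text": chunk}))
    return "\n".join(lines)

demo_chunks = chunk_text("One. Two. Three.", max_chars=9)
print(demo_chunks)
print(to_jsonl("http://127.0.0.1:8000/index.html", "One. Two. Three.", max_chars=9))
```

Because the ID is a hash of the source URL plus the chunk text, re-running the export on unchanged pages yields the same IDs, which makes deduplication in a vector store straightforward.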

In conclusion, we have a complete Crawlee-based pipeline for crawling and data engineering that converts a small website into structured, reusable datasets. We used crawl scoping, robots.txt handling, concurrency settings, link enqueuing, browser rendering, key-value storage, and dataset exports to simulate patterns used in production web crawling systems. We normalized the extracted product data, saved the crawl outputs as JSON and CSV, created GraphML link graphs with NetworkX, generated JSONL chunks for retrieval-augmented generation workflows, and visualized the extracted product prices with Matplotlib.


The post Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export appeared first on MarkTechPost.
