In this tutorial, we build a full Crawlee-for-Python workflow that covers environment setup, local website generation, static crawling, dynamic crawling, structured extraction, and downstream data processing. We begin by configuring a compatible Crawlee runtime with pinned Pydantic support, Playwright browser installation, persistent storage directories, and Colab-safe execution handling. We then generate a realistic local demo website containing product pages, documentation pages, blog content, internal links, robots.txt rules, JSON-LD metadata, and JavaScript-rendered catalog items. Using BeautifulSoupCrawler, we perform fast recursive HTML crawling and extract page titles, metadata, text previews, outgoing links, product attributes, documentation headings, code blocks, and blog tags. With ParselCrawler, we run precise CSS- and XPath-based extraction on product detail pages. With PlaywrightCrawler, we render JavaScript content in a headless Chromium browser, wait for dynamic DOM elements to appear, extract client-side data, and capture full-page screenshots.
raise RuntimeError(f"Command failed with exit code {result.returncode}: {command}")
print("PHASE 1: Installing compatible Crawlee + Pydantic + Playwright dependencies.")
print("After this finishes, Colab will restart automatically. Then run this same cell again.")
sh(f'{sys.executable} -m pip uninstall -y crawlee pydantic pydantic-core', check=False)
print("\nRestarting Colab runtime now. After it reconnects, run this same cell again.")
raise SystemExit("Setup complete. Restart the runtime/kernel manually, then run this cell again.")
print("PHASE 2: Dependencies are ready. Running the Crawlee tutorial.")
We begin by preparing the complete Colab runtime for the Crawlee tutorial. We install compatible versions of Crawlee, Pydantic, Playwright, and the required analysis libraries, and handle the automatic restart required after setup. We then configure storage folders, environment variables, crawler imports, and helper functions to ensure the rest of the workflow runs smoothly.
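The crawler handlers later in the tutorial call `normalize_text`, `money_to_float`, and `safe_slug`, but their definitions are not part of this excerpt. The sketch below is one plausible shape for them, inferred only from how they are called (an optional length limit, tolerance for `None`, a SKU turned into a file-name slug). The exact behavior of the tutorial's own helpers is an assumption here.

```python
import re


def normalize_text(value, max_len=None):
    """Collapse whitespace runs into single spaces; None becomes ''."""
    text = re.sub(r"\s+", " ", str(value or "")).strip()
    if max_len is not None and len(text) > max_len:
        text = text[:max_len].rstrip()
    return text


def money_to_float(value):
    """Parse strings such as '$1,299.50' into a float; return None on failure."""
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return None


def safe_slug(value):
    """Lowercase the value and replace non-alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", str(value).lower()).strip("-")
    return slug or "item"  # hypothetical fallback for empty input
```

Keeping these helpers tolerant of `None` matters because selector lookups such as `.get()` return `None` when an element is missing.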
font-family: Inter, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
{product['category']} crawler module with rating {product['rating']}.
${product['price']:.2f}
Stock: {product['stock']}
write_file(
SITE_DIR / "index.html",
"Crawlee Demo Commerce + Docs Hub",
Why this site exists
This local website gives us predictable pages for testing Crawlee without scraping a third-party website.
We include static HTML pages, documentation pages, product detail pages, a blog article, robots.txt,
and a JavaScript-rendered page.
Featured crawler modules
{''.join(product_cards)}
Internal links for recursive crawling
Getting started guide
Advanced routing guide
Crawling at scale article
JavaScript-rendered catalog
Admin page blocked by robots and crawler filters
for product in PRODUCTS:
related_links = "\n".join(
f'
{sku}'
for sku in product["related"]
feature_list = "\n".join(f"
{feature}" for feature in product["features"])
json_ld = json.dumps(
"@context": "https://schema.org",
"@type": "Product",
"sku": product["sku"],
"name": product["name"],
"category": product["category"],
"offers": {
"@type": "Offer",
"price": product["price"],
"priceCurrency": "USD",
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": product["rating"],
write_file(
SITE_DIR / "products" / f"product-{safe_slug(product['sku'])}.html",
f"{product['name']} | Product Detail",
data-sku="{product['sku']}"
data-category="{product['category']}"
data-rating="{product['rating']}"
data-stock="{product['stock']}">
{product['name']}
SKU: {product['sku']}
Category: {product['category']}
${product['price']:.2f}
Rating: {product['rating']} / 5
Stock: {product['stock']}
Features
Related modules
We create a realistic product catalog that becomes the structured data source for our demo website. We define reusable HTML layout logic, styling, navigation, and page templates to make the local website look and behave like a small commercial and documentation portal. We then generate the homepage and product detail pages, including prices, ratings, stock levels, product features, related links, and JSON-LD metadata.
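To make the JSON-LD step concrete, here is a minimal sketch that builds a schema.org `Product` document from one catalog entry and wraps it in a script tag. It uses only the fields visible in the excerpt above; the sample product and the `json_ld_script_tag` helper are hypothetical, and the tutorial's own template may include more fields.

```python
import json


def product_json_ld(product):
    """Serialize one product dict as a schema.org Product JSON-LD string."""
    return json.dumps(
        {
            "@context": "https://schema.org",
            "@type": "Product",
            "sku": product["sku"],
            "name": product["name"],
            "category": product["category"],
            "offers": {
                "@type": "Offer",
                "price": product["price"],
                "priceCurrency": "USD",
            },
            "aggregateRating": {
                "@type": "AggregateRating",
                "ratingValue": product["rating"],
            },
        },
        indent=2,
    )


def json_ld_script_tag(product):
    """Wrap the JSON-LD payload in the tag crawlers look for."""
    return f'<script type="application/ld+json">{product_json_ld(product)}</script>'


# Hypothetical catalog entry, for illustration only.
sample = {"sku": "DEMO-1", "name": "Sample Module", "category": "demo",
          "price": 10.0, "rating": 4.5}
tag = json_ld_script_tag(sample)
```

Embedding the same values in both `data-*` attributes and JSON-LD gives the crawlers two independent extraction paths to compare.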
Adding Docs, Blog, Dynamic, and Admin Pages
write_file(
SITE_DIR / "docs" / "getting-started.html",
"Getting Started with Reliable Crawlers",
HTTP-first crawling strategy
We start with HTTP crawlers because they are lightweight and efficient.
Browser crawling is reserved for pages that need JavaScript rendering.
Core extraction fields
Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata.
crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)
Next: advanced routing
write_file(
SITE_DIR / "docs" / "advanced-routing.html",
"Advanced Routing and Storage",
Queue filtering
We filter links to keep the crawl focused on the same local domain and skip admin pages.
Storage design
Structured rows go to datasets. Binary screenshots and snapshots go to a key-value store.
await context.enqueue_links(include=[Glob("https://example.com/**")])
Read the scaling article
write_file(
SITE_DIR / "blog" / "crawling-at-scale.html",
"Crawling at Scale",
Scaling crawler jobs without losing reliability
Production crawlers need controlled concurrency, retry behavior, stable request queues,
structured exports, and monitoring-ready output.
For AI data workflows, we also normalize text, preserve source URLs, create chunks,
and record extraction provenance.
queues
datasets
rag
playwright
dynamic_items = json.dumps(
"sku": "JS-900",
"name": "Dynamic Inventory Scanner",
"price": 329.0,
"stock": 4,
"desc": "Rendered only after JavaScript executes.",
"sku": "JS-901",
"name": "Client-Side Review Miner",
"price": 279.0,
"stock": 11,
"desc": "Created by browser-side DOM manipulation.",
"sku": "JS-902",
"name": "Async Catalog Watcher",
"price": 389.0,
"stock": 7,
"desc": "Useful for testing PlaywrightCrawler extraction.",
dynamic_script = f"""
const dynamicItems = {dynamic_items};
function renderItems() {{
const root = document.querySelector("#dynamic-products");
root.innerHTML = "";
for (const item of dynamicItems) {{
const card = document.createElement("div");
card.className = "card js-card";
card.dataset.sku = item.sku;
card.dataset.price = item.price;
card.dataset.stock = item.stock;
card.innerHTML = `
${{item.name}}
${{item.desc}}
$${{item.price.toFixed(2)}}
Stock: ${{item.stock}}
root.appendChild(card);
document.querySelector("#render-status").textContent =
"Rendered " + dynamicItems.length + " JavaScript items.";
setTimeout(renderItems, 600);
write_file(
SITE_DIR / "dynamic.html",
"JavaScript Rendered Catalog",
Dynamic content test
A plain HTTP crawler can download this page, but it will not see the cards below until JavaScript runs.
PlaywrightCrawler opens a real browser and extracts the rendered DOM.
Waiting for JavaScript rendering...
extra_script=dynamic_script,
write_file(
SITE_DIR / "admin" / "hidden.html",
"Hidden Admin Page",
This page should be skipped
The crawler excludes this admin path to demonstrate control over the crawl scope.
build_demo_site()
print(f"Demo site generated at: {SITE_DIR}")
class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        # Suppress per-request logging so crawler output stays readable.
        return


def start_local_server(directory):
    # Bind to port 0 so the OS picks a free port, then release the probe socket.
    probe = socket.socket()
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()
    handler = partial(QuietHandler, directory=str(directory))
    httpd = ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Serve in a daemon thread so the server does not block the notebook.
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    base_url = f"http://127.0.0.1:{port}"
    time.sleep(0.5)
    return httpd, base_url


def extract_json_ld(soup):
    blocks = []
    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        if not raw:
            continue
        try:
            blocks.append(json.loads(raw))
        except Exception:
            # Keep unparseable payloads so nothing is silently lost.
            blocks.append({"raw": raw})
    return blocks


def write_json(path, rows):
    path = Path(path)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")


def write_csv(path, rows):
    path = Path(path)
    if not rows:
        path.write_text("", encoding="utf-8")
        return
    flattened = []
    for row in rows:
        flat = {}
        for key, value in row.items():
            # Nested values are stored as JSON strings so each row stays flat.
            if isinstance(value, (list, dict)):
                flat[key] = json.dumps(value, ensure_ascii=False)
            else:
                flat[key] = value
        flattened.append(flat)
    fieldnames = sorted({key for row in flattened for key in row.keys()})
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(flattened)
We expand the demo website by adding documentation pages, a blog article, a JavaScript-rendered catalog page, and an admin page intended to be excluded from crawling. We use these pages to test different crawling scenarios, including static HTML extraction, documentation parsing, blog metadata extraction, dynamic browser rendering, and crawl filtering. We also start a local HTTP server and define utilities to extract JSON-LD content and export crawl results to JSON and CSV.
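The CSV export described above stores nested values (such as the list of outgoing links) as JSON strings inside a single cell. The following sketch, using an in-memory buffer and two made-up rows, shows that round trip: flatten, write, read back, and decode the nested cell again. The function names are hypothetical and only mirror the idea of the tutorial's `write_csv`.

```python
import csv
import io
import json


def flatten_row(row):
    """JSON-encode list and dict values so the row fits in flat CSV cells."""
    return {
        key: json.dumps(value, ensure_ascii=False)
        if isinstance(value, (list, dict)) else value
        for key, value in row.items()
    }


def rows_to_csv_text(rows):
    """Write rows with a union-of-keys header; missing keys become empty cells."""
    flattened = [flatten_row(row) for row in rows]
    fieldnames = sorted({key for row in flattened for key in row})
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(flattened)
    return buffer.getvalue()


# Made-up rows with different key sets, for illustration only.
rows = [
    {"url": "http://127.0.0.1/a.html", "out_links": [{"href": "b.html", "label": "B"}]},
    {"url": "http://127.0.0.1/b.html", "title": "Page B"},
]
csv_text = rows_to_csv_text(rows)
restored = list(csv.DictReader(io.StringIO(csv_text, newline="")))
links = json.loads(restored[0]["out_links"])  # decode the nested cell again
```

Because rows from different page types carry different keys, the header is the sorted union of all keys, and a consumer has to treat empty cells as "field absent".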
Static Crawling with BeautifulSoupCrawler and ParselCrawler
async def run_beautifulsoup_crawl(base_url):
    print("\n=== 1) BeautifulSoupCrawler: fast recursive HTTP crawl ===")
    rows = []
    crawler = BeautifulSoupCrawler(
        parser="html.parser",
        max_requests_per_crawl=30,
        max_request_retries=1,
        respect_robots_txt_file=True,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=4,
            max_concurrency=6,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup
        url = context.request.url
        title = normalize_text(soup.title.get_text(" ", strip=True) if soup.title else "")
        meta_description = ""
        meta_tag = soup.find("meta", attrs={"name": "description"})
        if meta_tag:
            meta_description = normalize_text(meta_tag.get("content", ""))
        out_links = []
        for a in soup.select("a[href]"):
            href = a.get("href")
            label = normalize_text(a.get_text(" ", strip=True), 120)
            out_links.append({"href": href, "label": label})
        page_text = normalize_text(soup.get_text(" ", strip=True), 1000)
        # Classify the page from its URL path.
        if "/products/" in url:
            page_type = "product"
        elif "/docs/" in url:
            page_type = "documentation"
        elif "/blog/" in url:
            page_type = "blog"
        elif "/dynamic" in url:
            page_type = "dynamic-shell"
        else:
            page_type = "index"
        row = {
            "source": "beautifulsoup-http",
            "url": url,
            "title": title,
            "page_type": page_type,
            "meta_description": meta_description,
            "text_preview": page_text,
            "out_links": out_links,
            "json_ld": extract_json_ld(soup),
            "extracted_at_unix": time.time(),
        }
        if page_type == "product":
            article = soup.select_one("article.product")
            if article:
                price_node = soup.select_one(".price")
                row["product"] = {
                    "sku": article.get("data-sku"),
                    "category": article.get("data-category"),
                    "name": normalize_text(
                        soup.select_one(".product-title").get_text(" ", strip=True)
                        if soup.select_one(".product-title")
                        else ""
                    ),
                    "price": money_to_float(price_node.get("data-price") if price_node else None),
                    "rating": float(article.get("data-rating")) if article.get("data-rating") else None,
                    "stock": int(article.get("data-stock")) if article.get("data-stock") else None,
                    "features": [
                        normalize_text(li.get_text(" ", strip=True))
                        for li in soup.select(".features li")
                    ],
                }
        if page_type == "documentation":
            row["doc"] = {
                "headings": [
                    normalize_text(h.get_text(" ", strip=True))
                    for h in soup.select("h2, h3")
                ],
                "code_blocks": [
                    normalize_text(code.get_text(" ", strip=True))
                    for code in soup.select("pre code")
                ],
            }
        if page_type == "blog":
            row["blog"] = {
                "author": soup.select_one(".blog-post").get("data-author") if soup.select_one(".blog-post") else None,
                "reading_time": soup.select_one(".blog-post").get("data-reading-time") if soup.select_one(".blog-post") else None,
                "tags": [
                    normalize_text(tag.get_text(" ", strip=True))
                    for tag in soup.select(".tag")
                ],
            }
        rows.append(row)
        await context.push_data(row)
        # Stay on the local site; skip the admin area and the JavaScript page.
        await context.enqueue_links(
            include=[Glob(f"{base_url}/**")],
            exclude=[
                Glob(f"{base_url}/admin/**"),
                Glob(f"{base_url}/dynamic.html"),
            ],
        )

    await crawler.run([f"{base_url}/index.html"])
    write_json(OUTPUT_DIR / "beautifulsoup_crawl.json", rows)
    write_csv(OUTPUT_DIR / "beautifulsoup_crawl.csv", rows)
    print(f"BeautifulSoup rows extracted: {len(rows)}")
    return rows
async def run_parsel_precision_crawl(base_url):
    print("\n=== 2) ParselCrawler: precise CSS/XPath extraction from product pages ===")
    rows = []
    product_urls = [
        f"{base_url}/products/product-{safe_slug(product['sku'])}.html"
        for product in PRODUCTS
    ]
    crawler = ParselCrawler(
        max_requests_per_crawl=len(product_urls),
        max_request_retries=1,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=5,
            max_concurrency=8,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        selector = context.selector
        # CSS selectors read attributes and text directly from the product markup.
        title = selector.css("title::text").get()
        sku = selector.css("article.product::attr(data-sku)").get()
        category = selector.css("article.product::attr(data-category)").get()
        rating = selector.css("article.product::attr(data-rating)").get()
        stock = selector.css("article.product::attr(data-stock)").get()
        name = selector.css(".product-title::text").get()
        price = selector.css(".price::attr(data-price)").get()
        features = [
            normalize_text(feature)
            for feature in selector.css(".features li::text").getall()
        ]
        row = {
            "source": "parsel-precision",
            "url": context.request.url,
            "title": normalize_text(title),
            "sku": sku,
            "name": normalize_text(name),
            "category": category,
            "price": money_to_float(price),
            "rating": float(rating) if rating else None,
            "stock": int(stock) if stock else None,
            "features": features,
            # The same title fetched via XPath, for comparison with the CSS result.
            "xpath_title": normalize_text(selector.xpath("//title/text()").get()),
        }
        rows.append(row)
        await context.push_data(row)

    await crawler.run(product_urls)
    write_json(OUTPUT_DIR / "parsel_products.json", rows)
    write_csv(OUTPUT_DIR / "parsel_products.csv", rows)
    print(f"Parsel product rows extracted: {len(rows)}")
    return rows
We implement the static crawling part of the workflow using BeautifulSoupCrawler and ParselCrawler. With BeautifulSoupCrawler, we recursively crawl the local website and extract page titles, metadata, text previews, outgoing links, product details, documentation headings, code blocks, and blog tags. With ParselCrawler, we perform more targeted CSS and XPath extraction from product pages to collect clean product-level fields, including SKU, category, price, rating, stock, and features.
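The BeautifulSoup crawl relies on two layers of scope control: the site's robots.txt and explicit include/exclude globs. The robots.txt contents are not shown in this excerpt, so the sketch below assumes a file that disallows `/admin/` and uses the standard library's `urllib.robotparser` to show how such a rule filters candidate URLs. This is a standalone illustration of the rule semantics, not a description of how Crawlee evaluates robots.txt internally.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for the demo site; the real file is not in the excerpt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def allowed_urls(base_url, paths, user_agent="*"):
    """Return the absolute URLs that the assumed robots.txt permits."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    candidates = [f"{base_url}{path}" for path in paths]
    return [url for url in candidates if parser.can_fetch(user_agent, url)]


base_url = "http://127.0.0.1:8000"  # placeholder; the tutorial picks a random port
result = allowed_urls(
    base_url,
    ["/index.html", "/admin/hidden.html", "/docs/getting-started.html"],
)
```

Using both robots.txt and an explicit exclude glob is redundant on purpose: the glob still protects the admin path if robots.txt is missing or changed.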
Dynamic Rendering with PlaywrightCrawler and Link Graphs
async def run_playwright_dynamic_crawl(base_url):
    print("\n=== 3) PlaywrightCrawler: browser-rendered JavaScript crawl ===")
    rows = []
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=2,
        max_request_retries=1,
        headless=True,
        browser_type="chromium",
        browser_launch_options={
            "args": ["--no-sandbox", "--disable-dev-shm-usage"],
        },
        goto_options={
            "wait_until": "domcontentloaded",
        },
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=1,
            max_concurrency=2,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait until the client-side script has rendered the product cards.
        await context.page.wait_for_selector(".js-card", timeout=10000)
        cards = await context.page.locator(".js-card").evaluate_all(
            """
            (cards) => cards.map((card) => {
                const h3 = card.querySelector("h3");
                const desc = card.querySelector(".desc");
                const price = card.querySelector(".price");
                return {
                    sku: card.dataset.sku,
                    name: h3 ? h3.textContent.trim() : null,
                    description: desc ? desc.textContent.trim() : null,
                    price_text: price ? price.textContent.trim() : null,
                    price: Number(card.dataset.price),
                    stock: Number(card.dataset.stock),
                    rendered_text: card.innerText.trim()
                };
            })
            """
        )
        screenshot_bytes = await context.page.screenshot(full_page=True)
        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        screenshot_path.write_bytes(screenshot_bytes)
        try:
            kvs = await context.get_key_value_store()
            await kvs.set_value(
                key="dynamic-catalog-full-page",
                value=screenshot_bytes,
                content_type="image/png",
            )
        except Exception as exc:
            print("Key-value store screenshot save skipped:", repr(exc))
        for card in cards:
            row = {
                "source": "playwright-rendered-js",
                "url": context.request.url,
                # Merge the card fields (sku, name, price, ...) into the row.
                **card,
                "screenshot_path": str(screenshot_path),
                "extracted_at_unix": time.time(),
            }
            rows.append(row)
        await context.push_data(rows)

    try:
        await crawler.run([f"{base_url}/dynamic.html"])
    except Exception as exc:
        print("Playwright section failed gracefully.")
        print("Reason:", repr(exc))
    write_json(OUTPUT_DIR / "playwright_dynamic.json", rows)
    write_csv(OUTPUT_DIR / "playwright_dynamic.csv", rows)
    print(f"Playwright dynamic rows extracted: {len(rows)}")
    return rows
def flatten_products(rows):
    # Map the three crawler-specific row shapes onto one product schema.
    products = []
    for row in rows:
        if row.get("page_type") == "product" and isinstance(row.get("product"), dict):
            product = row["product"]
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": product.get("sku"),
                    "name": product.get("name"),
                    "category": product.get("category"),
                    "price": product.get("price"),
                    "rating": product.get("rating"),
                    "stock": product.get("stock"),
                    "features": "; ".join(product.get("features", [])),
                }
            )
        elif row.get("source") == "parsel-precision":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": row.get("category"),
                    "price": row.get("price"),
                    "rating": row.get("rating"),
                    "stock": row.get("stock"),
                    "features": "; ".join(row.get("features", [])),
                }
            )
        elif row.get("source") == "playwright-rendered-js":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": "dynamic-js",
                    "price": row.get("price") or money_to_float(row.get("price_text")),
                    "rating": None,
                    "stock": row.get("stock"),
                    "features": row.get("description"),
                }
            )
    return products
def absolute_url(base_url, href):
    if not href:
        return None
    if href.startswith("http://") or href.startswith("https://"):
        return href
    if href.startswith("/"):
        return base_url + href
    return base_url + "/" + href


def build_link_graph(base_url, rows):
    graph = nx.DiGraph()
    for row in rows:
        src = row.get("url")
        if not src:
            continue
        graph.add_node(
            src,
            title=row.get("title", ""),
            page_type=row.get("page_type", ""),
        )
        for link in row.get("out_links", []) or []:
            dst = absolute_url(base_url, link.get("href"))
            if not dst:
                continue
            # Keep admin pages out of the graph, matching the crawl scope.
            if "/admin/" in dst:
                continue
            graph.add_node(dst)
            graph.add_edge(src, dst, label=link.get("label", ""))
    return graph
We handle dynamic content using PlaywrightCrawler, which opens the JavaScript-rendered page in a headless Chromium browser. We wait for client-side product cards to appear, extract their rendered fields, capture a full-page screenshot, and save the browser-based results for later analysis. We then define helper functions to normalize product records and build a directed link graph from the internal links discovered during crawling.
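One detail worth noting about the link graph: `absolute_url` joins every relative href onto the site root rather than onto the page it was found on. That is fine when hrefs are root-relative, but a relative href on a nested page (for example under `/products/`) resolves differently in a browser. The sketch below shows the standard-library alternative, `urllib.parse.urljoin`, which resolves against the page URL. The `resolve_link` helper and the example URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def resolve_link(page_url, href):
    """Resolve href the way a browser would, relative to the page it appears on."""
    if not href:
        return None
    # Drop the #fragment so two anchors on one page map to the same graph node.
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    return absolute


# Hypothetical page URL, for illustration only.
page = "http://127.0.0.1:8000/products/product-cr-100.html"
examples = {
    href: resolve_link(page, href)
    for href in ["../index.html", "product-cr-200.html", "/docs/getting-started.html#intro"]
}
```

Whether this matters for the demo site depends on how its templates write hrefs, which the excerpt does not show; for crawling arbitrary sites, resolving against the page URL is the safer default.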
Building AI-Ready Outputs and Running the Pipeline
def make_rag_chunks(rows, max_chars=700):
    chunks = []
    for row in rows:
        # Use whichever text field this crawler produced.
        text = (
            row.get("text_preview")
            or row.get("rendered_text")
            or row.get("description")
            or ""
        )
        text = normalize_text(text)
        if not text:
            continue
        sentences = re.split(r"(?<=[.!?])\s+", text)
        current = ""
        for sentence in sentences:
            # Greedily pack whole sentences until the chunk would exceed max_chars.
            if len(current) + len(sentence) + 1 <= max_chars:
                current = (current + " " + sentence).strip()
            else:
                if current:
                    chunks.append(
                        {
                            "chunk_id": hashlib.sha1(
                                (row.get("url", "") + current).encode()
                            ).hexdigest()[:12],
                            "url": row.get("url"),
                            "source": row.get("source"),
                            "page_type": row.get("page_type"),
                            "title": row.get("title") or row.get("name"),
                            "text": current,
                        }
                    )
                current = sentence
        # Flush the final partial chunk.
        if current:
            chunks.append(
                {
                    "chunk_id": hashlib.sha1(
                        (row.get("url", "") + current).encode()
                    ).hexdigest()[:12],
                    "url": row.get("url"),
                    "source": row.get("source"),
                    "page_type": row.get("page_type"),
                    "title": row.get("title") or row.get("name"),
                    "text": current,
                }
            )
    return chunks
def analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows):
    all_rows = bs4_rows + parsel_rows + playwright_rows
    products = flatten_products(all_rows)
    crawl_df = pd.DataFrame(all_rows)
    product_df = pd.DataFrame(products)
    if not product_df.empty:
        product_df["price"] = pd.to_numeric(product_df["price"], errors="coerce")
        product_df["stock"] = pd.to_numeric(product_df["stock"], errors="coerce")
        product_df["rating"] = pd.to_numeric(product_df["rating"], errors="coerce")
        product_df["inventory_value"] = product_df["price"] * product_df["stock"]

    # Link graph from the recursive BeautifulSoup crawl.
    graph = build_link_graph(base_url, bs4_rows)
    graph_path = OUTPUT_DIR / "site_link_graph.graphml"
    if graph.number_of_nodes() > 0:
        nx.write_graphml(graph, graph_path)

    # One JSON object per line for retrieval pipelines.
    chunks = make_rag_chunks(all_rows)
    rag_path = OUTPUT_DIR / "rag_chunks.jsonl"
    with rag_path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

    crawl_json_path = OUTPUT_DIR / "combined_crawl_results.json"
    crawl_json_path.write_text(
        json.dumps(all_rows, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    product_csv_path = OUTPUT_DIR / "normalized_product_catalog.csv"
    if not product_df.empty:
        product_df.to_csv(product_csv_path, index=False)

    price_plot_path = OUTPUT_DIR / "product_price_chart.png"
    if not product_df.empty and product_df["price"].notna().any():
        plot_df = product_df.dropna(subset=["price"]).copy()
        plot_df["label"] = plot_df["sku"].fillna("unknown") + "\n" + plot_df["source"].fillna("")
        ax = plot_df.plot(
            x="label",
            y="price",
            kind="bar",
            legend=False,
            figsize=(11, 5),
            title="Extracted Product Prices by Source",
        )
        ax.set_xlabel("Product / extraction source")
        ax.set_ylabel("Price")
        plt.xticks(rotation=35, ha="right")
        plt.tight_layout()
        plt.savefig(price_plot_path, dpi=160)

    graph_stats = {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "weakly_connected_components": (
            nx.number_weakly_connected_components(graph)
            if graph.number_of_nodes()
            else 0
        ),
    }
    if graph.number_of_nodes() > 0:
        in_degrees = dict(graph.in_degree())
        out_degrees = dict(graph.out_degree())
        graph_stats["top_in_degree"] = sorted(
            in_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )
        graph_stats["top_out_degree"] = sorted(
            out_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )

    summary = {
        "base_url": base_url,
        "rows_total": len(all_rows),
        "beautifulsoup_rows": len(bs4_rows),
        "parsel_rows": len(parsel_rows),
        "playwright_rows": len(playwright_rows),
        "products_total": len(product_df),
        "rag_chunks_total": len(chunks),
        "graph": graph_stats,
        "outputs": {
            "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
            "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
            "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
            "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
            "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
            "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
            "combined_json": str(crawl_json_path),
            "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
            "rag_jsonl": str(rag_path),
            "graphml": str(graph_path) if graph_path.exists() else None,
            "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
            "screenshots_dir": str(SCREENSHOT_DIR),
        },
    }
    summary_path = OUTPUT_DIR / "run_summary.md"
    summary_path.write_text(
        "# Crawlee Python Advanced Tutorial Run Summary\n\n"
        f"- Local demo site: `{base_url}`\n"
        f"- Total extracted rows: `{summary['rows_total']}`\n"
        f"- BeautifulSoup rows: `{summary['beautifulsoup_rows']}`\n"
        f"- Parsel rows: `{summary['parsel_rows']}`\n"
        f"- Playwright rows: `{summary['playwright_rows']}`\n"
        f"- Normalized products: `{summary['products_total']}`\n"
        f"- RAG chunks: `{summary['rag_chunks_total']}`\n"
        f"- Link graph nodes: `{graph_stats['nodes']}`\n"
        f"- Link graph edges: `{graph_stats['edges']}`\n\n"
        "## Output files\n\n"
        + "\n".join(f"- `{k}`: `{v}`" for k, v in summary["outputs"].items()),
        encoding="utf-8",
    )
    print("\n=== 4) Analysis summary ===")
    print(json.dumps(summary, indent=2, ensure_ascii=False))

    # Rich previews only work inside a notebook, so failures here are non-fatal.
    try:
        from IPython.display import display, Markdown, Image as IPImage
        display(Markdown("## Crawlee crawl preview"))
        if not crawl_df.empty:
            preview_cols = [
                col for col in ["source", "page_type", "title", "url"]
                if col in crawl_df.columns
            ]
            display(crawl_df[preview_cols].head(12))
        display(Markdown("## Normalized product catalog"))
        if not product_df.empty:
            display(product_df.head(20))
        if price_plot_path.exists():
            display(Markdown("## Product price chart"))
            display(IPImage(filename=str(price_plot_path)))
        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        if screenshot_path.exists():
            display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
            display(IPImage(filename=str(screenshot_path)))
        display(Markdown(f"## Output directory\n`{OUTPUT_DIR}`"))
    except Exception as exc:
        print("Notebook display skipped:", repr(exc))
    return summary
async def main():
    httpd, base_url = start_local_server(SITE_DIR)
    print(f"\nLocal demo website is running at: {base_url}/index.html")
    try:
        bs4_rows = await run_beautifulsoup_crawl(base_url)
        parsel_rows = await run_parsel_precision_crawl(base_url)
        playwright_rows = await run_playwright_dynamic_crawl(base_url)
        summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
        return summary
    finally:
        # Always stop the local server, even if a crawl raises.
        httpd.shutdown()
        print("\nLocal demo server shut down.")


loop = asyncio.get_event_loop()
summary = loop.run_until_complete(main())
print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
    if file_path.is_file():
        print(" -", file_path)
We process the extracted crawl data into analysis-ready and AI-ready outputs. We create RAG-style JSONL chunks, combine all crawl results, build a normalized product catalog, generate a GraphML link graph, and visualize product prices with Matplotlib. Finally, we run the full pipeline end-to-end, display previews in the notebook, save all generated artifacts, and print the final output file paths.
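The chunking step is easiest to follow in isolation. The sketch below repeats the same greedy idea on a single string: split on sentence boundaries, pack whole sentences until the next one would exceed the limit, then start a new chunk. `chunk_sentences` and `chunk_id` are hypothetical standalone names; the sample text and the small `max_chars` value are chosen only to make the split visible.

```python
import hashlib
import re


def chunk_sentences(text, max_chars=700):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_chars:
            current = (current + " " + sentence).strip()
        else:
            if current:
                chunks.append(current)
            # A sentence longer than max_chars becomes its own oversized chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def chunk_id(url, text):
    """Short content-derived ID: same URL and text always give the same ID."""
    return hashlib.sha1((url + text).encode()).hexdigest()[:12]


text = "First sentence here. Second one follows! Third asks why? Fourth ends it."
chunks = chunk_sentences(text, max_chars=45)
# chunks == ["First sentence here. Second one follows!",
#            "Third asks why? Fourth ends it."]
```

Two properties follow from this design: chunks never cut a sentence in half, and the limit is soft, since a single sentence longer than `max_chars` is kept whole rather than truncated. Deriving the ID from the URL and the text makes re-runs idempotent, so unchanged content produces unchanged chunk IDs.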
In conclusion, we have a complete Crawlee-based pipeline for crawling and data engineering that converts a small website into structured, reusable datasets. We used crawl scoping, robots.txt handling, concurrency settings, link enqueuing, browser rendering, key-value storage, and dataset exports to simulate patterns used in production web crawling systems. We normalized the extracted product data, saved the crawl outputs as JSON and CSV, created GraphML link graphs with NetworkX, generated JSONL chunks for retrieval-augmented generation workflows, and visualized the extracted product prices with Matplotlib.
The post Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export appeared first on MarkTechPost.