GC-301f · Module 3
Web Scraping
3 min read
Web scraping tools extract structured data from HTML pages. The tool navigates to a URL, waits for the content to render, then uses CSS selectors or XPath to pull specific elements. Return data as structured objects — not raw HTML. A scraping tool for a product listing page should return an array of { name, price, url, inStock } objects, not a blob of HTML that Gemini must parse. The tool handles the parsing. Gemini handles the reasoning.
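As a sketch of that parsing boundary, a hypothetical helper might convert raw text pulled from the page into the structured shape above. The field names and the currency/stock formats are assumptions about the target site, not part of any fixed API:

```typescript
type Product = { name: string; price: number; url: string; inStock: boolean };

// Hypothetical helper: the scraper extracts raw strings from the DOM,
// and this converts them into typed values Gemini can reason over.
function toProduct(
  name: string,
  priceText: string,
  url: string,
  stockText: string
): Product {
  return {
    name: name.trim(),
    // Strip currency symbols and thousands separators before parsing.
    price: Number(priceText.replace(/[^0-9.]/g, "")),
    url,
    inStock: /in stock/i.test(stockText),
  };
}
```

The tool returns an array of these objects; Gemini never sees the markup they came from.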
Pagination support turns a single-page scraper into a dataset collector. Two patterns work: cursor-based pagination where the tool follows "next page" links until exhausted, and offset-based pagination where the tool accepts a page number parameter. For cursor-based, set a maximum page count — 10 pages is usually sufficient. For offset-based, let Gemini control the pagination by calling the tool repeatedly with incrementing page numbers. Return a hasMore flag so Gemini knows when to stop.
// Assumes a Playwright Page; pass it in rather than relying on a global.
import type { Page } from "playwright";

type Listing = { title: string; price: string; link: string };

async function scrapeListings(page: Page, url: string, maxPages = 5) {
  const results: Listing[] = [];
  let currentUrl = url;
  let pageCount = 0;
  while (currentUrl && pageCount < maxPages) {
    await page.goto(currentUrl, { waitUntil: "networkidle" });
    // $$eval runs in the page against every .listing-card;
    // $eval would return only the first match.
    const items = await page.$$eval(".listing-card", (cards) =>
      cards.map((card) => ({
        title: card.querySelector("h3")?.textContent?.trim() ?? "",
        price: card.querySelector(".price")?.textContent?.trim() ?? "",
        link: card.querySelector("a")?.getAttribute("href") ?? "",
      }))
    );
    results.push(...items);
    pageCount++;
    // Follow the "next page" link, resolving relative hrefs
    // against the current URL so page.goto() gets an absolute URL.
    const nextBtn = await page.$("a.next-page");
    const nextHref = nextBtn ? (await nextBtn.getAttribute("href")) ?? "" : "";
    currentUrl = nextHref ? new URL(nextHref, currentUrl).toString() : "";
  }
  return { items: results, pages: pageCount, total: results.length };
}
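The offset-based variant can be sketched the same way. Here the envelope-building logic is pulled into a pure function; `PAGE_SIZE`, the `Listing` shape, and the `?page=` query parameter are assumptions about the target site:

```typescript
type Listing = { title: string; price: string };
const PAGE_SIZE = 20;

// Pure envelope builder: a short (or empty) page signals the end of the
// dataset, so hasMore tells Gemini whether to call the tool again.
function toPageResult(items: Listing[], pageNum: number) {
  return { items, page: pageNum, hasMore: items.length === PAGE_SIZE };
}

// Sketch of the tool handler around it (assumes a Playwright-style page):
//   await page.goto(`${baseUrl}?page=${pageNum}`, { waitUntil: "networkidle" });
//   const items = await page.$$eval(".listing-card", ...);
//   return toPageResult(items, pageNum);
```

Because Gemini drives the loop here, the tool itself stays stateless: each call scrapes exactly one page.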
Rate limiting and robots.txt compliance are not optional. Check robots.txt before scraping any domain — the tool should fetch and parse it at first request to a new domain, then cache the rules. Respect Crawl-delay directives. For sites without explicit rate limits, enforce a minimum 1-second delay between requests. Aggressive scraping gets IP addresses banned, which breaks all browser tools for the session. Include a descriptive User-Agent header so site operators can contact you if there is an issue.
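A minimal robots.txt parser might look like the sketch below. It handles only the directives this module mentions (User-agent groups, Disallow, Crawl-delay) rather than the full specification, and it floors the delay at the 1-second minimum described above; fetching the file and caching rules per domain are left to the tool:

```typescript
type RobotsRules = { disallow: string[]; crawlDelayMs: number };

// Parse robots.txt text, keeping rules from groups that apply to `agent`
// (or to "*"). Crawl-delay is given in seconds; convert to milliseconds
// and never go below the 1-second default floor.
function parseRobots(txt: string, agent = "*"): RobotsRules {
  const rules: RobotsRules = { disallow: [], crawlDelayMs: 1000 };
  let applies = false;
  for (const raw of txt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    switch (key.trim().toLowerCase()) {
      case "user-agent":
        applies = value === "*" || value.toLowerCase() === agent.toLowerCase();
        break;
      case "disallow":
        if (applies && value) rules.disallow.push(value);
        break;
      case "crawl-delay":
        if (applies && value) {
          rules.crawlDelayMs = Math.max(1000, Number(value) * 1000);
        }
        break;
    }
  }
  return rules;
}
```

Before each navigation, the tool checks the target path against `disallow` and sleeps for `crawlDelayMs` since the last request to that domain.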