GC-301f · Module 3

Web Scraping

3 min read

Web scraping tools extract structured data from HTML pages. The tool navigates to a URL, waits for the content to render, then uses CSS selectors or XPath to pull specific elements. Return data as structured objects — not raw HTML. A scraping tool for a product listing page should return an array of { name, price, url, inStock } objects, not a blob of HTML that Gemini must parse. The tool handles the parsing. Gemini handles the reasoning.

Pagination support turns a single-page scraper into a dataset collector. Two patterns work: cursor-based pagination where the tool follows "next page" links until exhausted, and offset-based pagination where the tool accepts a page number parameter. For cursor-based, set a maximum page count — 10 pages is usually sufficient. For offset-based, let Gemini control the pagination by calling the tool repeatedly with incrementing page numbers. Return a hasMore flag so Gemini knows when to stop.

import { Page } from "playwright";

interface Listing {
  title: string;
  price: string;
  link: string;
}

async function scrapeListings(page: Page, url: string, maxPages = 5) {
  const results: Listing[] = [];
  let currentUrl = url;
  let pageCount = 0;

  while (currentUrl && pageCount < maxPages) {
    await page.goto(currentUrl, { waitUntil: "networkidle" });

    // $$eval runs the callback over all matching elements ($eval matches only one)
    const items = await page.$$eval(".listing-card", (cards) =>
      cards.map((card) => ({
        title: card.querySelector("h3")?.textContent?.trim() ?? "",
        price: card.querySelector(".price")?.textContent?.trim() ?? "",
        link: card.querySelector("a")?.getAttribute("href") ?? "",
      }))
    );

    results.push(...items);
    pageCount++;

    // Follow the "next page" link, resolving relative hrefs against the current URL
    const nextBtn = await page.$("a.next-page");
    const nextHref = nextBtn ? await nextBtn.getAttribute("href") : null;
    currentUrl = nextHref ? new URL(nextHref, currentUrl).toString() : "";
  }

  return { items: results, pages: pageCount, total: results.length };
}
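The offset-based pattern can be sketched as a tool that takes a page number and reports hasMore. This is a minimal sketch, not tied to a specific site: `fetchPage`, `PAGE_SIZE`, and the page-exhaustion heuristic are all assumptions standing in for whatever actually loads and parses a page (Playwright, fetch plus an HTML parser, or an API).

```typescript
// Offset-based pagination: Gemini calls this tool repeatedly with an
// incrementing `page` number and stops when `hasMore` comes back false.

interface Listing {
  title: string;
  price: string;
  link: string;
}

interface PageResult {
  items: Listing[];
  page: number;
  hasMore: boolean;
}

// Stand-in for the real page loader/parser
type PageFetcher = (url: string, page: number) => Promise<Listing[]>;

const PAGE_SIZE = 20; // assumed fixed page size for the target site

async function scrapePage(
  fetchPage: PageFetcher,
  url: string,
  page: number
): Promise<PageResult> {
  const items = await fetchPage(url, page);
  return {
    items,
    page,
    // Heuristic: a short page usually means the listing is exhausted
    hasMore: items.length === PAGE_SIZE,
  };
}
```

Because the model drives the loop, the tool itself stays stateless: each call is independent, and the hasMore flag is the only stopping signal the model needs.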

Rate limiting and robots.txt compliance are not optional. Check robots.txt before scraping any domain — the tool should fetch and parse it on the first request to a new domain, then cache the rules. Respect Crawl-delay directives. For sites without explicit rate limits, enforce a minimum 1-second delay between requests. Aggressive scraping gets IP addresses banned, which breaks all browser tools for the rest of the session. Include a descriptive User-Agent header so site operators can contact you if there is an issue.
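The caching and delay logic above can be sketched as a small per-domain scheduler. This is a deliberately simplified sketch: `loadRobots` is a hypothetical callback that fetches the raw robots.txt text, and the parser only reads Disallow and Crawl-delay under "User-agent: *" — a real robots.txt library handles far more of the spec (wildcards, Allow precedence, agent-specific groups).

```typescript
interface RobotsRules {
  disallow: string[];
  crawlDelayMs: number;
}

const DEFAULT_DELAY_MS = 1000; // minimum 1s between requests per domain

// Minimal robots.txt parser: only the "User-agent: *" group, only
// Disallow and Crawl-delay directives.
function parseRobots(text: string): RobotsRules {
  const rules: RobotsRules = { disallow: [], crawlDelayMs: DEFAULT_DELAY_MS };
  let applies = false;
  for (const raw of text.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const m = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
    if (!m) continue;
    const key = m[1].toLowerCase();
    const value = m[2].trim();
    if (key === "user-agent") {
      applies = value === "*";
    } else if (applies && key === "disallow" && value) {
      rules.disallow.push(value);
    } else if (applies && key === "crawl-delay") {
      const secs = parseFloat(value);
      if (!Number.isNaN(secs)) {
        // Crawl-delay overrides our floor, but never goes below it
        rules.crawlDelayMs = Math.max(DEFAULT_DELAY_MS, secs * 1000);
      }
    }
  }
  return rules;
}

function isAllowed(rules: RobotsRules, path: string): boolean {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}

class PoliteScheduler {
  private robotsCache = new Map<string, RobotsRules>();
  private lastRequest = new Map<string, number>();

  constructor(private loadRobots: (origin: string) => Promise<string>) {}

  // Returns false if robots.txt disallows the path; otherwise waits out
  // the per-domain delay and returns true.
  async checkAndWait(url: string): Promise<boolean> {
    const { origin, pathname } = new URL(url);
    let rules = this.robotsCache.get(origin);
    if (!rules) {
      // Fetch once per domain; treat a failed fetch as "no rules"
      rules = parseRobots(await this.loadRobots(origin).catch(() => ""));
      this.robotsCache.set(origin, rules);
    }
    if (!isAllowed(rules, pathname)) return false;
    const last = this.lastRequest.get(origin) ?? 0;
    const wait = last + rules.crawlDelayMs - Date.now();
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    this.lastRequest.set(origin, Date.now());
    return true;
  }
}
```

Wiring this in front of every `page.goto` call keeps the politeness policy in one place instead of scattered across each scraping tool.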