How to Scrape Google Search Results with Python

Intermediate · 25 minutes

Prerequisites

• Python 3.10+ installed
• Basic understanding of HTTP requests and HTML
• Familiarity with CSS selectors
• A residential proxy provider (for production use)

Google Search is one of the most heavily scraped sites on the internet, and for good reason. SERP data powers SEO monitoring, competitor analysis, ad verification, and market research. Google's official APIs either lack the data you need (the Custom Search API returns different results than real SERPs) or cap you at 100 free queries per day. Scraping gives you the actual results real users see, at scale. The catch: Google's anti-bot defenses are among the most sophisticated online. This guide walks through building a reliable Google SERP scraper in Python, from first request to production-ready pipeline.

Understand Google's Anti-Bot Defenses

Google uses a layered detection system that catches most naive scraping attempts within the first few requests:

1. **Rate limiting**: Too many queries from one IP triggers a CAPTCHA or temporary block. The threshold varies by IP reputation but can be as low as 10-20 queries per hour for datacenter IPs.
2. **TLS fingerprinting**: Google inspects the TLS handshake (JA3/JA4 fingerprints) to identify HTTP libraries. Standard Python `requests` is flagged instantly.
3. **Cookie and session tracking**: Google sets tracking cookies on the first request. Subsequent requests without consistent cookies look suspicious. Requests that suddenly arrive without a prior homepage visit also raise flags.
4. **JavaScript challenges**: Some queries (especially commercial intent) trigger JS-based verification that headless browsers with default configs fail.
5. **Behavioral signals**: Google analyzes query patterns. Scraping queries A-Z alphabetically is obviously non-human.

The key insight: Google's detection is cumulative. Any single signal might not trigger a block, but combining a suspicious TLS fingerprint with datacenter IPs and high-frequency requests guarantees a CAPTCHA wall.

Tip: Google treats different query types differently. Informational queries ('what is web scraping') are less protected than commercial queries ('buy running shoes'). Test your scraper on informational queries first.

Set Up Your Python Environment

We will use `curl_cffi` for HTTP requests because it can impersonate real browser TLS fingerprints, and `selectolax` for fast HTML parsing. Avoid `requests` and `BeautifulSoup` — the former has a recognizable TLS fingerprint, and the latter is unnecessarily slow for structured parsing.

python3 -m venv google-scraper
source google-scraper/bin/activate
pip install curl_cffi selectolax

Tip: Pin your dependency versions in a requirements.txt. Anti-bot detection evolves, and you may need to roll back if an update introduces regressions in fingerprint impersonation.
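
One way to produce those pins is to read the versions that are actually installed in the virtual environment. The sketch below uses the standard library's `importlib.metadata`. The helper name `pinned_requirements` is hypothetical, and packages that are not installed are skipped rather than guessed.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages: list[str]) -> list[str]:
    """Return 'name==version' lines for the packages installed in this environment."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed here; leave it out rather than guess a version
            continue
    return lines


if __name__ == "__main__":
    # Print the pins; redirect the output into requirements.txt
    for line in pinned_requirements(["curl_cffi", "selectolax"]):
        print(line)
```

Committing the resulting file alongside the scraper makes it possible to return to a known-good set of versions.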

Make Your First SERP Request

The basic approach: hit Google's search URL with a query parameter, impersonate Chrome's TLS stack, and include realistic headers. The `hl` and `gl` parameters control language and geolocation — always set these explicitly to get consistent results.

from curl_cffi import requests
from urllib.parse import urlencode

def search_google(query: str, page: int = 0, gl: str = "us", hl: str = "en") -> str:
    """Fetch a Google SERP page and return raw HTML."""
    params = {
        "q": query,
        "start": page * 10,
        "gl": gl,
        "hl": hl,
        "num": 10,
    }
    url = f"https://www.google.com/search?{urlencode(params)}"

    session = requests.Session(impersonate="chrome")
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
    }

    response = session.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    return response.text

# Test it
html = search_google("best web scraping tools 2026")
print(f"Got {len(html)} bytes")

Tip: Always include a Referer header pointing to google.com. Requests arriving at a search results page without a referrer look suspicious — real users navigate from the Google homepage or their browser address bar.
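
The `start`, `gl`, and `hl` parameters used in `search_google` can be checked in isolation before any request is sent. The sketch below repeats only the URL-building step with the standard library. `build_serp_url` is a hypothetical helper name, not part of the code above.

```python
from urllib.parse import urlencode


def build_serp_url(query: str, page: int = 0, gl: str = "us", hl: str = "en") -> str:
    """Build a Google search URL; page is zero-based, so page=1 starts at offset 10."""
    params = {
        "q": query,
        "start": page * 10,  # result offset: 0, 10, 20, ...
        "gl": gl,            # country used for results
        "hl": hl,            # interface language
        "num": 10,
    }
    return f"https://www.google.com/search?{urlencode(params)}"


print(build_serp_url("web scraping", page=1, gl="uk"))
# https://www.google.com/search?q=web+scraping&start=10&gl=uk&hl=en&num=10
```

Printing the URL for a few page numbers is a quick way to confirm the offsets before spending requests on them.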

Parse Organic Search Results

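Google's SERP HTML is notoriously messy: deeply nested divs with auto-generated class names that change periodically. The most reliable selectors target structural patterns (an `h3` title, a `cite` element, `data-*` attributes) rather than generated class names. Organic result blocks typically carry a `data-sokoban` or `data-hveid` attribute, and the link, title, and snippet follow a consistent nested structure. The parser below uses the `div.g` wrapper as its starting point and tries several snippet selectors in order, so expect to adjust these when the markup changes.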

from selectolax.parser import HTMLParser
from dataclasses import dataclass, asdict

@dataclass
class SearchResult:
    position: int
    title: str
    url: str
    snippet: str
    displayed_url: str

def parse_serp(html: str) -> list[SearchResult]:
    """Extract organic results from a Google SERP page."""
    tree = HTMLParser(html)
    results = []
    position = 1

    # Google wraps organic results in divs with class 'g' or nested structures
    for block in tree.css("div.g"):
        # Extract the link element
        link_el = block.css_first("a[href]")
        if not link_el:
            continue

        url = link_el.attributes.get("href", "")
        # Skip non-http results (e.g., internal Google links)
        if not url.startswith("http"):
            continue

        # Title is inside an h3 within the link
        title_el = block.css_first("h3")
        title = title_el.text(strip=True) if title_el else ""

        # Snippet text — usually in a div with data-sncf attribute or nested spans
        snippet_el = (
            block.css_first("div[data-sncf]") or
            block.css_first("div.VwiC3b") or
            block.css_first("span.st")
        )
        snippet = snippet_el.text(strip=True) if snippet_el else ""

        # Displayed URL (the green text)
        cite_el = block.css_first("cite")
        displayed_url = cite_el.text(strip=True) if cite_el else ""

        if title and url:
            results.append(SearchResult(
                position=position,
                title=title,
                url=url,
                snippet=snippet,
                displayed_url=displayed_url,
            ))
            position += 1

    return results

# Usage
html = search_google("web scraping python tutorial")
results = parse_serp(html)
for r in results:
    print(f"{r.position}. {r.title} — {r.url}")

Tip: Google's class names change frequently. Build a monitoring system that alerts you when your parser returns zero results for a known query — that is usually a sign of a selector change, not a block.
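
A sketch of such a monitor follows, with the fetch, parse, and block-detection steps passed in as callables so it can be exercised without network access. The function name, the status labels, and the `min_results` threshold are assumptions for illustration.

```python
from typing import Callable, Sequence


def check_canary(
    query: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], Sequence],
    is_blocked: Callable[[str], bool],
    min_results: int = 5,
) -> str:
    """Classify parser health for a query that is known to return results."""
    html = fetch(query)
    if is_blocked(html):
        return "blocked"          # CAPTCHA or block page: an access problem
    if len(parse(html)) < min_results:
        return "selector_drift"   # page loaded, but the parser found too little
    return "ok"
```

With the functions from this guide, a scheduled job could call `check_canary("web scraping python tutorial", search_google, parse_serp, is_blocked)` and raise an alert on anything other than `"ok"`.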

Handle Pagination

Google paginates search results using the `start` parameter (0 for page 1, 10 for page 2, etc.). Most use cases only need the first 3-5 pages — going deeper increases detection risk with diminishing data value. Introduce a delay between pages and check for block signals before proceeding.

import time
import random

def scrape_serp_pages(
    query: str,
    max_pages: int = 3,
    gl: str = "us",
    hl: str = "en",
) -> list[SearchResult]:
    """Scrape multiple pages of Google results for a query."""
    all_results = []

    for page in range(max_pages):
        html = search_google(query, page=page, gl=gl, hl=hl)

        # Check for CAPTCHA or block
        if is_blocked(html):
            print(f"Blocked on page {page + 1} for query: {query}")
            break

        results = parse_serp(html)
        if not results:
            # No results likely means we hit the last page
            break

        # Adjust positions to be absolute (not per-page)
        offset = page * 10
        for r in results:
            r.position += offset

        all_results.extend(results)

        # Human-like delay between pages
        if page < max_pages - 1:
            delay = random.uniform(2.0, 6.0)
            time.sleep(delay)

    return all_results

def is_blocked(html: str) -> bool:
    """Detect if Google is serving a CAPTCHA or block page."""
    signals = [
        # Also matches "Our systems have detected unusual traffic"
        "detected unusual traffic",
        "recaptcha",
        "/sorry/index",
    ]
    html_lower = html.lower()
    return any(signal in html_lower for signal in signals)

Tip: Scraping pages 1-3 covers positions 1-30, which accounts for over 95% of organic clicks. Going beyond page 5 rarely justifies the additional detection risk.
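
One detail worth noting: `search_google` calls `raise_for_status()`, so a 429 surfaces as an exception before `is_blocked` ever sees the body. An alternative is to classify each response from its status code, final URL, and body together. The sketch below assumes block pages arrive either with status 429 or from a URL under `/sorry/`. The function name and labels are hypothetical.

```python
def classify_response(status_code: int, final_url: str, body: str) -> str:
    """Return 'ok', 'captcha', 'rate_limited', or 'error' for a SERP response."""
    text = body.lower()
    captcha_markers = ("detected unusual traffic", "recaptcha")
    if "/sorry/" in final_url or any(m in text for m in captcha_markers):
        return "captcha"
    if status_code == 429:
        return "rate_limited"
    if status_code >= 400:
        return "error"
    return "ok"
```

A pagination loop could stop on `"captcha"` or `"rate_limited"` and only parse when the result is `"ok"`.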

Deal with CAPTCHAs and Rate Limits

When Google detects automated behavior, it responds with a CAPTCHA page (usually reCAPTCHA) or a 429 status code. Your scraper needs a strategy for both prevention and recovery.

Prevention:

• Limit to 20-30 queries per IP per hour
• Randomize query order (do not scrape alphabetically)
• Rotate IPs across a pool of residential proxies
• Maintain session cookies across requests from the same IP

Recovery:

• On CAPTCHA: retire the current IP for 1-2 hours, switch to a fresh session
• On 429: exponential backoff starting at 60 seconds
• On repeated blocks: your fingerprint or proxy quality may be the issue

from collections import defaultdict
from datetime import datetime, timedelta

class SERPScraper:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.cooldowns: dict[str, datetime] = {}
        self.query_count: dict[str, int] = defaultdict(int)

    def get_available_proxy(self) -> str | None:
        now = datetime.now()
        available = [
            p for p in self.proxies
            if p not in self.cooldowns or now > self.cooldowns[p]
        ]
        if not available:
            return None
        # Prefer proxies with fewer recent queries
        return min(available, key=lambda p: self.query_count[p])

    def retire_proxy(self, proxy: str, duration_minutes: int = 90):
        self.cooldowns[proxy] = datetime.now() + timedelta(minutes=duration_minutes)

    def search(self, query: str, proxy: str | None = None) -> list[SearchResult]:
        if proxy is None:
            proxy = self.get_available_proxy()
        if proxy is None:
            raise RuntimeError("No proxies available — all on cooldown")

        session = requests.Session(impersonate="chrome")
        proxies_dict = {"https": proxy, "http": proxy}

        params = urlencode({"q": query, "gl": "us", "hl": "en", "num": 10})
        url = f"https://www.google.com/search?{params}"

        try:
            response = session.get(
                url,
                proxies=proxies_dict,
                headers={"Accept-Language": "en-US,en;q=0.9"},
                timeout=15,
            )
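            # A 429 may arrive without CAPTCHA markup in the body,
            # so check the status code as well as the page content
            if response.status_code == 429 or is_blocked(response.text):
                self.retire_proxy(proxy)
                return []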

            self.query_count[proxy] += 1
            return parse_serp(response.text)

        except Exception as e:
            print(f"Request failed: {e}")
            self.retire_proxy(proxy, duration_minutes=30)
            return []
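
Note that `query_count` in the class above only ever grows, so on its own it cannot enforce the per-hour budget from the prevention list. A sliding-window counter is one way to do that. This is a sketch with a hypothetical `HourlyBudget` class and an injectable clock for testing. The default of 20 queries per hour is taken from the lower end of the guideline above.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class HourlyBudget:
    """Track request timestamps per proxy and enforce a rolling per-window limit."""

    def __init__(self, limit: int = 20, window: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.history: dict[str, deque[float]] = defaultdict(deque)

    def _prune(self, proxy: str) -> None:
        # Drop timestamps that have aged out of the window
        cutoff = self.clock() - self.window
        stamps = self.history[proxy]
        while stamps and stamps[0] <= cutoff:
            stamps.popleft()

    def allow(self, proxy: str) -> bool:
        """Record a request and return True if the proxy is still under its limit."""
        self._prune(proxy)
        if len(self.history[proxy]) >= self.limit:
            return False
        self.history[proxy].append(self.clock())
        return True
```

Because `allow` records the request when it returns `True`, it should be called once, right before a request is sent through that proxy. A proxy that returns `False` is skipped until older timestamps expire.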

Tip: If you are consistently hitting CAPTCHAs with fewer than 20 queries per IP, the problem is likely your TLS fingerprint or header configuration, not your rate. Fix the fingerprint before adding more proxies.

Structure and Export Your Data

Raw SERP data is only useful when it is structured for analysis. Export your results in a format that supports downstream workflows — CSV for spreadsheets, JSON for APIs, or directly into a database for time-series tracking.

import json
import csv
from dataclasses import asdict
from datetime import date

def export_to_json(results: list[SearchResult], query: str, filepath: str):
    data = {
        "query": query,
        "scraped_at": date.today().isoformat(),
        "result_count": len(results),
        "results": [asdict(r) for r in results],
    }
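    # Titles and snippets often contain non-ASCII text, so write UTF-8 explicitly
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)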

def export_to_csv(results: list[SearchResult], query: str, filepath: str):
    with open(filepath, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "position", "title", "url", "snippet", "displayed_url"])
        for r in results:
            writer.writerow([query, r.position, r.title, r.url, r.snippet, r.displayed_url])

# Example: daily SERP tracking
queries = [
    "best project management software",
    "web scraping tools comparison",
    "python data extraction tutorial",
]

scraper = SERPScraper(proxies=["http://user:pass@proxy1:8080", "http://user:pass@proxy2:8080"])

for query in queries:
    results = scraper.search(query)
    if results:
        safe_name = query.replace(" ", "_")[:50]
        export_to_json(results, query, f"serp_{safe_name}.json")
        print(f"Saved {len(results)} results for: {query}")
    time.sleep(random.uniform(5, 15))

Scale with Real-Device Infrastructure

The DIY approach above handles moderate volumes — a few hundred queries per day for SEO monitoring or competitor tracking. At higher volumes (thousands of daily queries across multiple geolocations), maintaining your own proxy pool, fingerprint rotation, and CAPTCHA recovery becomes a full-time engineering problem. The fundamental challenge is that Google's detection has moved beyond IP and TLS fingerprinting into holistic device profiling. They correlate dozens of signals — screen resolution, GPU renderer, installed fonts, audio fingerprint, WebGL hash — to determine whether a request comes from a real device. Spoofing all of these consistently across thousands of sessions is fragile engineering. Real-device infrastructure solves this by routing requests through actual smartphones with native browser stacks. Every request inherently has a consistent, authentic fingerprint because it is coming from a real device — there is nothing to spoof. Services like Archonum maintain a fleet of dedicated, factory-reset smartphones that produce native fingerprints, eliminating the cat-and-mouse game of fingerprint emulation. For SERP monitoring at scale, this translates to higher success rates and significantly less maintenance.

Tip: Calculate your true cost of the DIY approach: developer time spent on maintenance, the cost of failed requests (missed data), plus proxy costs. At scale, the engineering overhead often exceeds the cost of managed real-device infrastructure.
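
That comparison is simple arithmetic once the inputs are written down. In the sketch below, every number is a placeholder to replace with your own figures. None of them are measured values or vendor prices, and the function name is hypothetical.

```python
def monthly_diy_cost(
    maintenance_hours: float,
    hourly_rate: float,
    proxy_cost: float,
    queries: int,
    failure_rate: float,
    cost_per_missed_query: float,
) -> float:
    """Monthly total: developer time + proxies + value of data lost to failed requests."""
    developer = maintenance_hours * hourly_rate
    missed = queries * failure_rate * cost_per_missed_query
    return developer + proxy_cost + missed


# Placeholder inputs: 20 hours of maintenance at 80 per hour, 300 in proxy fees,
# 100,000 queries with a 5% failure rate, each missed query valued at 0.01
total = monthly_diy_cost(20, 80.0, 300.0, 100_000, 0.05, 0.01)
print(f"{total:.2f}")
# 1950.00
```

Comparing this figure against a managed service's quote for the same query volume gives a like-for-like basis for the decision.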

Scraping Google search results reliably in 2026 requires more than just making HTTP requests — you need realistic TLS fingerprints, careful rate limiting, robust CAPTCHA detection, and structured data export. The Python code in this guide gives you a working foundation for moderate-scale SERP monitoring. For production workloads where data completeness matters, the choice between maintaining a DIY scraper and using real-device infrastructure comes down to scale and reliability requirements. Start with the DIY approach to understand the problem space, then evaluate managed solutions when the maintenance burden starts outweighing the cost.

FAQ

Is it legal to scrape Google search results?

Scraping publicly visible Google search results is generally considered legal for informational purposes. However, Google's Terms of Service prohibit automated queries. In practice, SERP scraping is a standard industry practice used by SEO tools, market researchers, and ad verification companies. Consult your legal team for jurisdiction-specific guidance.

Why not use Google's Custom Search API?

The Custom Search API has three major limitations: it caps you at 100 free queries per day (10,000 with billing enabled), it returns different results than real SERPs, and it lacks many SERP features like featured snippets, People Also Ask, and local pack data. For SEO monitoring, you need the actual results users see.

How many queries can I scrape per day?

It depends on your infrastructure. A single residential IP can handle 20-30 queries per hour before CAPTCHA risk increases. With 10 rotating residential proxies, 500-1,000 queries per day is realistic. Real-device infrastructure can scale to tens of thousands of daily queries with 99%+ success rates.

Can I scrape results for a specific country?

Yes. Use the 'gl' parameter to set the geographic context (e.g., gl=uk for United Kingdom). However, for accurate local results, you also need an IP address from that country. Residential proxies geolocated to your target market, or real devices physically located there, produce the most accurate localized SERPs.

Why am I still getting blocked when using residential proxies?

Residential IPs solve the IP reputation problem but not the fingerprint problem. If your TLS fingerprint identifies you as a Python script (rather than a browser), Google will flag the request regardless of IP quality. Use curl_cffi with browser impersonation, or switch to real-device infrastructure where the fingerprint is inherently authentic.

How do I scrape featured snippets and People Also Ask boxes?

These SERP features use different HTML structures than organic results and require separate parsers. Featured snippets are typically inside a div with class 'xpdopen' or similar, while People Also Ask boxes use expandable divs with 'related-question-pair' attributes. Add dedicated parsing functions for each feature type you need.
