How to Scrape Google Search Results with Python
Prerequisites
- Python 3.10+ installed
- Basic understanding of HTTP requests and HTML
- Familiarity with CSS selectors
- A residential proxy provider (for production use)
Understand Google's Anti-Bot Defenses
Tip: Google treats different query types differently. Informational queries ('what is web scraping') are less protected than commercial queries ('buy running shoes'). Test your scraper on informational queries first.
Set Up Your Python Environment
python3 -m venv google-scraper
source google-scraper/bin/activate
pip install curl_cffi selectolax fake-useragent
Tip: Pin your dependency versions in a requirements.txt. Anti-bot detection evolves, and you may need to roll back if an update introduces regressions in fingerprint impersonation.
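One way to produce the pinned lines is to read the installed versions from package metadata. The following is a minimal sketch using the standard library's importlib.metadata; the helper name format_pins and its injectable version_lookup parameter are assumptions made for illustration and are not part of any library used in this guide. Running pip freeze gives a similar result from the command line.

```python
from importlib import metadata
from typing import Callable, Iterable


def format_pins(
    packages: Iterable[str],
    version_lookup: Callable[[str], str] = metadata.version,
) -> list[str]:
    """Return 'name==version' lines for the given packages (hypothetical helper)."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version_lookup(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment: leave a visible marker, do not guess.
            lines.append(f"# {name} is not installed")
    return lines


# Print lines suitable for pasting into requirements.txt
print("\n".join(format_pins(["curl_cffi", "selectolax", "fake-useragent"])))
```

Committing the resulting file alongside the scraper makes it possible to reinstall a known-good set of versions with pip install -r requirements.txt.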
Make Your First SERP Request
from curl_cffi import requests
from urllib.parse import urlencode
def search_google(query: str, page: int = 0, gl: str = "us", hl: str = "en") -> str:
    """Fetch a Google SERP page and return raw HTML."""
    params = {
        "q": query,
        "start": page * 10,
        "gl": gl,
        "hl": hl,
        "num": 10,
    }
    url = f"https://www.google.com/search?{urlencode(params)}"
    session = requests.Session(impersonate="chrome")
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
    }
    response = session.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    return response.text
# Test it
html = search_google("best web scraping tools 2026")
print(f"Got {len(html)} characters")
Tip: Always include a Referer header pointing to google.com. A request that arrives at a search results page without a referrer looks suspicious, because real users typically reach results from the Google homepage or their browser's address bar.
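To see how the query parameters translate into a request URL, the URL-building step can be isolated and run without touching the network. This is a small sketch that mirrors the parameters used in search_google; the helper name build_serp_url is hypothetical.

```python
from urllib.parse import urlencode


def build_serp_url(query: str, page: int = 0, gl: str = "us", hl: str = "en") -> str:
    """Build the search URL for a zero-based results page (hypothetical helper)."""
    params = {
        "q": query,
        "start": page * 10,  # page 0 -> start=0, page 1 -> start=10, and so on
        "gl": gl,            # country context
        "hl": hl,            # interface language
        "num": 10,
    }
    return f"https://www.google.com/search?{urlencode(params)}"


print(build_serp_url("web scraping", page=1))
# https://www.google.com/search?q=web+scraping&start=10&gl=us&hl=en&num=10
```

Note that urlencode encodes the space in the query as a plus sign, so queries never need to be escaped by hand.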
Parse Organic Search Results
from selectolax.parser import HTMLParser
from dataclasses import dataclass, asdict
@dataclass
class SearchResult:
    position: int
    title: str
    url: str
    snippet: str
    displayed_url: str

def parse_serp(html: str) -> list[SearchResult]:
    """Extract organic results from a Google SERP page."""
    tree = HTMLParser(html)
    results = []
    position = 1
    # Google wraps organic results in divs with class 'g' or nested structures
    for block in tree.css("div.g"):
        # Extract the link element
        link_el = block.css_first("a[href]")
        if not link_el:
            continue
        # The attribute value can be None for an empty href, so fall back to ""
        url = link_el.attributes.get("href") or ""
        # Skip non-http results (e.g., internal Google links)
        if not url.startswith("http"):
            continue
        # Title is inside an h3 within the link
        title_el = block.css_first("h3")
        title = title_el.text(strip=True) if title_el else ""
        # Snippet text: usually in a div with data-sncf attribute or nested spans
        snippet_el = (
            block.css_first("div[data-sncf]") or
            block.css_first("div.VwiC3b") or
            block.css_first("span.st")
        )
        snippet = snippet_el.text(strip=True) if snippet_el else ""
        # Displayed URL (the green text)
        cite_el = block.css_first("cite")
        displayed_url = cite_el.text(strip=True) if cite_el else ""
        if title and url:
            results.append(SearchResult(
                position=position,
                title=title,
                url=url,
                snippet=snippet,
                displayed_url=displayed_url,
            ))
            position += 1
    return results

# Usage
html = search_google("web scraping python tutorial")
results = parse_serp(html)
for r in results:
    print(f"{r.position}. {r.title} - {r.url}")
Tip: Google's class names change frequently. Build a monitoring system that alerts you when your parser returns zero results for a known query. Zero results usually signal a selector change, not a block.
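The monitoring idea in the tip can be reduced to a small classification step that runs against a query known to return results. The sketch below takes the parser and the block detector as arguments so it stays self-contained; the function name check_parser_health and its three status strings are assumptions chosen for illustration.

```python
from typing import Callable, Sequence


def check_parser_health(
    html: str,
    parse: Callable[[str], Sequence[object]],
    is_blocked: Callable[[str], bool],
    min_results: int = 1,
) -> str:
    """Classify a fetch of a known-good query as 'ok', 'blocked', or 'selectors_stale'."""
    if is_blocked(html):
        # A CAPTCHA page also yields zero results, so rule it out first.
        return "blocked"
    if len(parse(html)) < min_results:
        # Page looks normal but nothing was extracted: selectors probably changed.
        return "selectors_stale"
    return "ok"
```

In a scheduled job, this could be called with parse_serp and the is_blocked function from the pagination step, raising an alert whenever the status is 'selectors_stale'.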
Handle Pagination
import time
import random
def scrape_serp_pages(
    query: str,
    max_pages: int = 3,
    gl: str = "us",
    hl: str = "en",
) -> list[SearchResult]:
    """Scrape multiple pages of Google results for a query."""
    all_results = []
    for page in range(max_pages):
        try:
            html = search_google(query, page=page, gl=gl, hl=hl)
        except Exception as exc:
            # search_google() calls raise_for_status(), so a block served with an
            # HTTP error status surfaces here rather than in is_blocked()
            print(f"Request failed on page {page + 1} for query: {query} ({exc})")
            break
        # Check for CAPTCHA or block
        if is_blocked(html):
            print(f"Blocked on page {page + 1} for query: {query}")
            break
        results = parse_serp(html)
        if not results:
            # No results likely means we hit the last page
            break
        # Adjust positions to be absolute (not per-page)
        offset = page * 10
        for r in results:
            r.position += offset
        all_results.extend(results)
        # Human-like delay between pages
        if page < max_pages - 1:
            delay = random.uniform(2.0, 6.0)
            time.sleep(delay)
    return all_results

def is_blocked(html: str) -> bool:
    """Detect if Google is serving a CAPTCHA or block page."""
    signals = [
        "detected unusual traffic",
        "recaptcha",
        "/sorry/index",
        "systems have detected unusual traffic",
    ]
    html_lower = html.lower()
    return any(signal in html_lower for signal in signals)
Tip: Scraping pages 1-3 covers positions 1-30, which account for over 95% of organic clicks. Going beyond page 5 rarely justifies the additional detection risk.
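Because is_blocked only inspects the response body, a block that arrives as an HTTP error status can be missed when the caller raises on error statuses first. A possible variant checks the status code as well. This is a sketch under assumptions: the status codes 429 and 503 are the ones commonly associated with rate limiting, not a documented list, and the name looks_blocked is hypothetical.

```python
# Assumption: statuses commonly seen when a server rate-limits or rejects automated traffic
BLOCK_STATUS_CODES = {429, 503}
# Body markers taken from the is_blocked() signals in this guide
BLOCK_MARKERS = ("unusual traffic", "/sorry/index", "recaptcha")


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True if either the status code or the body suggests a block page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    body = html.lower()
    return any(marker in body for marker in BLOCK_MARKERS)
```

To use it, the caller would need access to the response object (response.status_code and response.text) before any raise_for_status() call.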
Deal with CAPTCHAs and Rate Limits
from collections import defaultdict
from datetime import datetime, timedelta
class SERPScraper:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.cooldowns: dict[str, datetime] = {}
        self.query_count: dict[str, int] = defaultdict(int)

    def get_available_proxy(self) -> str | None:
        now = datetime.now()
        available = [
            p for p in self.proxies
            if p not in self.cooldowns or now > self.cooldowns[p]
        ]
        if not available:
            return None
        # Prefer proxies with fewer recent queries
        return min(available, key=lambda p: self.query_count[p])

    def retire_proxy(self, proxy: str, duration_minutes: int = 90):
        self.cooldowns[proxy] = datetime.now() + timedelta(minutes=duration_minutes)

    def search(self, query: str, proxy: str | None = None) -> list[SearchResult]:
        if proxy is None:
            proxy = self.get_available_proxy()
        if proxy is None:
            raise RuntimeError("No proxies available: all on cooldown")
        session = requests.Session(impersonate="chrome")
        proxies_dict = {"https": proxy, "http": proxy}
        params = urlencode({"q": query, "gl": "us", "hl": "en", "num": 10})
        url = f"https://www.google.com/search?{params}"
        try:
            response = session.get(
                url,
                proxies=proxies_dict,
                headers={"Accept-Language": "en-US,en;q=0.9"},
                timeout=15,
            )
            if is_blocked(response.text):
                self.retire_proxy(proxy)
                return []
            self.query_count[proxy] += 1
            return parse_serp(response.text)
        except Exception as e:
            print(f"Request failed: {e}")
            self.retire_proxy(proxy, duration_minutes=30)
            return []
Tip: If you are consistently hitting CAPTCHAs with fewer than 20 queries per IP, the problem is likely your TLS fingerprint or header configuration, not your rate. Fix the fingerprint before adding more proxies.
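The cooldown logic above reacts after a block has happened. A complementary approach is to cap how many queries each proxy sends per hour so the threshold is rarely reached. The sketch below is a sliding-window counter; the class name HourlyBudget is hypothetical, the default of 20 comes from the per-IP figure mentioned in this guide, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class HourlyBudget:
    """Sliding-window cap on requests per proxy (hypothetical helper)."""

    def __init__(self, max_per_hour: int = 20, clock: Callable[[], float] = time.monotonic):
        self.max_per_hour = max_per_hour
        self.clock = clock
        self._sent: dict[str, deque[float]] = defaultdict(deque)

    def try_acquire(self, proxy: str) -> bool:
        """Record a request and return True if the proxy is under its cap, else False."""
        now = self.clock()
        window = self._sent[proxy]
        # Drop timestamps that are an hour old or older
        while window and now - window[0] >= 3600:
            window.popleft()
        if len(window) >= self.max_per_hour:
            return False
        window.append(now)
        return True
```

A scraper could call try_acquire(proxy) before each request and skip to another proxy when it returns False, keeping retire_proxy for actual blocks.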
Structure and Export Your Data
import json
import csv
from dataclasses import asdict
from datetime import date
def export_to_json(results: list[SearchResult], query: str, filepath: str):
    data = {
        "query": query,
        "scraped_at": date.today().isoformat(),
        "result_count": len(results),
        "results": [asdict(r) for r in results],
    }
    # Write UTF-8 explicitly: titles and snippets often contain non-ASCII text
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

def export_to_csv(results: list[SearchResult], query: str, filepath: str):
    with open(filepath, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "position", "title", "url", "snippet", "displayed_url"])
        for r in results:
            writer.writerow([query, r.position, r.title, r.url, r.snippet, r.displayed_url])

# Example: daily SERP tracking
queries = [
    "best project management software",
    "web scraping tools comparison",
    "python data extraction tutorial",
]
scraper = SERPScraper(proxies=["http://user:pass@proxy1:8080", "http://user:pass@proxy2:8080"])
for query in queries:
    results = scraper.search(query)
    if results:
        safe_name = query.replace(" ", "_")[:50]
        export_to_json(results, query, f"serp_{safe_name}.json")
        print(f"Saved {len(results)} results for: {query}")
    time.sleep(random.uniform(5, 15))
Scale with Real-Device Infrastructure
Tip: Calculate your true cost of the DIY approach: developer time spent on maintenance, the cost of failed requests (missed data), plus proxy costs. At scale, the engineering overhead often exceeds the cost of managed real-device infrastructure.
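The cost comparison in the tip is simple arithmetic once the inputs are written down. The sketch below adds up the three components named above for one month. Every number in the example is a placeholder chosen for illustration, not a measurement, and the names DiyCostInputs and diy_monthly_cost are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DiyCostInputs:
    maintenance_hours: float       # developer hours spent on upkeep per month
    hourly_rate: float             # cost of one developer hour
    proxy_cost: float              # monthly proxy bill
    failed_requests: int           # requests that returned no usable data
    cost_per_missed_result: float  # estimated business cost of one missing data point


def diy_monthly_cost(c: DiyCostInputs) -> float:
    """Sum developer time, proxy spend, and the cost of missed data for one month."""
    return (
        c.maintenance_hours * c.hourly_rate
        + c.proxy_cost
        + c.failed_requests * c.cost_per_missed_result
    )


# Placeholder inputs: 10 h at 80/h, 300 in proxies, 400 failures at 0.25 each
example = DiyCostInputs(10, 80, 300, 400, 0.25)
print(diy_monthly_cost(example))  # 1200.0
```

Replacing the placeholders with your own figures gives a number that can be compared directly against a managed provider's quote.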
FAQ
Scraping publicly visible Google search results is generally considered legal for informational purposes. However, Google's Terms of Service prohibit automated queries. In practice, SERP scraping is a standard industry practice used by SEO tools, market researchers, and ad verification companies. Consult your legal team for jurisdiction-specific guidance.
The Custom Search API has three major limitations: it caps you at 100 free queries per day (10,000 with billing enabled), it returns different results than real SERPs, and it lacks many SERP features like featured snippets, People Also Ask, and local pack data. For SEO monitoring, you need the actual results users see.
The number of queries you can sustain per day depends on your infrastructure. A single residential IP can handle 20-30 queries per hour before CAPTCHA risk increases. With 10 rotating residential proxies, 500-1,000 queries per day is realistic. Real-device infrastructure can scale to tens of thousands of daily queries with 99%+ success rates.
Localized results are possible: use the 'gl' parameter to set the geographic context (e.g., gl=uk for United Kingdom). However, for accurate local results, you also need an IP address from that country. Residential proxies geolocated to your target market, or real devices physically located there, produce the most accurate localized SERPs.
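Keeping the 'gl' value and the proxy's exit country in sync is easier when both are chosen in one place. The following sketch assumes proxies are grouped by country code in a plain dictionary; the function name localized_request, the dictionary layout, and the proxy URL in the example are all hypothetical.

```python
import random
from urllib.parse import urlencode


def localized_request(
    query: str,
    country: str,
    language: str,
    proxies_by_country: dict[str, list[str]],
) -> tuple[str, str]:
    """Return (url, proxy) so that gl and the proxy's country always match."""
    pool = proxies_by_country.get(country)
    if not pool:
        # Refuse to fall back to a foreign IP, which would skew local results
        raise LookupError(f"No proxy available for country {country!r}")
    params = urlencode({"q": query, "gl": country, "hl": language})
    return f"https://www.google.com/search?{params}", random.choice(pool)


url, proxy = localized_request("pizza", "uk", "en", {"uk": ["http://proxy-uk:8080"]})
print(url)  # https://www.google.com/search?q=pizza&gl=uk&hl=en
```

Raising an error when no matching proxy exists is a deliberate choice here: a silent fallback would return results that look valid but reflect the wrong market.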
Residential IPs solve the IP reputation problem but not the fingerprint problem. If your TLS fingerprint identifies you as a Python script (rather than a browser), Google will flag the request regardless of IP quality. Use curl_cffi with browser impersonation, or switch to real-device infrastructure where the fingerprint is inherently authentic.
SERP features such as featured snippets and People Also Ask boxes use different HTML structures than organic results and require separate parsers. Featured snippets are typically inside a div with class 'xpdopen' or similar, while People Also Ask boxes use expandable divs with 'related-question-pair' attributes. Add dedicated parsing functions for each feature type you need.
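A dedicated parser for one feature type might look like the sketch below, which collects the text of People Also Ask blocks. It uses the standard library's html.parser rather than selectolax so it runs without extra dependencies, and it assumes 'related-question-pair' appears as a class name on the wrapping div. That is an assumption about Google's markup that may not hold and should be verified against a live page. The class name PeopleAlsoAskParser is hypothetical.

```python
from html.parser import HTMLParser


class PeopleAlsoAskParser(HTMLParser):
    """Collect the text of every div whose class list contains the marker class."""

    def __init__(self, marker: str = "related-question-pair"):
        super().__init__()
        self.marker = marker
        self.questions: list[str] = []
        self._depth = 0   # div nesting depth inside the current matching block
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1
        elif self.marker in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and store the block's text
                text = " ".join(" ".join(self._chunks).split())
                if text:
                    self.questions.append(text)

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


parser = PeopleAlsoAskParser()
parser.feed('<div class="related-question-pair"><div><span>What is web scraping?</span></div></div>')
print(parser.questions)  # ['What is web scraping?']
```

The same pattern, with a different marker class, could serve as a starting point for featured snippets.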
Get Reliable SERP Data Without the Engineering Overhead
Archonum routes your Google queries through real smartphones with native fingerprints. No TLS spoofing, no CAPTCHA walls, 99.9% success rates across every geolocation.