How to Scrape Amazon Without Getting Blocked

Intermediate · 20 minutes

Prerequisites

• Basic Python knowledge (3.10+, because the examples use `str | None` and `list[str]` type hints)
• Familiarity with HTTP requests and HTML parsing
• A proxy provider or residential IP source
• pip installed for package management

Amazon is one of the most heavily protected websites on the internet. Their anti-bot system combines behavioral analysis, device fingerprinting, rate limiting, and CAPTCHA challenges to block automated access. Despite this, Amazon product data is among the most valuable in e-commerce — pricing, reviews, bestseller rankings, and inventory data drive critical business decisions. This guide walks through a practical, step-by-step approach to scraping Amazon reliably in 2026, covering the techniques that actually work and the common mistakes that get you blocked.

Understand Amazon's Anti-Bot Layers

Before writing any code, you need to understand what you are up against. Amazon's detection system operates on multiple layers:

1. **IP reputation**: Datacenter IPs are flagged almost immediately. Residential IPs have a longer lifespan but can still be flagged based on request patterns.
2. **TLS fingerprinting**: Amazon inspects TLS handshake characteristics (JA3/JA4 fingerprints) to identify automated clients. Standard Python HTTP libraries such as `requests` have distinctive TLS fingerprints.
3. **Browser fingerprinting**: JavaScript-based checks examine canvas rendering, WebGL, audio context, and dozens of other signals.
4. **Behavioral analysis**: Request timing, navigation patterns, and mouse/keyboard events are analyzed on rendered pages.
5. **Rate limiting**: Too many requests from the same IP or session trigger progressive challenges (CAPTCHAs, then blocks).

Your scraping strategy needs to address all five layers to maintain reliable access.

Tip: Amazon's detection is adaptive — techniques that work today may stop working in weeks. Build your scraper with modularity so you can swap components (proxy provider, fingerprint strategy, parsing logic) independently.
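
One way to keep those pieces swappable is to pass each one in as a plain callable. The sketch below is illustrative only: `ScraperPipeline`, `pick_proxy`, `fetch`, and `parse` are hypothetical names, not part of any library, and the stub lambdas stand in for real components.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScraperPipeline:
    """Hypothetical container wiring three independently replaceable parts."""
    pick_proxy: Callable[[], Optional[str]]       # proxy strategy
    fetch: Callable[[str, Optional[str]], str]    # HTTP client / fingerprint strategy
    parse: Callable[[str], dict]                  # parsing logic

    def run(self, url: str) -> dict:
        proxy = self.pick_proxy()
        html = self.fetch(url, proxy)
        return self.parse(html)


# Stub components for illustration; replace any one without touching the others.
pipeline = ScraperPipeline(
    pick_proxy=lambda: None,
    fetch=lambda url, proxy: f"<html>{url}</html>",
    parse=lambda html: {"length": len(html)},
)
```

Swapping the proxy provider or the parser then means changing one argument rather than editing the scraping loop.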

Set Up Your Python Environment

Start by creating a clean virtual environment and installing the necessary packages. We will use `curl_cffi` instead of `requests` because it produces realistic TLS fingerprints that match real browsers.

python3 -m venv amazon-scraper
source amazon-scraper/bin/activate
pip install curl_cffi selectolax fake-useragent

Tip: Avoid using the standard `requests` library directly — its TLS fingerprint is well-known to anti-bot systems. `curl_cffi` impersonates real browser TLS stacks.

Configure Realistic Request Headers

Your request headers must be internally consistent and match what a real browser would send. The most common mistake is sending a Chrome user-agent with headers that do not match Chrome's header order or values.

from curl_cffi import requests
from fake_useragent import UserAgent
import random
import time

ua = UserAgent(browsers=['Chrome'], os=['Windows', 'macOS'])

def get_headers():
    user_agent = ua.random
    return {
        'User-Agent': user_agent,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0',
    }

def scrape_amazon(url, proxy=None):
    session = requests.Session(impersonate='chrome')
    proxies = {'https': proxy} if proxy else None
    
    response = session.get(
        url,
        headers=get_headers(),
        proxies=proxies,
        timeout=30
    )
    return response

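Tip: Header order matters. Real Chrome sends headers in a specific order, and some anti-bot systems check this. The `curl_cffi` library with `impersonate='chrome'` handles header ordering automatically. It also supplies default headers, including a User-Agent, that match the impersonated Chrome build, so a randomly chosen User-Agent from `fake_useragent` can contradict the TLS fingerprint. If you see unexplained blocks, try dropping the custom `User-Agent` header and relying on the impersonation defaults.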

Implement Proxy Rotation with Smart Retry Logic

Never scrape Amazon from a single IP. You need a pool of residential or mobile IPs, and your rotation strategy should be smarter than random assignment. Key principles:

• Assign one IP per product category or search term to mimic a real user browsing within a category.
• If an IP receives a CAPTCHA, retire it for at least 30 minutes.
• Use mobile carrier IPs when available, since they have the highest trust scores on Amazon.
• Track success rates per IP and remove consistently poor performers.

import random
from collections import defaultdict
from datetime import datetime, timedelta

class ProxyManager:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.cooldowns: dict[str, datetime] = {}
        self.failures: dict[str, int] = defaultdict(int)
    
    def get_proxy(self) -> str | None:
        available = [
            p for p in self.proxies
            if p not in self.cooldowns 
            or datetime.now() > self.cooldowns[p]
        ]
        if not available:
            return None
        # Weight toward proxies with fewer failures
        weights = [1 / (self.failures[p] + 1) for p in available]
        return random.choices(available, weights=weights, k=1)[0]
    
    def report_failure(self, proxy: str, is_captcha: bool = False):
        self.failures[proxy] += 1
        cooldown = timedelta(minutes=30) if is_captcha else timedelta(minutes=5)
        self.cooldowns[proxy] = datetime.now() + cooldown
    
    def report_success(self, proxy: str):
        self.failures[proxy] = max(0, self.failures[proxy] - 1)
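
The `ProxyManager` above picks a proxy per request, so it does not by itself cover the "one IP per category" principle. A minimal sketch of a sticky assignment follows; `StickyProxyAssigner` is a hypothetical helper, and for simplicity it removes a retired proxy permanently instead of applying a cooldown.

```python
import random
from typing import Dict, List, Optional


class StickyProxyAssigner:
    """Hypothetical helper: keeps one proxy per category until it is retired."""

    def __init__(self, proxies: List[str]):
        self.proxies = list(proxies)
        self.assigned: Dict[str, str] = {}  # category -> proxy

    def proxy_for(self, category: str) -> Optional[str]:
        proxy = self.assigned.get(category)
        if proxy in self.proxies:  # still in the pool: reuse it
            return proxy
        if not self.proxies:
            return None
        # Prefer proxies that are not already serving another category.
        in_use = set(self.assigned.values())
        free = [p for p in self.proxies if p not in in_use] or self.proxies
        proxy = random.choice(free)
        self.assigned[category] = proxy
        return proxy

    def retire(self, proxy: str) -> None:
        """Drop a flagged proxy; its categories get a new one on the next call."""
        if proxy in self.proxies:
            self.proxies.remove(proxy)
        self.assigned = {c: p for c, p in self.assigned.items() if p != proxy}
```

In practice the two ideas can be combined, for example by asking the cooldown-aware manager for a replacement whenever a category's proxy is retired.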

Add Human-Like Request Timing

Uniform request intervals are a dead giveaway. Real users do not request pages at exact 2-second intervals. Implement randomized delays with a realistic distribution.

import random
import time

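def human_delay(min_seconds=1.5, max_seconds=5.0):
    """Sleep for a human-like interval drawn from a log-normal distribution."""
    # The log-normal sample is added to the minimum instead of being
    # clamped up to it. Clamping would turn a large share of samples
    # into identical min_seconds delays, which is itself a pattern.
    # The long tail still produces occasional longer pauses, like a
    # real user who sometimes reads the page longer.
    delay = min_seconds + random.lognormvariate(0.0, 0.5)
    time.sleep(min(delay, max_seconds))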

def scrape_product_list(urls: list[str], proxy_manager: ProxyManager):
    results = []
    for url in urls:
        proxy = proxy_manager.get_proxy()
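        # Wait until a proxy leaves cooldown; falling through with
        # proxy=None would send the request from your own IP.
        while not proxy:
            print('All proxies on cooldown. Waiting...')
            time.sleep(60)
            proxy = proxy_manager.get_proxy()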
        
        try:
            response = scrape_amazon(url, proxy=proxy)
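            if 'captcha' in response.text.lower():
                # No `continue` here, so the delay below still runs
                # before the next request. The URL itself is skipped.
                proxy_manager.report_failure(proxy, is_captcha=True)
            else:
                proxy_manager.report_success(proxy)
                results.append(response.text)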
        except Exception as e:
            proxy_manager.report_failure(proxy)
            print(f'Error: {e}')
        
        human_delay()
    return results

Tip: For high-volume scraping, consider batching requests by category. Scrape all products in one category before moving to the next, with a longer pause between categories. This mirrors real browsing behavior.
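
The batching idea in this tip can be sketched as follows. The function names, the `(category, url)` input shape, and the 30 to 90 second pause range are all assumptions for illustration; `scrape_one` stands for whatever per-URL routine you use (including its own per-page delay).

```python
import random
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def group_by_category(items: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (category, url) pairs so each category is scraped in one run."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for category, url in items:
        batches[category].append(url)
    return dict(batches)


def scrape_in_batches(
    items: Iterable[Tuple[str, str]],
    scrape_one: Callable[[str], None],
    pause_range: Tuple[float, float] = (30.0, 90.0),  # assumed values, tune for your workload
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    batches = group_by_category(items)
    for index, (category, urls) in enumerate(batches.items()):
        if index > 0:
            # Longer pause between categories than between pages.
            sleep(random.uniform(*pause_range))
        for url in urls:
            scrape_one(url)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.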

Parse Product Data Efficiently

Use `selectolax` instead of BeautifulSoup for parsing — it is significantly faster and handles Amazon's HTML well. Focus on extracting structured data from the key page elements.

from selectolax.parser import HTMLParser

def parse_product_page(html: str) -> dict:
    tree = HTMLParser(html)
    
    def text(selector: str) -> str:
        node = tree.css_first(selector)
        return node.text(strip=True) if node else ''
    
    def attr(selector: str, attribute: str) -> str:
        node = tree.css_first(selector)
        # `or ''` also covers attributes that are present but have no value
        return (node.attributes.get(attribute) or '') if node else ''
    
    return {
        'title': text('#productTitle'),
        'price': text('.a-price .a-offscreen'),
        'rating': text('#acrPopover .a-icon-alt'),
        'review_count': text('#acrCustomerReviewText'),
        'availability': text('#availability span'),
        'brand': text('#bylineInfo'),
        'asin': attr('[data-asin]', 'data-asin'),
        'image_url': attr('#landingImage', 'src'),
        'features': [
            li.text(strip=True) 
            for li in tree.css('#feature-bullets li span.a-list-item')
        ],
    }
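
The parser above returns raw strings. A possible follow-up step is to normalize them into numbers; the sketch below assumes US-style formatting and sample strings such as '$1,299.99', '4.5 out of 5 stars', and '1,234 ratings', which are illustrative and may not match every Amazon locale. The function names are hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """'$1,299.99' -> Decimal('1299.99'); assumes comma thousands separators."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return Decimal(match.group().replace(",", "")) if match else None


def parse_rating(text: str) -> Optional[float]:
    """'4.5 out of 5 stars' -> 4.5 (takes the first number in the string)."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def parse_review_count(text: str) -> Optional[int]:
    """'1,234 ratings' -> 1234"""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None
```

Each helper returns `None` for empty or unparseable input, so a missing field stays distinguishable from a zero value.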

Tip: Amazon frequently changes its HTML structure. Build your parser with fallback selectors and log warnings when primary selectors fail to match — this gives you early warning of layout changes.
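
A minimal sketch of that fallback pattern is shown below. `first_match` is a hypothetical helper; `lookup` is any function that takes a selector and returns the matched text or nothing, such as a thin wrapper around the `text()` helper in `parse_product_page`. Any fallback selectors you pass in are your own guesses and need to be verified against live pages.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger("amazon_parser")


def first_match(
    lookup: Callable[[str], Optional[str]],
    selectors: Sequence[str],
    field: str,
) -> str:
    """Try selectors in order; warn when the primary one no longer matches."""
    for position, selector in enumerate(selectors):
        value = lookup(selector)
        if value:
            if position > 0:
                logger.warning(
                    "%s: primary selector %r failed, used fallback %r",
                    field, selectors[0], selector,
                )
            return value
    logger.warning("%s: no selector matched (%s)", field, ", ".join(selectors))
    return ""
```

A rising count of these warnings in your logs is the early signal of a layout change that the tip describes.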

Handle CAPTCHAs and Blocks Gracefully

Even with good infrastructure, you will occasionally encounter CAPTCHAs or soft blocks. Your scraper needs to detect these and respond appropriately rather than continuing to hammer the same endpoint. Key detection patterns:

• **CAPTCHA page**: Contains 'captcha' in the response body or a redirect to `/errors/validateCaptcha`
• **Dog page**: Amazon's generic error page showing a dog photo, which means your session is flagged
• **503 errors**: Temporary rate limiting; back off and retry
• **Robot check redirect**: URL changes to `/gp/product/robot-check`

When detected, retire the current proxy/session and switch to fresh infrastructure. Do not retry the same request from the same IP.

def detect_block(response) -> str | None:
    """Returns block type or None if response is clean."""
    if response.status_code == 503:
        return 'rate_limited'
    
    body = response.text.lower()
    # Check the body text and the final URL (redirect to /errors/validateCaptcha)
    if 'captcha' in body or 'validatecaptcha' in str(response.url).lower():
        return 'captcha'
    if 'sorry, we just need to make sure' in body:
        return 'robot_check'
    if 'api-services-support@amazon.com' in body:
        return 'hard_block'
    if response.url and 'robot-check' in str(response.url):
        return 'robot_check'
    
    return None
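
Detection is only useful if each block type maps to a response. The sketch below shows one possible policy, assuming the block-type strings returned by `detect_block`: rate limiting triggers exponential backoff with jitter, and everything else retires the proxy. `plan_response`, `backoff_seconds`, and the 5-second base and 300-second cap are illustrative assumptions, not values from Amazon.

```python
import random
from typing import Optional, Tuple

# Assumed policy: these block types retire the proxy instead of backing off.
RETIRE_PROXY = {"captcha", "robot_check", "hard_block"}


def backoff_seconds(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter; attempt starts at 0."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def plan_response(block_type: Optional[str], attempt: int) -> Tuple[str, float]:
    """Map a detect_block() result to (action, seconds to wait)."""
    if block_type is None:
        return ("proceed", 0.0)
    if block_type in RETIRE_PROXY:
        return ("switch_proxy", 0.0)
    return ("back_off", backoff_seconds(attempt))  # e.g. 'rate_limited'
```

Jitter keeps several workers that were rate limited at the same moment from retrying in lockstep.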

Consider Real-Device Infrastructure for Production

The techniques above work for moderate-scale scraping, but production workloads hitting Amazon daily will eventually face diminishing returns from proxy rotation alone. Amazon's fingerprinting has become sophisticated enough that the gap between emulated and real browser environments is a primary detection vector.

For production-grade Amazon scraping, consider infrastructure that uses real devices rather than emulated browsers. Services like Archonum route requests through actual smartphones with native fingerprints, which eliminates the TLS mismatch, canvas fingerprint, and WebGL inconsistencies that trip up even well-configured headless browsers. This approach trades the complexity of managing fingerprint consistency for a simpler integration where every request is inherently authentic.

The decision between DIY scraping (as outlined in steps 1-7) and managed real-device infrastructure usually comes down to scale and reliability requirements. If you need data from Amazon for business-critical decisions and can tolerate occasional gaps, the DIY approach works. If you need 99%+ reliability on a daily basis, the engineering overhead of maintaining a DIY Amazon scraper often exceeds the cost of purpose-built infrastructure.

Tip: Track your scraper's success rate weekly. If it drops below 90%, your fingerprint or proxy strategy likely needs updating. A sudden drop usually means Amazon has updated their detection, while a gradual decline suggests your IPs are being burned.
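
The monitoring rule in this tip can be sketched as a small classifier. The 90% threshold comes from the tip; the definition of "sudden" as a fall of 10 or more percentage points in one week is an assumption, and `diagnose` is a hypothetical helper.

```python
from typing import List, Optional

THRESHOLD = 0.90    # from the tip above
SUDDEN_DROP = 0.10  # assumed: a fall of 10+ points in one week counts as sudden


def success_rate(successes: int, total: int) -> float:
    return successes / total if total else 0.0


def diagnose(weekly_rates: List[float]) -> Optional[str]:
    """Classify the latest weekly success rate; rates are oldest-first, 0.0 to 1.0."""
    if not weekly_rates or weekly_rates[-1] >= THRESHOLD:
        return None
    if len(weekly_rates) >= 2 and weekly_rates[-2] - weekly_rates[-1] >= SUDDEN_DROP:
        return "sudden_drop: detection may have changed"
    return "gradual_decline: IPs may be burned"
```

The returned label is only a starting hypothesis; confirming it still means checking which proxies and which block types account for the failures.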

Scraping Amazon reliably in 2026 requires a multi-layered approach: realistic TLS fingerprints, properly configured headers, smart proxy rotation, human-like timing, and robust error handling. The code examples in this guide provide a solid foundation, but remember that Amazon actively evolves their anti-bot systems. What works today may need adjustment in a few months. The most resilient approach is to build your scraper with modular components that can be swapped independently, and to monitor success rates continuously so you can react quickly when detection patterns change. For teams where Amazon data is mission-critical, investing in real-device infrastructure eliminates the cat-and-mouse game entirely.

FAQ

Scraping publicly available Amazon product data is generally treated as lawful in the US: in hiQ v. LinkedIn, the Ninth Circuit held that accessing publicly available pages likely does not violate the Computer Fraud and Abuse Act. That ruling does not resolve contract claims, however, and Amazon's Terms of Service prohibit automated access, so there is a contractual risk. Many e-commerce companies treat Amazon scraping as a standard business practice, but you should consult your own legal counsel for your specific jurisdiction and use case.

There is no fixed limit — it depends entirely on your infrastructure. With a single residential IP, you might manage 200-500 requests per day before encountering CAPTCHAs. With a rotating pool of 100+ residential IPs and proper timing, 10,000-50,000 requests per day is achievable. Real-device infrastructure can sustain even higher volumes with better success rates.
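
As a rough planning aid, the figures above can be turned into a pool-size estimate. `proxies_needed` is a hypothetical helper, and the per-IP budget is the approximate range quoted in this answer, not a measured Amazon limit.

```python
import math


def proxies_needed(target_per_day: int, per_ip_per_day: int) -> int:
    """Smallest pool size that keeps each IP within its daily request budget."""
    if per_ip_per_day <= 0:
        raise ValueError("per_ip_per_day must be positive")
    return math.ceil(target_per_day / per_ip_per_day)


# 50,000 requests/day at 500 per IP needs 100 IPs;
# at a more conservative 200 per IP it needs 250.
pool_optimistic = proxies_needed(50_000, 500)
pool_conservative = proxies_needed(50_000, 200)
```

Sizing the pool against the conservative end of the range leaves headroom for IPs that are on cooldown.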

Residential IPs are necessary but not sufficient. Amazon also checks TLS fingerprints, JavaScript execution environment, header consistency, and behavioral patterns. If you are using a standard Python HTTP library through a residential proxy, the TLS fingerprint alone identifies you as non-browser traffic. Use curl_cffi or a real browser automation tool.

For product pages and search results, well-configured HTTP requests with curl_cffi are faster and more efficient. Use headless Chrome only for pages that require JavaScript rendering to display data (some Amazon pages lazy-load content). Be aware that default Puppeteer and Playwright configurations are easily detected — you need stealth plugins and proper configuration.

Amazon reviews are paginated and require navigating through multiple pages per product. Use the /product-reviews/ URL pattern with pageNumber parameter for direct access. The same anti-bot techniques apply, but review pages tend to be less aggressively protected than search results or product detail pages.
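
A minimal sketch of building those paginated URLs is shown below. Placing the ASIN directly after `/product-reviews/` is an assumption about the URL layout, `review_page_urls` is a hypothetical helper, and `B000000000` in the usage line is a placeholder ASIN.

```python
from typing import List
from urllib.parse import urlencode


def review_page_urls(asin: str, pages: int, domain: str = "www.amazon.com") -> List[str]:
    """Build review-page URLs for pages 1..pages (URL layout is assumed)."""
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({"pageNumber": page})
        urls.append(f"https://{domain}/product-reviews/{asin}?{query}")
    return urls


urls = review_page_urls("B000000000", 3)  # placeholder ASIN
```

The resulting list can be fed to the same fetch and block-detection routines used for product pages.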

Mobile carrier proxies have the highest trust scores on Amazon, followed by residential proxies. ISP proxies (static residential) offer a good balance of speed and trust. Datacenter proxies are blocked almost immediately. For the highest success rates, real-device solutions that use actual smartphone connections provide both the IP trust and fingerprint authenticity that Amazon's detection looks for.
