How to Scrape Amazon Without Getting Blocked
Prerequisites
- Basic Python knowledge (3.10+, since the examples use `str | None` type hints)
- Familiarity with HTTP requests and HTML parsing
- A proxy provider or residential IP source
- pip installed for package management
Understand Amazon's Anti-Bot Layers
Tip: Amazon's detection is adaptive — techniques that work today may stop working in weeks. Build your scraper with modularity so you can swap components (proxy provider, fingerprint strategy, parsing logic) independently.
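One way to get the modularity described in the tip above is to treat the proxy source, the fetcher, and the parser as three plain callables that a small pipeline object wires together. The sketch below is only an illustration under that assumption: `ScraperPipeline`, `static_proxy`, `fake_fetch`, and `title_parser` are hypothetical names, and the stand-in components do no real network or HTML work.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical component signatures (not part of any library).
ProxySource = Callable[[], Optional[str]]      # returns a proxy URL or None
Fetcher = Callable[[str, Optional[str]], str]  # (url, proxy) -> HTML text
Parser = Callable[[str], dict]                 # HTML text -> extracted fields


@dataclass
class ScraperPipeline:
    """Wires three independently replaceable components together."""
    next_proxy: ProxySource
    fetch: Fetcher
    parse: Parser

    def run(self, url: str) -> dict:
        proxy = self.next_proxy()
        html = self.fetch(url, proxy)
        return self.parse(html)


# Stand-in components for illustration only. Real ones would wrap a proxy
# provider, an HTTP client, and an HTML parser.
def static_proxy() -> Optional[str]:
    return "http://proxy.example:8080"


def fake_fetch(url: str, proxy: Optional[str]) -> str:
    return f"<title>{url} via {proxy}</title>"


def title_parser(html: str) -> dict:
    return {"title": html.removeprefix("<title>").removesuffix("</title>")}


pipeline = ScraperPipeline(static_proxy, fake_fetch, title_parser)
```

Because each component is just an attribute, replacing the parser (for example, `pipeline.parse = some_other_parser`) leaves the proxy and fetch logic untouched.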
Set Up Your Python Environment
python3 -m venv amazon-scraper
source amazon-scraper/bin/activate
pip install curl_cffi selectolax fake-useragent
Tip: Avoid using the standard `requests` library directly. Its TLS fingerprint is well known to anti-bot systems, whereas `curl_cffi` impersonates real browser TLS stacks.
Configure Realistic Request Headers
from curl_cffi import requests
from fake_useragent import UserAgent
import random
import time

# Accepted browser and OS names differ between fake-useragent releases;
# check the documentation for the version you have installed.
ua = UserAgent(browsers=['Chrome'], os=['Windows', 'macOS'])

def get_headers():
    user_agent = ua.random
    return {
        'User-Agent': user_agent,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0',
    }

def scrape_amazon(url, proxy=None):
    # impersonate='chrome' makes curl_cffi present a Chrome-like TLS handshake
    session = requests.Session(impersonate='chrome')
    proxies = {'https': proxy} if proxy else None
    response = session.get(
        url,
        headers=get_headers(),
        proxies=proxies,
        timeout=30
    )
    return response

Tip: Header order matters. Real Chrome sends headers in a specific order, and some anti-bot systems check this. The `curl_cffi` library with `impersonate='chrome'` handles header ordering automatically.
Implement Proxy Rotation with Smart Retry Logic
import random
from collections import defaultdict
from datetime import datetime, timedelta

class ProxyManager:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.cooldowns: dict[str, datetime] = {}
        self.failures: dict[str, int] = defaultdict(int)

    def get_proxy(self) -> str | None:
        available = [
            p for p in self.proxies
            if p not in self.cooldowns
            or datetime.now() > self.cooldowns[p]
        ]
        if not available:
            return None
        # Weight toward proxies with fewer failures
        weights = [1 / (self.failures[p] + 1) for p in available]
        return random.choices(available, weights=weights, k=1)[0]

    def report_failure(self, proxy: str, is_captcha: bool = False):
        self.failures[proxy] += 1
        cooldown = timedelta(minutes=30) if is_captcha else timedelta(minutes=5)
        self.cooldowns[proxy] = datetime.now() + cooldown

    def report_success(self, proxy: str):
        self.failures[proxy] = max(0, self.failures[proxy] - 1)

Add Human-Like Request Timing
import random
import time

def human_delay(min_seconds=1.5, max_seconds=5.0):
    """Sleep for a human-like interval drawn from a log-normal distribution."""
    # Log-normal produces occasional longer pauses, mimicking
    # a real user who sometimes reads the page longer
    delay = random.lognormvariate(0.5, 0.5)
    delay = max(min_seconds, min(delay, max_seconds))
    time.sleep(delay)

def scrape_product_list(urls: list[str], proxy_manager: ProxyManager):
    results = []
    for url in urls:
        proxy = proxy_manager.get_proxy()
        if not proxy:
            print('All proxies on cooldown. Waiting...')
            time.sleep(60)
            proxy = proxy_manager.get_proxy()
        if not proxy:
            # Still nothing available: skip this URL rather than
            # fall back to an unproxied request from your own IP
            print(f'No proxy available, skipping {url}')
            continue
        try:
            response = scrape_amazon(url, proxy=proxy)
            if 'captcha' in response.text.lower():
                proxy_manager.report_failure(proxy, is_captcha=True)
            else:
                proxy_manager.report_success(proxy)
                results.append(response.text)
        except Exception as e:
            proxy_manager.report_failure(proxy)
            print(f'Error: {e}')
        # Pause after every attempt, including CAPTCHA hits and errors
        human_delay()
    return results

Tip: For high-volume scraping, consider batching requests by category. Scrape all products in one category before moving to the next, with a longer pause between categories. This mirrors real browsing behavior.
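The category batching suggested in the tip above could look roughly like the following. This is a sketch under assumptions: `group_by_category` and `scrape_by_category` are hypothetical helpers, the 30 to 90 second pause range is an arbitrary placeholder rather than a recommendation from this guide, and `scrape_one` stands in for whatever per-URL function you already have (including its own short per-request delay).

```python
import random
import time
from collections import defaultdict
from typing import Callable, Iterable


def group_by_category(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (category, url) pairs into {category: [urls]}, preserving order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for category, url in pairs:
        grouped[category].append(url)
    return dict(grouped)


def scrape_by_category(
    grouped: dict[str, list[str]],
    scrape_one: Callable[[str], object],
    pause_range: tuple[float, float] = (30.0, 90.0),  # assumed values
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, list]:
    """Finish one category before starting the next, pausing in between."""
    results: dict[str, list] = {}
    categories = list(grouped)
    for index, category in enumerate(categories):
        results[category] = [scrape_one(url) for url in grouped[category]]
        # Longer pause between categories, but not after the last one
        if index < len(categories) - 1:
            sleep(random.uniform(*pause_range))
    return results
```

Passing `sleep` as a parameter is only there so the pause can be replaced in tests; in normal use the default `time.sleep` applies.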
Parse Product Data Efficiently
from selectolax.parser import HTMLParser

def parse_product_page(html: str) -> dict:
    tree = HTMLParser(html)

    def text(selector: str) -> str:
        node = tree.css_first(selector)
        return node.text(strip=True) if node else ''

    def attr(selector: str, attribute: str) -> str:
        node = tree.css_first(selector)
        # Attribute values can be None (e.g. valueless attributes),
        # so normalize to an empty string
        return (node.attributes.get(attribute) or '') if node else ''

    return {
        'title': text('#productTitle'),
        'price': text('.a-price .a-offscreen'),
        'rating': text('#acrPopover .a-icon-alt'),
        'review_count': text('#acrCustomerReviewText'),
        'availability': text('#availability span'),
        'brand': text('#bylineInfo'),
        'asin': attr('[data-asin]', 'data-asin'),
        'image_url': attr('#landingImage', 'src'),
        'features': [
            li.text(strip=True)
            for li in tree.css('#feature-bullets li span.a-list-item')
        ],
    }

Tip: Amazon frequently changes its HTML structure. Build your parser with fallback selectors and log warnings when primary selectors fail to match; this gives you early warning of layout changes.
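A minimal sketch of the fallback-selector idea from the tip above, written against a generic `select` callable so it does not depend on a particular parser. `first_match` is a hypothetical helper, and the logger name is arbitrary. With selectolax, `select` could be a small function that returns the stripped text of `tree.css_first(selector)` or `None` when nothing matches.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger("amazon_parser")  # arbitrary logger name


def first_match(
    select: Callable[[str], Optional[str]],
    selectors: Sequence[str],
    field: str,
) -> str:
    """Try selectors in order and warn whenever the primary one misses."""
    for index, selector in enumerate(selectors):
        value = select(selector)
        if value:
            if index > 0:
                # A fallback matched: likely an early sign of a layout change
                logger.warning(
                    "%s: primary selector %r missed, fallback %r matched",
                    field, selectors[0], selector,
                )
            return value
    logger.warning("%s: no selector matched (tried %s)", field, list(selectors))
    return ""
```

For example, `first_match(select, ['#productTitle', '#title span'], 'title')` tries the primary selector first; the second selector here is a made-up fallback, so substitute ones you have verified against real pages.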
Handle CAPTCHAs and Blocks Gracefully
def detect_block(response) -> str | None:
    """Returns block type or None if response is clean."""
    if response.status_code == 503:
        return 'rate_limited'
    body = response.text.lower()
    if 'captcha' in body or 'validatecaptcha' in body:
        return 'captcha'
    if 'sorry, we just need to make sure' in body:
        return 'robot_check'
    if 'api-services-support@amazon.com' in body:
        return 'hard_block'
    if response.url and 'robot-check' in str(response.url):
        return 'robot_check'
    return None

Consider Real-Device Infrastructure for Production
Tip: Track your scraper's success rate weekly. If it drops below 90%, your fingerprint or proxy strategy likely needs updating. A sudden drop usually means Amazon has updated their detection, while a gradual decline suggests your IPs are being burned.
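One simple way to monitor the success rate mentioned in the tip above is a rolling window over recent requests. Note the substitution: the sketch below tracks the last N requests rather than a calendar week, which is an assumption made to keep the example small. `SuccessTracker` is a hypothetical class, and the 90% threshold mirrors the figure in the tip.

```python
from collections import deque


class SuccessTracker:
    """Rolling success rate over the most recent `window` requests."""

    def __init__(self, window: int = 1000, threshold: float = 0.90):
        # deque with maxlen silently drops the oldest outcome when full
        self.outcomes: deque = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def rate(self) -> float:
        # Treat "no data yet" as healthy to avoid false alarms at startup
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self) -> bool:
        return self.rate < self.threshold
```

Calling `record(True)` or `record(False)` after each request and checking `needs_attention()` periodically is enough to flag a drop; telling a sudden drop from a gradual decline would need the rate logged over time as well.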
FAQ
Is it legal to scrape Amazon?
Scraping publicly available Amazon product data is generally treated as lawful in the US: in hiQ v. LinkedIn, the Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. However, Amazon's Terms of Service prohibit automated access, so there is a contractual risk. Most e-commerce companies treat Amazon scraping as a standard business practice, but you should consult your own legal counsel for your specific jurisdiction and use case.
How many requests per day can I make?
There is no fixed limit; it depends on your infrastructure. With a single residential IP, you might manage 200-500 requests per day before encountering CAPTCHAs. With a rotating pool of 100+ residential IPs and proper timing, 10,000-50,000 requests per day is achievable. Real-device infrastructure can sustain even higher volumes with better success rates.
Why am I still getting blocked when using residential proxies?
Residential IPs are necessary but not sufficient. Amazon also checks TLS fingerprints, JavaScript execution environment, header consistency, and behavioral patterns. If you are using a standard Python HTTP library through a residential proxy, the TLS fingerprint alone identifies you as non-browser traffic. Use curl_cffi or a real browser automation tool.
Should I use a headless browser or plain HTTP requests?
For product pages and search results, well-configured HTTP requests with curl_cffi are faster and more efficient. Use headless Chrome only for pages that require JavaScript rendering to display data (some Amazon pages lazy-load content). Be aware that default Puppeteer and Playwright configurations are easily detected, so you need stealth plugins and proper configuration.
How do I scrape Amazon reviews?
Amazon reviews are paginated, so collecting them means requesting multiple pages per product. Use the /product-reviews/ URL pattern with the pageNumber parameter for direct access. The same anti-bot techniques apply, but review pages tend to be less aggressively protected than search results or product detail pages.
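As an illustration of the /product-reviews/ pattern with pageNumber, the helper below builds one URL per review page. The exact path layout (ASIN directly after /product-reviews/, trailing slash, `www.amazon.com` as the default domain) is an assumption to verify against live pages, and `review_page_urls` and the placeholder ASIN are hypothetical.

```python
from urllib.parse import urlencode


def review_page_urls(asin: str, pages: int, domain: str = "www.amazon.com") -> list[str]:
    """Build review-page URLs for pages 1..pages (assumed URL layout)."""
    return [
        f"https://{domain}/product-reviews/{asin}/?{urlencode({'pageNumber': n})}"
        for n in range(1, pages + 1)
    ]


# Placeholder ASIN for illustration only
urls = review_page_urls("B000000000", pages=3)
```

Each URL can then go through the same fetch, block-detection, and delay logic as any other request.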
Which proxy type works best for Amazon?
Mobile carrier proxies have the highest trust scores on Amazon, followed by residential proxies. ISP proxies (static residential) offer a good balance of speed and trust. Datacenter proxies are blocked almost immediately. For the highest success rates, real-device solutions that use actual smartphone connections provide both the IP trust and fingerprint authenticity that Amazon's detection looks for.
Skip the Cat-and-Mouse Game
Archonum's real-device infrastructure delivers 99.9% success rates on Amazon with native smartphone fingerprints. No TLS spoofing, no fingerprint emulation, no blocked requests.