How to Scrape LinkedIn Profiles and Jobs Without Getting Banned
Prerequisites
- Python 3.10+ installed
- Strong understanding of HTTP requests, cookies, and session management
- Familiarity with browser DevTools and network inspection
- A LinkedIn account (free or premium)
- Residential proxy access for production use
Understand LinkedIn's Detection System
Tip: LinkedIn's detection is more account-focused than IP-focused. Losing an aged LinkedIn account with real connections is far more costly than burning a proxy IP. Protect accounts aggressively — use conservative rate limits and realistic behavior patterns.
Choose Your Scraping Surface: Public Pages vs. Authenticated API
from curl_cffi import requests


# Public page approach (no auth needed)
def fetch_public_profile(username: str, proxy: str | None = None) -> str:
    """Fetch a public LinkedIn profile page."""
    url = f"https://www.linkedin.com/in/{username}/"
    session = requests.Session(impersonate="chrome")
    proxies = {"https": proxy, "http": proxy} if proxy else None
    response = session.get(
        url,
        proxies=proxies,
        headers={
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        },
        timeout=15,
    )
    return response.text

Tip: If you only need basic profile data (name, headline, current company), the public page approach avoids account risk entirely. Use the authenticated API only when you need full profile details.
Authenticate and Capture Session Tokens
import json
from pathlib import Path

from curl_cffi import requests


class LinkedInClient:
    BASE_URL = "https://www.linkedin.com"
    API_URL = "https://www.linkedin.com/voyager/api"

    def __init__(self, li_at: str, jsessionid: str, proxy: str | None = None):
        self.session = requests.Session(impersonate="chrome")
        self.proxy = proxy
        self.proxies = {"https": proxy, "http": proxy} if proxy else None
        # Set authentication cookies
        self.session.cookies.set("li_at", li_at, domain=".linkedin.com")
        self.session.cookies.set("JSESSIONID", jsessionid, domain=".linkedin.com")
        self.headers = {
            # The CSRF token is the JSESSIONID value without its surrounding quotes
            "csrf-token": jsessionid.strip('"'),
            "Accept": "application/vnd.linkedin.normalized+json+2.1",
            "Accept-Language": "en-US,en;q=0.9",
            "x-li-lang": "en_US",
            "x-li-track": json.dumps({"clientVersion": "1.13.8860", "osName": "web"}),
            "x-restli-protocol-version": "2.0.0",
        }

    def _get(self, endpoint: str, params: dict | None = None) -> dict:
        url = f"{self.API_URL}{endpoint}"
        response = self.session.get(
            url,
            headers=self.headers,
            params=params,
            proxies=self.proxies,
            timeout=15,
        )
        if response.status_code == 429:
            raise Exception("Rate limited: back off")
        if response.status_code == 401:
            raise Exception("Session expired: re-authenticate")
        response.raise_for_status()
        return response.json()

    @classmethod
    def from_cookie_file(cls, filepath: str, proxy: str | None = None):
        """Load credentials from a JSON file."""
        data = json.loads(Path(filepath).read_text())
        return cls(li_at=data["li_at"], jsessionid=data["JSESSIONID"], proxy=proxy)

Tip: Extract cookies from your browser using a browser extension like 'EditThisCookie' or from DevTools > Application > Cookies. Store them in a JSON file outside your code repository, and never hardcode credentials.
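To keep the cookie file out of the repository, one option is to point at it through an environment variable and validate it before building a client. This is a minimal sketch: the variable name `LINKEDIN_COOKIE_FILE` and the helper `load_cookie_file` are assumptions made for illustration, not part of any LinkedIn or curl_cffi API.

```python
import json
import os
from pathlib import Path

# The two cookies the client above expects to find in the file
REQUIRED_KEYS = ("li_at", "JSESSIONID")


def load_cookie_file(env_var: str = "LINKEDIN_COOKIE_FILE") -> dict[str, str]:
    """Read session cookies from the JSON file named by an environment variable."""
    raw_path = os.environ.get(env_var)
    if not raw_path:
        raise RuntimeError(f"Set {env_var} to the path of your cookie file")
    path = Path(raw_path).expanduser()
    data = json.loads(path.read_text())
    # Fail early with a clear message instead of a KeyError deep in the client
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"Cookie file is missing: {', '.join(missing)}")
    return {key: data[key] for key in REQUIRED_KEYS}
```

The returned dictionary can be passed on as `LinkedInClient(li_at=cookies["li_at"], jsessionid=cookies["JSESSIONID"])`.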
Scrape LinkedIn Profiles via the Voyager API
from dataclasses import dataclass, field


@dataclass
class LinkedInProfile:
    public_id: str
    first_name: str
    last_name: str
    headline: str
    location: str
    industry: str
    summary: str
    experience: list[dict] = field(default_factory=list)
    education: list[dict] = field(default_factory=list)
    skills: list[str] = field(default_factory=list)


def get_profile(client: LinkedInClient, public_id: str) -> LinkedInProfile:
    """Fetch a full LinkedIn profile by public ID (URL slug)."""
    endpoint = f"/identity/dash/profiles?q=memberIdentity&memberIdentity={public_id}"
    data = client._get(endpoint)
    elements = data.get("elements", [])
    if not elements:
        raise ValueError(f"Profile not found: {public_id}")
    profile_data = elements[0]
    return LinkedInProfile(
        public_id=public_id,
        first_name=profile_data.get("firstName", ""),
        last_name=profile_data.get("lastName", ""),
        headline=profile_data.get("headline", ""),
        location=profile_data.get("geoLocationName", ""),
        industry=profile_data.get("industryName", ""),
        summary=profile_data.get("summary", ""),
    )


def get_profile_experience(client: LinkedInClient, profile_urn: str) -> list[dict]:
    """Fetch work experience for a profile."""
    endpoint = f"/identity/dash/profilePositionGroups?q=viewee&profileUrn={profile_urn}"
    data = client._get(endpoint)
    positions = []
    for group in data.get("elements", []):
        for position in group.get("profilePositionInPositionGroup", {}).get("elements", []):
            pos = position.get("profilePosition", {})
            positions.append({
                "title": pos.get("title", ""),
                "company": pos.get("companyName", ""),
                "location": pos.get("locationName", ""),
                "start_date": pos.get("timePeriod", {}).get("startDate", {}),
                "end_date": pos.get("timePeriod", {}).get("endDate", {}),
                "description": pos.get("description", ""),
            })
    return positions

Tip: LinkedIn's API response structure includes 'included' and 'elements' arrays with cross-references via URN identifiers. For deeply nested data, you may need to resolve these references by matching URNs across the response payload.
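Resolving those references can be sketched as an index keyed by URN plus a bounded recursive lookup. The example assumes each entry in `included` identifies itself with an `entityUrn` key and that references are plain URN strings; both are assumptions about the payload shape, and `index_included` and `resolve` are hypothetical helpers.

```python
def index_included(payload: dict) -> dict[str, dict]:
    """Map each included entity's URN to the entity itself."""
    return {
        item["entityUrn"]: item
        for item in payload.get("included", [])
        if isinstance(item, dict) and "entityUrn" in item
    }


def resolve(value, index: dict[str, dict], depth: int = 3):
    """Replace URN strings with the entities they point to.

    `depth` caps how many references are followed in a chain, so
    entities that point back at each other cannot recurse forever.
    """
    if depth < 0:
        return value
    if isinstance(value, str) and value in index:
        return resolve(index[value], index, depth - 1)
    if isinstance(value, list):
        return [resolve(item, index, depth) for item in value]
    if isinstance(value, dict):
        # Keep an entity's own URN as a string instead of expanding it
        return {
            key: (item if key == "entityUrn" else resolve(item, index, depth))
            for key, item in value.items()
        }
    return value
```

The depth cap is a deliberate trade-off: it leaves very deep chains partially unresolved, but it keeps the output size predictable when entities reference each other.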
Extract Job Listings
from dataclasses import dataclass


@dataclass
class JobListing:
    job_id: str
    title: str
    company: str
    location: str
    posted_at: str
    description: str
    employment_type: str
    seniority_level: str
    apply_url: str


def search_jobs(
    client: LinkedInClient,
    keywords: str,
    location: str = "",
    start: int = 0,
    limit: int = 25,
) -> list[JobListing]:
    """Search LinkedIn job listings."""
    endpoint = "/voyagerJobsDashJobCards"
    # NOTE: the `location` argument is not used yet; the geoId in the query
    # below is hardcoded, so every search targets that one region.
    params = {
        "decorationId": "com.linkedin.voyager.dash.deco.jobs.search.JobSearchCardsCollection-218",
        "q": "jobSearch",
        "query": f"(origin:JOB_SEARCH_PAGE_QUERY_EXPANSION,keywords:{keywords},locationUnion:(geoId:103644278),selectedFilters:(sortBy:List(DD)))",
        "count": limit,
        "start": start,
    }
    data = client._get(endpoint, params=params)
    jobs = []
    for element in data.get("elements", []):
        job_card = element.get("jobCardUnion", {}).get("jobPostingCard", {})
        if not job_card:
            continue
        jobs.append(JobListing(
            # The numeric job ID is the last segment of the posting URN
            job_id=job_card.get("jobPostingUrn", "").split(":")[-1],
            title=job_card.get("primaryDescription", {}).get("text", ""),
            company=job_card.get("primarySubtitle", {}).get("text", ""),
            location=job_card.get("secondarySubtitle", {}).get("text", ""),
            posted_at=job_card.get("tertiaryDescription", {}).get("text", ""),
            description="",  # Requires separate detail request
            employment_type="",
            seniority_level="",
            apply_url="",
        ))
    return jobs


def get_job_details(client: LinkedInClient, job_id: str) -> dict:
    """Fetch full details for a specific job posting."""
    endpoint = f"/jobs/jobPostings/{job_id}"
    data = client._get(endpoint)
    return {
        "title": data.get("title", ""),
        "description": data.get("description", {}).get("text", ""),
        "employment_type": data.get("formattedEmploymentStatus", ""),
        "seniority_level": data.get("formattedExperienceLevel", ""),
        "industries": data.get("formattedIndustries", ""),
        "apply_url": data.get("applyMethod", {}).get("companyApplyUrl", ""),
        "listed_at": data.get("listedAt", ""),
    }

Tip: Job listing descriptions are not included in search results. You need a separate request per job ID for the full description. Batch these requests carefully with delays to avoid rate limiting.
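The per-job detail requests can be paced with a small wrapper like the one below. It is a sketch under stated assumptions: `fetch_details_in_batches` is a hypothetical helper, the 3 to 8 second delay and the cap of 25 requests are illustrative defaults, and the fetch function is passed in so the pacing logic can be exercised without touching the network.

```python
import random
import time
from typing import Callable, Iterable


def fetch_details_in_batches(
    job_ids: Iterable[str],
    fetch: Callable[[str], dict],
    min_delay: float = 3.0,
    max_delay: float = 8.0,
    max_requests: int = 25,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, dict]:
    """Fetch details one job at a time, pausing a random interval between calls."""
    details: dict[str, dict] = {}
    for count, job_id in enumerate(job_ids):
        if count >= max_requests:
            break  # Leave the rest for a later run
        if count > 0:
            sleep(random.uniform(min_delay, max_delay))
        try:
            details[job_id] = fetch(job_id)
        except Exception as exc:
            # Stop at the first failure instead of hammering a failing endpoint
            print(f"Stopping after failure on job {job_id}: {exc}")
            break
    return details
```

With the functions above, a call could look like `fetch_details_in_batches(ids, lambda job_id: get_job_details(client, job_id))`.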
Implement Rate Limiting and Session Management
import time
import random
from datetime import datetime, timedelta
from dataclasses import dataclass, field


@dataclass
class AccountState:
    client: LinkedInClient
    daily_profile_views: int = 0
    daily_searches: int = 0
    session_start: datetime = field(default_factory=datetime.now)
    last_request: datetime | None = None
    cooldown_until: datetime | None = None


class RateLimitedScraper:
    MAX_PROFILE_VIEWS = 50
    MAX_SEARCHES = 25
    MIN_DELAY = 3.0
    MAX_DELAY = 8.0
    SESSION_DURATION = timedelta(hours=3)

    def __init__(self, accounts: list[AccountState]):
        self.accounts = accounts

    def _get_available_account(self, action: str) -> AccountState | None:
        now = datetime.now()
        for account in self.accounts:
            if account.cooldown_until:
                if now < account.cooldown_until:
                    continue
                # Cooldown is over: start a fresh session window so the
                # account is not sent straight back into cooldown
                account.cooldown_until = None
                account.session_start = now
            if now - account.session_start > self.SESSION_DURATION:
                account.cooldown_until = now + timedelta(hours=2)
                continue
            if action == "profile" and account.daily_profile_views >= self.MAX_PROFILE_VIEWS:
                continue
            if action == "search" and account.daily_searches >= self.MAX_SEARCHES:
                continue
            return account
        return None

    def _wait(self, account: AccountState):
        if account.last_request:
            elapsed = (datetime.now() - account.last_request).total_seconds()
            min_wait = max(0, self.MIN_DELAY - elapsed)
            delay = random.uniform(min_wait, self.MAX_DELAY)
        else:
            delay = random.uniform(1.0, 3.0)
        time.sleep(delay)
        account.last_request = datetime.now()

    def scrape_profile(self, public_id: str) -> LinkedInProfile | None:
        account = self._get_available_account("profile")
        if not account:
            print("No accounts available for profile scraping")
            return None
        self._wait(account)
        try:
            profile = get_profile(account.client, public_id)
            account.daily_profile_views += 1
            return profile
        except Exception as e:
            print(f"Failed to scrape {public_id}: {e}")
            # Match the message raised by LinkedInClient._get on HTTP 429
            if "Rate limited" in str(e):
                account.cooldown_until = datetime.now() + timedelta(hours=4)
            return None

Tip: Reset daily counters at midnight in the account's apparent timezone, not your server's timezone. An account 'based' in New York should not have its activity counter reset at midnight UTC.
Structure and Store Your Data
import json
import sqlite3
from dataclasses import asdict
from datetime import date


def init_database(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS profiles (
            public_id TEXT NOT NULL,
            scraped_at DATE NOT NULL,
            first_name TEXT,
            last_name TEXT,
            headline TEXT,
            location TEXT,
            industry TEXT,
            summary TEXT,
            raw_json TEXT,
            PRIMARY KEY (public_id, scraped_at)
        );
        CREATE TABLE IF NOT EXISTS jobs (
            job_id TEXT NOT NULL,
            scraped_at DATE NOT NULL,
            title TEXT,
            company TEXT,
            location TEXT,
            description TEXT,
            employment_type TEXT,
            seniority_level TEXT,
            raw_json TEXT,
            PRIMARY KEY (job_id, scraped_at)
        );
        CREATE INDEX IF NOT EXISTS idx_profiles_scraped ON profiles(scraped_at);
        CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company);
    """)
    return conn


def save_profile(conn: sqlite3.Connection, profile: LinkedInProfile):
    conn.execute(
        "INSERT OR REPLACE INTO profiles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            profile.public_id,
            date.today().isoformat(),
            profile.first_name,
            profile.last_name,
            profile.headline,
            profile.location,
            profile.industry,
            profile.summary,
            json.dumps(asdict(profile)),
        ),
    )
    conn.commit()


def save_job(conn: sqlite3.Connection, job: JobListing):
    conn.execute(
        "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            job.job_id,
            date.today().isoformat(),
            job.title,
            job.company,
            job.location,
            job.description,
            job.employment_type,
            job.seniority_level,
            json.dumps(asdict(job)),
        ),
    )
    conn.commit()

Scale Beyond Account Limits with Real-Device Infrastructure
Tip: If your LinkedIn accounts are getting restricted despite conservative rate limits, the issue is almost certainly fingerprint-related rather than behavioral. Real-device solutions eliminate this class of detection entirely.
FAQ
The legal landscape is nuanced. In its 2022 hiQ v. LinkedIn decision, the Ninth Circuit held that scraping publicly accessible profile data likely does not violate the CFAA. However, LinkedIn's User Agreement prohibits scraping, creating potential breach-of-contract liability. Scraping non-public data (requiring login) carries additional risk. Most companies engaged in LinkedIn data collection operate under the hiQ precedent for public data and accept the contractual risk. Consult your legal team.
Keep daily profile views under 50, searches under 25, and use random delays of 3-8 seconds between requests. Never access LinkedIn from datacenter IPs while logged in. Use a single, consistent residential IP per session. Do not perform actions that a human would not do — viewing 50 profiles without clicking any links or scrolling any page is a clear signal.
Yes, public profile pages are accessible without authentication, but the data is limited — typically just name, headline, current position, and education. Full work history, skills, and contact information require authentication. Public page scraping is lower risk but provides less data per request.
LinkedIn's Voyager API is an internal API not intended for third-party use, so it changes without notice. Major structural changes happen 2-3 times per year. Minor parameter and response format changes happen more frequently. Build your scraper to handle missing fields gracefully and monitor for parsing failures.
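Handling missing fields and watching for parsing failures can be sketched with a tolerant lookup that also counts misses. The helper names `dig` and `missing_rate` and the module-level `missing_fields` counter are hypothetical; the point is that a field which suddenly starts missing on most responses is a strong hint that the response format has changed.

```python
from collections import Counter
from typing import Any

# How often each dotted path failed to resolve during this run
missing_fields: Counter[str] = Counter()


def dig(data: Any, path: str, default: Any = "") -> Any:
    """Walk a dotted path through nested dicts, recording a miss on failure."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            missing_fields[path] += 1
            return default
    return current


def missing_rate(path: str, total: int) -> float:
    """Share of parsed responses in which `path` was absent."""
    return missing_fields[path] / total if total else 0.0
```

Unlike chained `.get(..., {})` calls, `dig` also tolerates a key that is present but set to `null`, which would otherwise raise an `AttributeError` on the next lookup.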
Residential proxies in the same geographic region as your LinkedIn account's stated location. Mobile carrier proxies work well but are more expensive. Never use datacenter proxies for authenticated LinkedIn access — they trigger immediate security challenges. For the highest reliability, real-device infrastructure provides both the IP and the fingerprint authenticity that LinkedIn's detection requires.
LinkedIn's official Marketing and Talent APIs are limited to approved partners and provide restricted data scopes. The approval process is slow and many use cases are not covered. For most data collection needs — competitive intelligence, market research, lead generation — scraping remains the practical approach because the official APIs simply do not provide the data you need.
Scale LinkedIn Data Collection Without Account Risk
Archonum's real-device infrastructure routes LinkedIn requests through actual smartphones with native fingerprints. Lower detection rates, longer account lifespans, and no fingerprint management overhead.