How to Web Scrape Amazon with Python?

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    Want a fast, practical guide to scraping Amazon product data with Python? Here’s a concise walkthrough using requests + BeautifulSoup, with anti-bot tips, pagination, and clean parsing. For a working reference, check the GitHub repo: https://github.com/maivyly52-gif/ama...scraper-python


    What You’ll Learn

    • Send realistic HTTP requests (headers, delays)
    • Parse titles, prices, ratings, URLs with BeautifulSoup
    • Handle pagination safely
    • Reduce blocks with rotating user agents/proxies
    • Know ethical & legal guardrails








    pip install requests beautifulsoup4 fake-useragent







    (Need proxy support? Install requests[socks] for SOCKS proxies, or use httpx or your proxy provider’s SDK.)


    Core Steps

    1) Build a “human-like” request





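    import random
    import time

    import requests
    from fake_useragent import UserAgent

    ua = UserAgent()

    def build_headers():
        # Pick a fresh User-Agent for every request instead of once per run.
        return {
            "User-Agent": ua.random,
            "Accept-Language": "en-US,en;q=0.9",
        }

    def fetch(url, *, retries=3, backoff=2):
        """Return the page HTML, or None if every attempt fails or is blocked."""
        for i in range(retries):
            try:
                resp = requests.get(url, headers=build_headers(), timeout=20)
                if resp.status_code == 200 and "Robot Check" not in resp.text:
                    return resp.text
            except requests.RequestException:
                pass  # network error: fall through to the backoff below
            # Linear backoff plus random jitter before the next attempt.
            time.sleep(backoff * (i + 1) + random.uniform(0.2, 1.1))
        return None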
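The "Robot Check" test in fetch() only catches one wording of a block page. A slightly broader check could look like the sketch below. The helper name looks_blocked is hypothetical, and every marker string other than "Robot Check" is an assumption rather than something documented by Amazon, so adjust the list to what you actually see in blocked responses.

```python
# Hypothetical helper: the marker list is an assumption, tune it to real responses.
BLOCK_MARKERS = (
    "robot check",                         # the check used in fetch() above
    "enter the characters you see below",  # assumed CAPTCHA page wording
    "/errors/validatecaptcha",             # assumed CAPTCHA form action
)

def looks_blocked(status_code, html):
    """Return True if the response looks like a block or CAPTCHA page."""
    if status_code != 200:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

Inside fetch() you could then retry whenever looks_blocked(resp.status_code, resp.text) is true.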








    2) Parse product cards





    from urllib.parse import urljoin

    from bs4 import BeautifulSoup

    def parse_search(html):
        """Extract one dict per product card from a search results page."""
        soup = BeautifulSoup(html, "html.parser")
        items = []
        # These selectors depend on Amazon's current markup and may need updating.
        for card in soup.select("div.s-main-slot div[data-asin][data-component-type='s-search-result']"):
            asin = card.get("data-asin")
            title_el = card.select_one("h2 a span")
            # span.a-offscreen holds the full formatted price, not just the whole part.
            price_el = card.select_one("span.a-price > span.a-offscreen")
            rating_el = card.select_one("span.a-icon-alt")
            link_el = card.select_one("h2 a")
            href = link_el.get("href") if link_el else None
            if not (asin and title_el and href):
                continue
            items.append({
                "asin": asin,
                "title": title_el.get_text(strip=True),
                "price": price_el.get_text(strip=True) if price_el else None,
                "rating": rating_el.get_text(strip=True) if rating_el else None,
                # urljoin handles both relative and absolute hrefs.
                "url": urljoin("https://www.amazon.com", href.split("?")[0]),
            })
        return items
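parse_search() returns price and rating as display strings. If you want numbers for sorting or analysis, a small cleanup step helps. This is a minimal sketch that assumes US-style formatting such as "$1,299.99" and "4.5 out of 5 stars"; the function names parse_price and parse_rating are hypothetical, and other storefronts may format values differently.

```python
import re

def parse_price(text):
    """Convert a price string such as "$1,299.99" to a float, or None."""
    if not text:
        return None
    # Digits with optional thousands separators and an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None

def parse_rating(text):
    """Convert "4.5 out of 5 stars" to 4.5, or None if no number is found."""
    if not text:
        return None
    # The first number in the string is taken as the rating.
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None
```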








    3) Walk pagination (carefully)





    from urllib.parse import urlencode

    def search_amazon(query, pages=1):
        base = "https://www.amazon.com/s"
        results = []
        for page in range(1, pages + 1):
            params = {"k": query, "page": page}
            html = fetch(f"{base}?{urlencode(params)}")
            if not html:
                break
            results.extend(parse_search(html))
            time.sleep(random.uniform(1.2, 3.1))  # be gentle
        return results

    if __name__ == "__main__":
        data = search_amazon("wireless earbuds", pages=2)
        for row in data[:5]:
            print(row)










    Anti-Bot Tips (Reduce Blocks)

    • Rotate User-Agents per request (fake-useragent or a maintained list).
    • Use respectful delays (1–5 s with random jitter) and keep concurrency low.
    • Proxies: residential/mobile work best; rotate IPs and subnets.
    • Keep URLs minimal: send only the parameters you need (e.g., k and page) and drop tracking parameters such as ref= and qid=.
    • Fallback strategies: try different storefronts or narrower filters when you hit captchas.
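As a rough illustration of the rotation tips above, the sketch below picks a random User-Agent and cycles through a proxy pool for each request. All User-Agent strings and proxy URLs are placeholders, and next_request_kwargs is a hypothetical helper name; only the headers, proxies, and timeout arguments of requests.get() are real.

```python
import itertools
import random

# Placeholder values: replace with a maintained UA list and your own proxy endpoints.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/1.0",
]
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

_proxy_cycle = itertools.cycle(PROXIES)

def next_request_kwargs():
    """Build keyword arguments for requests.get() with a rotated identity."""
    proxy = next(_proxy_cycle)  # round-robin through the proxy pool
    return {
        "headers": {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept-Language": "en-US,en;q=0.9",
        },
        # requests expects a mapping of URL scheme to proxy URL.
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 20,
    }
```

Usage would be along the lines of requests.get(url, **next_request_kwargs()).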




    Data You Can Extract (Typical)

    • Title, price, list price, rating, review count
    • ASIN, product URL, image URL
    • Badges (e.g., “Best Seller”, “Amazon’s Choice”)
    • Availability snippets


    Legal & Ethical Notes

    • Check Amazon’s Terms of Use and your local laws before scraping.
    • Prefer official APIs when possible (e.g., Amazon Product Advertising API) for reliability.
    • Don’t overload servers; throttle requests and cache results.
    • Use scraped data only where you have the right to use it.
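One simple way to "cache results" is to store each page on disk and reuse it while it is still fresh. The sketch below is an assumption-laden example, not part of the original script: cached_fetch, cache_dir, and max_age are hypothetical names, and fetch_fn stands for any function that takes a URL and returns HTML or None.

```python
import hashlib
import time
from pathlib import Path

def cached_fetch(url, fetch_fn, cache_dir="cache", max_age=3600):
    """Return cached HTML for url if it is fresh, otherwise call fetch_fn(url)."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it becomes a safe, fixed-length file name.
    path = folder / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return path.read_text(encoding="utf-8")
    html = fetch_fn(url)
    if html:  # do not cache failures (a None result means blocked or failed)
        path.write_text(html, encoding="utf-8")
    return html
```

With this in place, repeated runs within max_age seconds read from disk instead of hitting the site again.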


    Next Steps

    • Turn results into CSV/JSON for analysis.
    • Extend the retry logic with more robust CAPTCHA detection and proxy rotation.
    • Expand parsing to product detail pages (features, bullets, specs).
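For the CSV/JSON step, something like the following should be enough to get started. It is a minimal sketch that assumes rows shaped like the dicts built in parse_search (asin, title, price, rating, url); the function names save_json and save_csv are hypothetical.

```python
import csv
import json

# Column order for the CSV file; matches the keys used in parse_search.
FIELDS = ["asin", "title", "price", "rating", "url"]

def save_json(rows, path):
    """Write the list of result dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

def save_csv(rows, path):
    """Write the list of result dicts to a CSV file with a header row."""
    # newline="" stops the csv module from writing blank lines on Windows.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

Missing values (None) end up as empty cells in the CSV and as null in the JSON.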


    Dive deeper, copy the boilerplate, and tweak it for your use case here: https://github.com/maivyly52-gif/ama...scraper-python. If you find it useful, ⭐ the repo.



