The Amazon Sunglass Web Scraper is a Python project that extracts product information from Amazon India's search results for sunglasses. It demonstrates how to collect structured data from an e-commerce site, handling dynamic content, pagination, and data storage.
Web Scraping Focus:
This project leverages several Python libraries to perform web scraping efficiently. Below is an overview of the tools and their roles:
- `selenium`: Automates browser interactions to load dynamic content and navigate pages.
- `bs4` (BeautifulSoup): Parses HTML to extract specific elements like product titles and prices.
- `pandas`: Structures scraped data into a DataFrame and exports it to CSV.
- `tqdm`: Displays a progress bar for scraping multiple pages.
- `requests` and `numpy`: Included as dependencies, though not directly used in the main script.

The script targets the Amazon India search URL for "sunglass", reads the number of result pages from the pagination bar (7 at the time of writing), and collects approximately 390 products. It handles cases where ratings or prices may be missing by using conditional checks, as shown in the sketch below.
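The missing-field handling comes down to checking whether BeautifulSoup's `find` returned `None` before reading the element. A minimal sketch of the pattern, using toy markup rather than Amazon's actual page:

```python
from bs4 import BeautifulSoup

# Toy product card; real Amazon markup is more complex.
html = '<div><span class="a-price-whole">1,299</span></div>'
product = BeautifulSoup(html, "html.parser")

# find() returns None when the element is absent, so guard before using it.
price = product.find("span", {"class": "a-price-whole"})
price = "₹" + price.text if price is not None else None
print(price)  # ₹1,299
```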
The core web scraping logic is implemented in `app.py`. It initializes a Chrome WebDriver, navigates the search pages, extracts product data, and saves it to a CSV file.
```python
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from tqdm import tqdm

SEARCH_URL = ("https://www.amazon.in/s?k=sunglass&crid=389J7CS8RFIK9"
              "&sprefix=sungla%2Caps%2C356&ref=nb_sb_noss_2")

driver = webdriver.Chrome()
driver.get(SEARCH_URL)

# The disabled pagination item holds the last page number (7 at the time of writing).
html_data = BeautifulSoup(driver.page_source, "html.parser")
no_of_pages = int(html_data.find("span", {"class": "s-pagination-item s-pagination-disabled"}).text)

titles, images, prices, ratings, urls = [], [], [], [], []

for i in tqdm(range(no_of_pages)):
    driver.get(SEARCH_URL + "&page=" + str(i + 1))
    time.sleep(3)  # give the page time to load and avoid hammering the server
    html_data = BeautifulSoup(driver.page_source, "html.parser")
    products = html_data.find_all("div", {"data-component-type": "s-search-result"})
    for product in products:
        titles.append(product.h2.find("span").text)
        images.append(product.find("img", {"class": "s-image"})["src"])

        # Ratings and prices can be absent; append None rather than skipping,
        # which would misalign the columns.
        rating = product.find("span", {"class": "a-icon-alt"})
        ratings.append(float(rating.text[0:4]) if rating is not None else None)

        price = product.find("span", {"class": "a-price-whole"})
        prices.append("₹" + price.text if price is not None else None)

        link = product.find("a", {"class": "a-link-normal s-line-clamp-2 s-link-style a-text-normal"})
        urls.append("https://www.amazon.in" + link["href"])

driver.quit()

data = pd.DataFrame({
    "titles": titles,
    "images": images,
    "prices": prices,
    "ratings": ratings,
    "purls": urls,
})
data.to_csv("sunglass.csv", index=False)
```
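If you don't need to watch the browser while the scraper runs, Selenium supports headless Chrome. A small optional variation on the driver setup (not part of `app.py` as written):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # headless mode flag for recent Chrome versions
driver = webdriver.Chrome(options=options)
```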
The Makefile provides commands to set up a virtual environment, install dependencies, run the script, and clean up temporary files. Note that it is written for Windows: it uses backslash `\Scripts` paths, PowerShell for activation, and `cmd`-style `if exist`/`del` in the clean target.
```makefile
# Variables
PYTHON = python
PIP = pip
VENV = venv
MAIN_SCRIPT = app.py

.PHONY: all install install-lib run clean

all: install run

install:
	$(PYTHON) -m venv $(VENV)
	$(VENV)\Scripts\pip install --upgrade pip
	$(VENV)\Scripts\pip install -r requirements.txt

install-lib:
	$(VENV)\Scripts\pip install -r requirements.txt

run:
	powershell -Command "& { . $(VENV)\Scripts\Activate.ps1; python $(MAIN_SCRIPT) }"

clean:
	if exist $(VENV) rmdir /s /q $(VENV)
	if exist __pycache__ rmdir /s /q __pycache__
	del /s /q *.pyc *.pyo 2>nul
```
The `requirements.txt` file lists the Python libraries required for the web scraping project:

```
selenium
bs4
pandas
requests
numpy
tqdm
```
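The file lists package names without version pins, so installs may drift over time. If you want reproducible installs, standard pip usage can snapshot the exact versions in your environment:

```
pip freeze > requirements.txt
```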
To run the web scraper, follow these steps:

1. Run `make install` in the terminal to create a virtual environment and install dependencies.
2. Run `make run` to execute the scraper.
3. Check the `sunglass.csv` file for the scraped data.
The scraper produces a CSV file (`sunglass.csv`) with the following columns:

- `titles`: Sunglass product names.
- `images`: URLs to product images.
- `prices`: Prices in Indian Rupees (₹), or None if unavailable.
- `ratings`: Customer ratings out of 5, or None if unavailable.
- `purls`: Direct URLs to product pages.

The script processes 7 pages, yielding around 390 product entries, depending on Amazon's search results at runtime.
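For a quick sanity check, the CSV loads straight back into pandas. A hypothetical follow-up snippet, not part of the project scripts:

```python
import pandas as pd

df = pd.read_csv("sunglass.csv")
print(df.shape)                  # roughly (390, 5) after a 7-page run
print(df["ratings"].describe())  # summary stats; missing ratings load as NaN
print(df.head())
```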
Web scraping should be conducted responsibly: the script includes a delay between page loads (`time.sleep(3)`) to avoid overwhelming the server.
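A fixed pause works, but a common courtesy pattern (an assumption here, not something `app.py` does) is to add random jitter so requests don't arrive at an exact interval:

```python
import random
import time

# Sleep between 3 and 5 seconds instead of exactly 3.
time.sleep(3 + random.uniform(0, 2))
```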