Amazon Sunglass Web Scraper Documentation

Project Overview

The Amazon Sunglass Web Scraper is a Python-based project designed to demonstrate web scraping techniques by extracting product information from Amazon India's search results for sunglasses. This project showcases how to use web scraping tools to collect structured data from e-commerce websites, handling dynamic content, pagination, and data storage.

Web Scraping Focus:

  - Rendering JavaScript-driven pages with Selenium and a Chrome WebDriver.
  - Parsing the rendered HTML with BeautifulSoup to extract product titles, images, prices, ratings, and URLs.
  - Handling pagination across multiple search result pages.
  - Handling missing fields (ratings, prices) with conditional checks.
  - Storing the structured results in a CSV file with pandas.

Technical Details

This project leverages several Python libraries to perform web scraping efficiently. Below is an overview of the tools and their roles:

  - Selenium: automates Google Chrome to load and render the JavaScript-heavy search pages.
  - BeautifulSoup (bs4): parses the rendered HTML and locates product elements.
  - pandas: assembles the extracted fields into a DataFrame and writes the CSV output.
  - tqdm: displays a progress bar while iterating over the result pages.
  - time: inserts a short delay between page loads so content can finish rendering.

The script targets the Amazon India search URL for "sunglass", reads the total page count from the pagination widget (7 pages at the time of writing), and collects roughly 390 products. Missing ratings or prices are handled with conditional checks so an incomplete listing does not break the run.

Code Structure

1. Main Script (app.py)

The core web scraping logic is implemented in app.py. It initializes a Chrome WebDriver, navigates search pages, extracts product data, and saves it to a CSV file.

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from tqdm import tqdm

SEARCH_URL = "https://www.amazon.in/s?k=sunglass&crid=389J7CS8RFIK9&sprefix=sungla%2Caps%2C356&ref=nb_sb_noss_2"

driver = webdriver.Chrome()

# Load the first results page to read the total page count from the pagination widget.
driver.get(SEARCH_URL)
html_data = BeautifulSoup(driver.page_source, "html.parser")

# The highest page number is rendered as a disabled pagination item; fall back
# to a single page if the widget is not found.
pagination = html_data.find("span", {'class': 's-pagination-item s-pagination-disabled'})
no_of_pages = int(pagination.text) if pagination else 1

titles = []
images = []
prices = []
ratings = []
urls = []

for i in tqdm(range(no_of_pages)):
    driver.get(SEARCH_URL + "&page=" + str(i + 1))
    time.sleep(3)  # give the page time to finish rendering before parsing
    html_data = BeautifulSoup(driver.page_source, "html.parser")
    products = html_data.find_all("div", {'data-component-type': 's-search-result'})

    for product in products:
        # Product title: a <span> inside the result's <h2>.
        titles.append(product.h2.find("span").text)

        # Main product image URL.
        images.append(product.find("img", {'class': 's-image'})['src'])

        # Rating text looks like "4.2 out of 5 stars"; store None when absent.
        rating = product.find('span', {'class': 'a-icon-alt'})
        if rating is not None:
            rating = float(rating.text[0:4])
        ratings.append(rating)

        # Whole-rupee price; store None when the listing shows no price.
        price = product.find('span', {'class': 'a-price-whole'})
        if price is not None:
            price = '₹' + price.text
        prices.append(price)

        # Product detail-page link (the href is relative, so prefix the domain).
        link = product.find('a', {'class': 'a-link-normal s-line-clamp-2 s-link-style a-text-normal'})
        urls.append('https://www.amazon.in' + link['href'])

driver.quit()

data = pd.DataFrame({
    'titles': titles,
    'images': images,
    'prices': prices,
    'ratings': ratings,
    'purls': urls
})

data.to_csv('sunglass.csv', index=False)
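
The fixed time.sleep(3) keeps the script simple, but it wastes time on fast loads and can be too short on slow ones. Below is a minimal sketch of an alternative using Selenium's explicit waits; it is not part of the original script, the helper name get_rendered_page is illustrative, and the 20-second timeout is an arbitrary choice.

# Sketch: block until at least one search result is present instead of sleeping.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_rendered_page(driver, url, timeout=20):
    """Load url, wait until a search result element appears, then return the HTML."""
    driver.get(url)
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "div[data-component-type='s-search-result']")
        )
    )
    return driver.page_source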

2. Makefile

The Makefile provides commands to set up a virtual environment, install dependencies, run the script, and clean up temporary files. Note that the paths and shell commands are Windows-specific (cmd and PowerShell); a POSIX sketch follows the listing.

# Variables (paths and shell commands below assume Windows)
PYTHON = python
VENV = venv
MAIN_SCRIPT = app.py

.PHONY: all install install-lib run clean

all: install run

install:
	$(PYTHON) -m venv $(VENV)
	$(VENV)\Scripts\pip install --upgrade pip
	$(VENV)\Scripts\pip install -r requirements.txt

install-lib:
	$(VENV)\Scripts\pip install -r requirements.txt

run:
	powershell -Command "& { . $(VENV)\Scripts\Activate.ps1; python $(MAIN_SCRIPT) }"

clean:
	if exist $(VENV) rmdir /s /q $(VENV)
	if exist __pycache__ rmdir /s /q __pycache__
	del /s /q *.pyc *.pyo 2>nul
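
On Linux or macOS, the equivalent targets would use the venv/bin layout and POSIX shell commands. A minimal sketch, assuming python3 is on the PATH (not part of the original project):

# POSIX sketch of the install/run/clean targets
install:
	python3 -m venv venv
	venv/bin/pip install --upgrade pip
	venv/bin/pip install -r requirements.txt

run:
	venv/bin/python app.py

clean:
	rm -rf venv __pycache__
	find . -name '*.py[co]' -delete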

3. Requirements (requirements.txt)

Lists the Python libraries the script uses (beautifulsoup4 provides the bs4 module; numpy is pulled in automatically as a pandas dependency).

selenium
beautifulsoup4
pandas
tqdm

Setup Instructions

To run the web scraper, follow these steps:

  1. Install Python and Google Chrome.
  2. Download a ChromeDriver build matching your Chrome version and place it on your PATH (Selenium 4.6 and later can download a matching driver automatically via Selenium Manager, so this step is often unnecessary); a quick verification snippet is shown after this list.
  3. Clone or download the project files.
  4. Run make install in the terminal to create a virtual environment and install dependencies.
  5. Run make run to execute the scraper.
  6. Check the generated sunglass.csv file for the scraped data.
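
A quick way to confirm that Chrome and the driver are wired up correctly before scraping (a standalone sketch, not part of app.py):

# Sketch: verify that Selenium can launch Chrome.
from selenium import webdriver

driver = webdriver.Chrome()
print("Chrome version:", driver.capabilities["browserVersion"])
driver.quit()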

Output

The scraper produces a CSV file (sunglass.csv) with the following columns:

  - titles: product title text
  - images: URL of the main product image
  - prices: whole-rupee price prefixed with ₹ (empty when the listing shows no price)
  - ratings: star rating as a float, e.g. 4.2 (empty when the product has no rating)
  - purls: full URL of the product detail page

At the time of writing, the search returned 7 pages, yielding around 390 product entries; the exact count depends on Amazon's search results at runtime.
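
To sanity-check the output, the CSV can be loaded back with pandas. A minimal sketch:

# Sketch: load and inspect the scraper's output.
import pandas as pd

data = pd.read_csv("sunglass.csv")
print(data.shape)             # (rows, 5)
print(data.columns.tolist())  # ['titles', 'images', 'prices', 'ratings', 'purls']
print(data.head())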

Ethical Considerations

Web scraping should be conducted responsibly:

  - Check the site's robots.txt and terms of service before scraping.
  - Rate-limit requests (this script already pauses 3 seconds between pages) to avoid burdening the server.
  - Collect only publicly visible product data; never gather personal information.
  - Use the scraped data for learning and analysis rather than republication.
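
As one concrete courtesy check, Python's standard library can read a site's robots.txt. A minimal sketch (whether a given path is allowed depends on the live file at the time you run it):

# Sketch: check whether a URL may be fetched according to robots.txt,
# using only the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.amazon.in/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.amazon.in/s?k=sunglass"))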