The Amazon Sunglass Web Scraper is a Python project that extracts product information from Amazon India's search results for sunglasses. It demonstrates how to collect structured data from an e-commerce site, handling dynamic content, pagination, and data storage.
Web Scraping Focus:
This project leverages several Python libraries to perform web scraping efficiently. Below is an overview of the tools and their roles:
- `selenium`: Automates browser interactions to load dynamic content and navigate pages.
- `bs4` (BeautifulSoup): Parses HTML to extract specific elements like product titles and prices.
- `pandas`: Structures scraped data into a DataFrame and exports it to CSV.
- `tqdm`: Displays a progress bar for scraping multiple pages.
- `requests` and `numpy`: Included as dependencies, though not directly used in the main script.

The script targets the Amazon India search URL for "sunglass", reads the number of result pages from the pagination bar (7 at the time of writing), and collects approximately 390 products. It handles cases where ratings or prices may be missing by using conditional checks, as shown in the sketch below.
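The missing-field handling comes down to checking whether BeautifulSoup's `find` returned `None` before reading the element. A minimal sketch of the pattern, using toy markup rather than Amazon's actual page:

```python
from bs4 import BeautifulSoup

# Toy product card; real Amazon markup is more complex.
html = '<div><span class="a-price-whole">1,299</span></div>'
product = BeautifulSoup(html, "html.parser")

# find() returns None when the element is absent, so guard before using it.
price = product.find("span", {"class": "a-price-whole"})
price = "₹" + price.text if price is not None else None
print(price)  # ₹1,299
```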
The core web scraping logic is implemented in `app.py`. It initializes a Chrome WebDriver, navigates the search pages, extracts product data, and saves it to a CSV file.
```python
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from tqdm import tqdm

SEARCH_URL = ("https://www.amazon.in/s?k=sunglass&crid=389J7CS8RFIK9"
              "&sprefix=sungla%2Caps%2C356&ref=nb_sb_noss_2")

driver = webdriver.Chrome()
driver.get(SEARCH_URL)

# The disabled pagination item holds the last page number (7 at the time of writing).
html_data = BeautifulSoup(driver.page_source, "html.parser")
no_of_pages = int(html_data.find("span", {"class": "s-pagination-item s-pagination-disabled"}).text)

titles, images, prices, ratings, urls = [], [], [], [], []

for i in tqdm(range(no_of_pages)):
    driver.get(SEARCH_URL + "&page=" + str(i + 1))
    time.sleep(3)  # give the page time to load and avoid hammering the server
    html_data = BeautifulSoup(driver.page_source, "html.parser")
    products = html_data.find_all("div", {"data-component-type": "s-search-result"})
    for product in products:
        titles.append(product.h2.find("span").text)
        images.append(product.find("img", {"class": "s-image"})["src"])

        # Ratings and prices can be absent; append None rather than skipping,
        # which would misalign the columns.
        rating = product.find("span", {"class": "a-icon-alt"})
        ratings.append(float(rating.text[0:4]) if rating is not None else None)

        price = product.find("span", {"class": "a-price-whole"})
        prices.append("₹" + price.text if price is not None else None)

        link = product.find("a", {"class": "a-link-normal s-line-clamp-2 s-link-style a-text-normal"})
        urls.append("https://www.amazon.in" + link["href"])

driver.quit()

data = pd.DataFrame({
    "titles": titles,
    "images": images,
    "prices": prices,
    "ratings": ratings,
    "purls": urls,
})
data.to_csv("sunglass.csv", index=False)
```
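If you don't need to watch the browser while the scraper runs, Selenium supports headless Chrome. A small optional variation on the driver setup (not part of `app.py` as written):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # headless mode flag for recent Chrome versions
driver = webdriver.Chrome(options=options)
```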
The Makefile provides commands to set up a virtual environment, install dependencies, run the script, and clean up temporary files. Note that it is written for Windows: it uses backslash `\Scripts` paths, PowerShell for activation, and `cmd`-style `if exist`/`del` in the clean target.
```makefile
# Variables
PYTHON = python
PIP = pip
VENV = venv
MAIN_SCRIPT = app.py

.PHONY: all install install-lib run clean

all: install run

install:
	$(PYTHON) -m venv $(VENV)
	$(VENV)\Scripts\pip install --upgrade pip
	$(VENV)\Scripts\pip install -r requirements.txt

install-lib:
	$(VENV)\Scripts\pip install -r requirements.txt

run:
	powershell -Command "& { . $(VENV)\Scripts\Activate.ps1; python $(MAIN_SCRIPT) }"

clean:
	if exist $(VENV) rmdir /s /q $(VENV)
	if exist __pycache__ rmdir /s /q __pycache__
	del /s /q *.pyc *.pyo 2>nul
```
The `requirements.txt` file lists the Python libraries required for the web scraping project:

```
selenium
bs4
pandas
requests
numpy
tqdm
```
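The file lists package names without version pins, so installs may drift over time. If you want reproducible installs, standard pip usage can snapshot the exact versions in your environment:

```
pip freeze > requirements.txt
```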
To run the web scraper, follow these steps:

1. Run `make install` in the terminal to create a virtual environment and install dependencies.
2. Run `make run` to execute the scraper.
3. Check the `sunglass.csv` file for the scraped data.
The scraper produces a CSV file (`sunglass.csv`) with the following columns:

- `titles`: Sunglass product names.
- `images`: URLs to product images.
- `prices`: Prices in Indian Rupees (₹), or None if unavailable.
- `ratings`: Customer ratings out of 5, or None if unavailable.
- `purls`: Direct URLs to product pages.

The script processes 7 pages, yielding around 390 product entries, depending on Amazon's search results at runtime.
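For a quick sanity check, the CSV loads straight back into pandas. A hypothetical follow-up snippet, not part of the project scripts:

```python
import pandas as pd

df = pd.read_csv("sunglass.csv")
print(df.shape)                  # roughly (390, 5) after a 7-page run
print(df["ratings"].describe())  # summary stats; missing ratings load as NaN
print(df.head())
```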
Web scraping should be conducted responsibly: the script includes a delay between page loads (`time.sleep(3)`) to avoid overwhelming the server.
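A fixed pause works, but a common courtesy pattern (an assumption here, not something `app.py` does) is to add random jitter so requests don't arrive at an exact interval:

```python
import random
import time

# Sleep between 3 and 5 seconds instead of exactly 3.
time.sleep(3 + random.uniform(0, 2))
```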