Scraping Data From Amazon
--
Introduction:
Data scraping, also known as web scraping, is the process of importing information from a website into a spreadsheet or local file saved on your computer. It’s one of the most efficient ways to get data from the web, and in some cases to channel that data to another website.
1) Create Virtual Environment:
- Create the virtual environment:
python -m venv venv
- Activate the virtual environment (on Linux or macOS; on Windows, run venv\Scripts\activate instead):
source venv/bin/activate
2) Install Libraries:
- CSV:
The csv module is part of the Python standard library, so there is nothing to install with pip.
A CSV file (Comma Separated Values file) is a plain text file that uses a specific structure to arrange tabular data. Because it is plain text, it can contain only text data, in other words printable ASCII or Unicode characters.
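As a small illustration of the format, the sketch below writes a header and two rows with the csv module into an in-memory buffer and reads them back. The column names and sample values are made up for this example.

```python
import csv
import io

# Hypothetical sample rows; the first description contains a comma.
rows = [
    ("USB-C cable, 2 m", "$12.99"),
    ("Wireless mouse", "$24.50"),
]

buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(["Description", "Price"])
writer.writerows(rows)

csv_text = buffer.getvalue()

# Reading the text back returns every field as a string.
parsed = list(csv.reader(io.StringIO(csv_text)))
header, data = parsed[0], parsed[1:]
print(header)
print(data)
```

The writer quotes the field that contains a comma, so the reader still sees two columns per row.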
- BeautifulSoup:
pip install beautifulsoup4
Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (it is named after "tag soup"). It creates a parse tree for each parsed page that can be used to extract data from HTML, which is useful for web scraping.
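To show what the parse tree looks like in practice, here is a minimal sketch that parses a deliberately malformed snippet. The HTML string and its class name are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the <b>, <p> and <div> tags are never closed.
html = '<div class="item"><p>Price: <b>$19.99'

soup = BeautifulSoup(html, 'html.parser')

# The parse tree can be navigated by tag name or searched by attributes.
price_text = soup.b.text
paragraph_text = soup.p.get_text()
item_div = soup.find('div', {'class': 'item'})

print(price_text)
print(paragraph_text)
```

The parser closes the open tags at the end of the input, so both lookups work even though the markup is incomplete.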
- Selenium:
pip install selenium
Selenium is a portable framework for testing web applications. Selenium provides a playback tool for authoring functional tests without the need to learn a test scripting language (Selenium IDE). It also provides a test domain-specific language (Selenese) to write tests in a number of popular programming languages, including C#, Groovy, Java, Perl, PHP, Python, Ruby and Scala. The tests can then run against most modern web browsers. Selenium runs on Windows, Linux, and macOS. It is open-source software released under the Apache License 2.0.
- webdriver-manager:
pip install webdriver-manager
webdriver-manager downloads and caches the browser driver that Selenium needs (ChromeDriver here), so it does not have to be installed by hand. It provides the ChromeDriverManager class imported in the next step.
3) Import Libraries:
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
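4) Get URL:
def get_url(search_term):
    """Build the search URL for a term, leaving a placeholder for the page number."""
    template = 'https://www.amazon.ca/s?k={}&ref=nb_sb_noss_2'
    search_term = search_term.replace(' ', '+')
    # add term query to url
    u = template.format(search_term)
    # add a page placeholder, filled in later with url.format(page)
    u += '&page={}'
    return u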
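To make the placeholder mechanics concrete, here is a sketch of how the returned URL is filled in for a given page. The function `get_url_encoded` is not part of the original script. It is a hypothetical variant that uses `urllib.parse.quote_plus`, on the assumption that a search term may contain characters such as `+` or `&` that a plain space replacement leaves unescaped.

```python
from urllib.parse import quote_plus


def get_url(search_term):
    """Same logic as the function above, repeated so the example runs on its own."""
    template = 'https://www.amazon.ca/s?k={}&ref=nb_sb_noss_2'
    search_term = search_term.replace(' ', '+')
    u = template.format(search_term)
    u += '&page={}'
    return u


def get_url_encoded(search_term):
    """Hypothetical variant: percent-encode the term instead of only replacing spaces."""
    template = 'https://www.amazon.ca/s?k={}&ref=nb_sb_noss_2'
    return template.format(quote_plus(search_term)) + '&page={}'


url = get_url('ultrawide monitor')
page_two = url.format(2)  # fill the page placeholder
print(page_two)

encoded = get_url_encoded('c++ & rust books').format(1)
print(encoded)
```

For simple terms both functions return the same URL. The encoded variant only differs when the term contains reserved characters, which it turns into sequences such as `%2B` and `%26`.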
5) Extract Record:
def extract_record(item):
    """Extract and return data from a single record."""
    # description and url
    atag = item.h2.a
    description = atag.text.strip()
    # the href is normally site-relative (it starts with '/'), so the domain has no trailing slash
    urls = 'https://www.amazon.ca' + atag.get('href')
    # price
    try:
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        # skip results that have no price
        return
    # rating and review count
    try:
        rating = item.i.text
        review_count = item.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text
    except AttributeError:
        rating = ''
        review_count = ''
    result = (description, price, rating, review_count, urls)
    return result
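The function can be tried without a browser by feeding it a hand-written fragment. The sketch below does that. The markup, product names and product IDs are hypothetical and only mimic the elements the function looks for, so real result pages may be structured differently. The function is repeated so the example runs on its own, with the domain written without a trailing slash so that the joined URL contains a single slash.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure extract_record() expects;
# real result cards contain far more elements and may change over time.
SAMPLE_HTML = """
<div data-component-type="s-search-result">
  <h2><a href="/example-product/dp/B000000000"> Example Product </a></h2>
  <i class="a-icon">4.5 out of 5 stars</i>
  <span class="a-size-base" dir="auto">1,234</span>
  <span class="a-price"><span class="a-offscreen">$19.99</span></span>
</div>
<div data-component-type="s-search-result">
  <h2><a href="/no-price-product/dp/B000000001">No Price Product</a></h2>
</div>
"""


def extract_record(item):
    """Extract and return data from a single record, or None if it has no price."""
    atag = item.h2.a
    description = atag.text.strip()
    urls = 'https://www.amazon.ca' + atag.get('href')
    try:
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return None
    try:
        rating = item.i.text
        review_count = item.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text
    except AttributeError:
        rating = ''
        review_count = ''
    return (description, price, rating, review_count, urls)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
items = soup.find_all('div', {'data-component-type': 's-search-result'})
records = [extract_record(item) for item in items]
print(records)
```

The first fragment yields a full tuple, while the second one has no price element and therefore returns None, which is why the main loop only appends records that are not empty.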
6) Main Function and Export to CSV File:
# Selenium 4 expects the driver path to be wrapped in a Service object
from selenium.webdriver.chrome.service import Service

if __name__ == '__main__':
    print('Enter an article to search for:\n')
    search = input()
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    records = []
    url = get_url(search)
    # walk through the first 20 result pages
    for page in range(1, 21):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # each search result sits in a div carrying this data attribute
        results = soup.find_all('div', {'data-component-type': 's-search-result'})
        for i in results:
            record = extract_record(i)
            if record:
                records.append(record)
    # quit() closes every window and ends the driver session
    driver.quit()
    # save data to csv file
    with open('data.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Description', 'Price', 'Rating', 'ReviewCount', 'Url'])
        writer.writerows(records)
Thanks for your attention, and I hope this is helpful for you.