How to Extract Amazon Results with Python and Selenium
In this tutorial, we will use Selenium to paginate through Amazon search results pages in a loop and save the scraped data to a JSON Lines file.
What is Selenium?
Selenium is an open-source browser automation tool, mainly used for testing web applications. It can mimic a user's inputs, including mouse movements, key presses, and page navigation. It also provides many methods for selecting elements on a page. The main workhorse behind the library is WebDriver, which makes browser automation jobs very easy to do.
Essential Package Installation
For this tutorial, we need to install Selenium together with a few other packages.
Note: this walkthrough was done on a Mac.
To install Selenium, just type the following in a terminal:
pip install selenium
To manage the webdriver binary, we will use webdriver-manager. Selenium can control the most popular web browsers, including Chrome, Opera, Internet Explorer, Safari, and Firefox. We will use Chrome.
pip install webdriver-manager
Then, we need Selectorlib for parsing the HTML pages that we download:
pip install selectorlib
Setting Up the Environment
After that, create a new folder on your desktop and add the following files:
$ cd Desktop
$ mkdir amazon_scraper
$ cd amazon_scraper/
$ touch amazon_results_scraper.py
$ touch search_results_urls.txt
$ touch search_results_output.jsonl
You also need to place a file named "search_results.yml" in the project directory. This file will be used later to grab the data for every product on a page using CSS selectors. You can get the file here.
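In case the download link is unavailable, here is a minimal sketch of what a Selectorlib template for this script might look like. The CSS selectors below are assumptions about Amazon's result-page markup and may need updating; the scraper only requires that each product record exposes at least a title field:
products:
    css: 'div.s-result-item'
    multiple: true
    children:
        title:
            css: 'h2 a.a-link-normal'
            type: Text
        price:
            css: 'span.a-price > span.a-offscreen'
            type: Text
        url:
            css: 'h2 a.a-link-normal'
            type: Link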
Then, open a code editor and add the following imports to the file called amazon_results_scraper.py:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selectorlib import Extractor
import requests
import json
import time
After that, create a function called search_amazon that takes the string for the item we want to search for on Amazon as an input:
def search_amazon(item):
    # we will put our code here
    pass
Using webdriver-manager, you can easily install the right version of ChromeDriver:
def search_amazon(item):
    driver = webdriver.Chrome(ChromeDriverManager().install())
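Note: this article targets the Selenium 3 API. On Selenium 4 and newer, the driver path is passed through a Service object instead; a minimal sketch, assuming selenium>=4 and webdriver-manager are installed:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 style: wrap the downloaded driver path in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))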
How to Load a Page and Select Elements
Selenium provides many methods for selecting page elements. We can select elements by ID, XPath, name, link text, class name, CSS selector, and tag name. You can also use relative locators to target page elements in relation to other elements. For our purposes, we will use ID, class name, and XPath. Let's load the Amazon homepage: below the line that defines the driver, type the following:
driver.get('https://www.amazon.com')
This opens a Chrome browser and navigates to Amazon's homepage. Next, we need the locations of the page elements we will interact with. Specifically, we need to:
- Enter the name of the item(s) we want to search for in the search bar.
- Click the search button.
- Search through the results page for the item(s).
- Repeat for the following result pages.
Right-click on the search bar and, from the dropdown menu, click Inspect. This opens the browser developer tools. Then, click on the element-picker icon:
Hover over the search bar and click on it to locate the element in the DOM:
The search bar is an input element with the ID "twotabsearchtextbox". We can select it with Selenium's find_element_by_id() method and then send text input to it by chaining .send_keys('text we want in the search box'):
search_box = driver.find_element_by_id('twotabsearchtextbox').send_keys(item)
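On Selenium 4+, the find_element_by_* helpers were removed; the equivalent call, as a sketch, uses the By locator:
from selenium.webdriver.common.by import By

# Select the search bar by ID, then type the search term into it
search_box = driver.find_element(By.ID, 'twotabsearchtextbox')
search_box.send_keys(item)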
Next, repeat the steps we took to locate the search bar, this time for the magnifying-glass search button:
To click on an element with Selenium, we select it and chain .click() to the end of the statement:
search_button = driver.find_element_by_id("nav-search-submit-text").click()
After clicking search, we need to wait for the website to load the first page of results, or we may get errors. You could use:
import time
time.sleep(5)
However, Selenium has a built-in method that tells the driver to wait for a specific amount of time:
driver.implicitly_wait(5)
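Note that implicitly_wait() only applies while Selenium looks up elements. A more robust approach is an explicit wait for a known element; here is a minimal sketch, where the "s-main-slot" results-container class is an assumption about Amazon's current markup:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the search results container to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 's-main-slot'))
)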
Now for the hard part: we want to find out how many result pages there are and iterate through each page. There are plenty of clever ways to do this, but we will apply a quick solution: locate the element on the page that shows the total number of result pages and select it with XPath.
We can see that the total number of result pages is given in the 6th list element (an li tag) of a list with the class "a-pagination". To be safe, we will place two options within a try/except block: one selecting from the "a-pagination" tag and, in case that fails for whatever reason, one selecting the element below it with the class "a-last".
A common error when using Selenium is the NoSuchElementException, which is thrown when Selenium cannot find an element on the page. It can occur if an element has not loaded yet or if the element's location on the page has changed. We can catch the error and try to select something else if our first option fails, using try/except:
try:
    num_page = driver.find_element_by_xpath('//*[@class="a-pagination"]/li[6]')
except NoSuchElementException:
    num_page = driver.find_element_by_class_name('a-last')
Now, make the driver wait for a few seconds:
driver.implicitly_wait(3)
We have selected the element on the page that shows the total number of result pages, and we want to iterate through every page, collecting the current URL into a list that we will later feed to another script. It is time to use num_page: get the text from that element, cast it to an integer, and put it in a for loop:
url_list = []
for i in range(int(num_page.text)):
    page_ = i + 1
    url_list.append(driver.current_url)
    driver.implicitly_wait(4)
    driver.find_element_by_class_name('a-last').click()
    print("Page " + str(page_) + " grabbed")
driver.quit()
with open('search_results_urls.txt', 'w') as filehandle:
    for result_page in url_list:
        filehandle.write('%s\n' % result_page)
print("---DONE---")
Integrate an Amazon Search Results Page Scraper into the Script
Now that we have written our function to search for our items and iterate through the result pages, we want to grab and save the data. To do so, we will use an Amazon search results page scraper from xbyte.io.
The scrape function uses the URLs in the text file to download the HTML and extract the relevant data, including name, price, and product URL, based on the selectors in 'search_results.yml'. Below the search_amazon() function, place the following call:
search_amazon('phones')
Finally, we place the driver code that calls scrape(url) after the search_amazon() call:
And that's it! After running the code, the search_results_output.jsonl file will hold the data for all the items scraped from the search.
Here is the completed script:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selectorlib import Extractor
import requests
import json
import time

def search_amazon(item):
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get('https://www.amazon.com')
    # Type the search term into the search bar and click the search button
    search_box = driver.find_element_by_id('twotabsearchtextbox').send_keys(item)
    search_button = driver.find_element_by_id("nav-search-submit-text").click()
    driver.implicitly_wait(5)
    # Find the element that shows the total number of result pages,
    # falling back to the "a-last" element if the first lookup fails
    try:
        num_page = driver.find_element_by_xpath('//*[@class="a-pagination"]/li[6]')
    except NoSuchElementException:
        num_page = driver.find_element_by_class_name('a-last')
    driver.implicitly_wait(3)
    # Iterate through the result pages, collecting the URL of each one
    url_list = []
    for i in range(int(num_page.text)):
        page_ = i + 1
        url_list.append(driver.current_url)
        driver.implicitly_wait(4)
        driver.find_element_by_class_name('a-last').click()
        print("Page " + str(page_) + " grabbed")
    driver.quit()
    # Write the collected URLs to a text file, one per line
    with open('search_results_urls.txt', 'w') as filehandle:
        for result_page in url_list:
            filehandle.write('%s\n' % result_page)
    print("---DONE---")

def scrape(url):
    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the parsed data
    return e.extract(r.text)

search_amazon('Macbook Pro')  # <------ search query goes here.

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('search_results.yml')

with open("search_results_urls.txt", 'r') as urllist, open('search_results_output.jsonl', 'w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url)
        if data:
            for product in data['products']:
                product['search_url'] = url
                print("Saving Product: %s" % product['title'].encode('utf8'))
                json.dump(product, outfile)
                outfile.write("\n")
                # time.sleep(5)
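To sanity-check the results, you can load the JSON Lines output back into Python; a minimal sketch:
import json

# Each line of the .jsonl file is one product record
with open('search_results_output.jsonl') as f:
    products = [json.loads(line) for line in f if line.strip()]
print('Scraped %d products' % len(products))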
Constraints
The script works well for broad searches, but will fail on specific searches for items that return fewer than 5 pages of results. We may work on improving this in the future for scraping Amazon product data.
Disclaimer
Amazon does not permit automated extraction of its site, and you should consult the robots.txt file before doing any large-scale collection of data. This project was made for learning purposes only. So, in case you get blocked, you have been warned!
For more details, contact X-Byte Enterprise Crawling or ask for a free quote!
For more, visit: https://www.xbyte.io/how-to-extract-amazon-results-with-python-and-selenium.php