How to Extract Amazon Results with Python and Selenium
In this tutorial, we will use Selenium to paginate through Amazon search results pages in a loop and save the scraped data to a JSON Lines file.
What is Selenium?
Selenium is an open-source browser automation tool, mainly used for testing web applications. It can mimic a user's inputs, including mouse movements, key presses, and page navigation. It also provides many methods for selecting elements on a page. The main workhorse behind the library is WebDriver, which makes browser automation jobs very easy to do.
Essential Package Installation
For this tutorial, we need to install Selenium together with a few other packages.
Note: this walkthrough was done on a Mac.
To install Selenium, just type the following in a terminal:
pip install selenium
To manage the webdriver binary, we will use webdriver-manager. Selenium can control the most popular web browsers, including Chrome, Opera, Internet Explorer, Safari, and Firefox. We will use Chrome.
pip install webdriver-manager
Then, we need Selectorlib for parsing the HTML pages that we download:
pip install selectorlib
Setting Up the Environment
After that, create a new folder on your desktop and add the following files:
$ cd Desktop
$ mkdir amazon_scraper
$ cd amazon_scraper/
$ touch amazon_results_scraper.py
$ touch search_results_urls.txt
$ touch search_results_output.jsonl
You also need to place a file named "search_results.yml" in the project directory. This file will be used later to grab the data for every product on a page using CSS selectors. You can get the file here.
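In case the download link is unavailable, here is a minimal sketch of what a Selectorlib template for this script might look like. The CSS selectors below are assumptions about Amazon's result-page markup and may need updating; the scraper only requires that each product record exposes at least a title field:
products:
    css: 'div.s-result-item'
    multiple: true
    children:
        title:
            css: 'h2 a.a-link-normal'
            type: Text
        price:
            css: 'span.a-price > span.a-offscreen'
            type: Text
        url:
            css: 'h2 a.a-link-normal'
            type: Link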
Then, open a code editor and add the following imports to the file called amazon_results_scraper.py:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selectorlib import Extractor
import requests
import json
import time
After that, create a function called search_amazon that takes the string for the item we want to search for on Amazon as an input:
def search_amazon(item):
    # we will put our code here
    pass
Using webdriver-manager, you can easily install the right version of ChromeDriver:
def search_amazon(item):
    driver = webdriver.Chrome(ChromeDriverManager().install())
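Note: this article targets the Selenium 3 API. On Selenium 4 and newer, the driver path is passed through a Service object instead; a minimal sketch, assuming selenium>=4 and webdriver-manager are installed:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 style: wrap the downloaded driver path in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))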
How to Load a Page and Select Elements
Selenium provides many methods for selecting page elements. We can select elements by ID, XPath, name, link text, class name, CSS selector, and tag name. You can also use relative locators to target page elements in relation to other elements. For our purposes, we will use ID, class name, and XPath. Let's load the Amazon homepage: below the line that defines the driver, type the following:
driver.get('https://www.amazon.com')
This opens a Chrome browser and navigates to Amazon's homepage. Next, we need the locations of the page elements we will interact with. Specifically, we need to:
- Enter the name of the item(s) we want to search for in the search bar.
- Click the search button.
- Search through the results page for the item(s).
- Repeat for the following result pages.
Right-click on the search bar and, from the dropdown menu, click Inspect. This opens the browser developer tools. Then, click on the element-picker icon:
Hover over the search bar and click on it to locate the element in the DOM:
The search bar is an input element with the ID "twotabsearchtextbox". We can select it with Selenium's find_element_by_id() method and then send text input to it by chaining .send_keys('text we want in the search box'):
search_box = driver.find_element_by_id('twotabsearchtextbox').send_keys(item)
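On Selenium 4+, the find_element_by_* helpers were removed; the equivalent call, as a sketch, uses the By locator:
from selenium.webdriver.common.by import By

# Select the search bar by ID, then type the search term into it
search_box = driver.find_element(By.ID, 'twotabsearchtextbox')
search_box.send_keys(item)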
Next, repeat the steps we took to locate the search bar, this time for the magnifying-glass search button:
To click on an element with Selenium, we select it and chain .click() to the end of the statement:
search_button = driver.find_element_by_id("nav-search-submit-text").click()
After clicking search, we need to wait for the website to load the first page of results, or we may get errors. You could use:
import time
time.sleep(5)
However, Selenium has a built-in method that tells the driver to wait for a specific amount of time:
driver.implicitly_wait(5)
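Note that implicitly_wait() only applies while Selenium looks up elements. A more robust approach is an explicit wait for a known element; here is a minimal sketch, where the "s-main-slot" results-container class is an assumption about Amazon's current markup:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the search results container to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 's-main-slot'))
)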
Now for the hard part: we want to find out how many result pages there are and iterate through each page. There are plenty of clever ways to do this, but we will apply a quick solution: locate the element on the page that shows the total number of result pages and select it with XPath.
We can see that the total number of result pages is given in the 6th list element (an li tag) of a list with the class "a-pagination". To be safe, we will place two options within a try/except block: one selecting from the "a-pagination" tag and, in case that fails for whatever reason, one selecting the element below it with the class "a-last".
A common error when using Selenium is the NoSuchElementException, which is thrown when Selenium cannot find an element on the page. It can occur if an element has not loaded yet or if the element's location on the page has changed. We can catch the error and try to select something else if our first option fails, using try/except:
try:
    num_page = driver.find_element_by_xpath('//*[@class="a-pagination"]/li[6]')
except NoSuchElementException:
    num_page = driver.find_element_by_class_name('a-last')
Now, make the driver wait for a few seconds:
driver.implicitly_wait(3)
We have selected the element on the page that shows the total number of result pages, and we want to iterate through every page, collecting the current URL into a list that we will later feed to another script. It is time to use num_page: get the text from that element, cast it to an integer, and put it in a for loop:
url_list = []
for i in range(int(num_page.text)):
    page_ = i + 1
    url_list.append(driver.current_url)
    driver.implicitly_wait(4)
    driver.find_element_by_class_name('a-last').click()
    print("Page " + str(page_) + " grabbed")
driver.quit()
with open('search_results_urls.txt', 'w') as filehandle:
    for result_page in url_list:
        filehandle.write('%s\n' % result_page)
print("---DONE---")
Integrate an Amazon Search Results Page Scraper into the Script
Now that we have written our function to search for our items and iterate through the result pages, we want to grab and save the data. To do so, we will use an Amazon search results page scraper from xbyte.io.
The scrape function uses the URLs in the text file to download the HTML and extract the relevant data, including name, price, and product URL, based on the selectors in 'search_results.yml'. Below the search_amazon() function, place the following call:
search_amazon('phones')
Finally, we place the driver code that calls scrape(url) after the search_amazon() call:
And that's it! After running the code, the search_results_output.jsonl file will hold the data for all the items scraped from the search.
Here is the completed script:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selectorlib import Extractor
import requests
import json
import time

def search_amazon(item):
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get('https://www.amazon.com')
    # Type the search term into the search bar and click the search button
    search_box = driver.find_element_by_id('twotabsearchtextbox').send_keys(item)
    search_button = driver.find_element_by_id("nav-search-submit-text").click()
    driver.implicitly_wait(5)
    # Find the element that shows the total number of result pages,
    # falling back to the "a-last" element if the first lookup fails
    try:
        num_page = driver.find_element_by_xpath('//*[@class="a-pagination"]/li[6]')
    except NoSuchElementException:
        num_page = driver.find_element_by_class_name('a-last')
    driver.implicitly_wait(3)
    # Iterate through the result pages, collecting the URL of each one
    url_list = []
    for i in range(int(num_page.text)):
        page_ = i + 1
        url_list.append(driver.current_url)
        driver.implicitly_wait(4)
        driver.find_element_by_class_name('a-last').click()
        print("Page " + str(page_) + " grabbed")
    driver.quit()
    # Write the collected URLs to a text file, one per line
    with open('search_results_urls.txt', 'w') as filehandle:
        for result_page in url_list:
            filehandle.write('%s\n' % result_page)
    print("---DONE---")

def scrape(url):
    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to see if the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the parsed data
    return e.extract(r.text)

search_amazon('Macbook Pro')  # <------ search query goes here.

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('search_results.yml')

with open("search_results_urls.txt", 'r') as urllist, open('search_results_output.jsonl', 'w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url)
        if data:
            for product in data['products']:
                product['search_url'] = url
                print("Saving Product: %s" % product['title'].encode('utf8'))
                json.dump(product, outfile)
                outfile.write("\n")
                # time.sleep(5)
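To sanity-check the results, you can load the JSON Lines output back into Python; a minimal sketch:
import json

# Each line of the .jsonl file is one product record
with open('search_results_output.jsonl') as f:
    products = [json.loads(line) for line in f if line.strip()]
print('Scraped %d products' % len(products))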
Constraints
The script works well for broad searches, but will fail on specific searches for items that return fewer than 5 pages of results. We may work on improving this in the future for scraping Amazon product data.
Disclaimer
Amazon does not permit automated extraction of its site, and you should consult the robots.txt file before doing any large-scale collection of data. This project was made for learning purposes only. So, in case you get blocked, you have been warned!
For more details, contact X-Byte Enterprise Crawling or ask for a free quote!
For more, visit: https://www.xbyte.io/how-to-extract-amazon-results-with-python-and-selenium.php