Introduction to Web Scraping using Selenium
This article first appeared on Medium.com
What is Web Scraping?
As the name suggests, web scraping is a technique used for extracting data from websites. It is an automated process in which an application parses the HTML of a web page to extract data for manipulation, such as converting the page to another format or copying it into a local database or spreadsheet for later retrieval or analysis.
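To make the idea concrete before we bring in Selenium, here is a minimal, self-contained sketch of "processing the HTML of a web page to extract data", using only Python's built-in html.parser module on an inline HTML snippet (the snippet and class name are illustrative, not part of this tutorial's project):

```python
from html.parser import HTMLParser

# A tiny parser that extracts (link text, href) pairs from HTML.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False
        self._href = None

    def handle_starttag(self, tag, attrs):
        # Remember we are inside an <a> tag and capture its href attribute.
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # The text between <a> and </a> is the link's visible label.
        if self._in_link:
            self.links.append((data.strip(), self._href))
            self._in_link = False

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

snippet = '<p>See <a href="https://example.com">Example</a> for details.</p>'
parser = LinkExtractor()
parser.feed(snippet)
print(parser.links)  # [('Example', 'https://example.com')]
```

Real pages are rendered by JavaScript and need careful waiting, which is exactly why we reach for Selenium below.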
What will we build?
In this tutorial we will build a web scraping program that scrapes a GitHub user profile and gets the repository names and languages of the pinned repositories. If you would like to jump straight into the project, here is a link to the repo on GitHub: https://github.com/TheDancerCodes/Selenium-Webscraping-Example
What will we require?
We will use the following packages and driver:

* selenium package - used to automate web browser interaction from Python.
* ChromeDriver - the driver that allows Selenium to open and perform tasks in the Chrome browser.
* Virtualenv - to create an isolated Python environment for our project.
* Extras: the Selenium-Python ReadTheDocs resource.
- Setup project
- Import Modules
- Make The Request
- Get the Response
- Run the program
Set up the project
Create a new project folder. Within that folder, create a requirements file named requirements.txt.

In this file, type in our dependency, selenium.
Open up your command line and create a virtual environment using the basic command:
$ virtualenv webscraping_example
Next, activate the virtual environment (source webscraping_example/bin/activate on macOS/Linux) and install the dependency by running the following command in the terminal:

(webscraping_example) $ pip install -r requirements.txt
Import Required Modules
Specify the modules required for the project.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
```
Make The Request
When making the request we need to consider the following:

1. Pass in the desired website URL.
2. Implement a try/except block to handle a timeout, should one occur.
```python
# Specify incognito mode as you launch your browser [OPTIONAL]
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# Create new instance of Chrome in incognito mode
browser = webdriver.Chrome(executable_path='/Library/Application Support/Google/chromedriver', chrome_options=option)

# Go to desired website
browser.get("https://github.com/TheDancerCodes")

# Wait 20 seconds for page to load
timeout = 20
try:
    # Wait until the final element [Avatar link] is loaded.
    # Assumption: If the avatar link is loaded, the whole page is relatively
    # loaded, because it is among the last things to be loaded.
    WebDriverWait(browser, timeout).until(
        EC.visibility_of_element_located(
            (By.XPATH, "//img[@class='avatar width-full rounded-2']")))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()
```
Get the Response
Once we make a request and it is successful, we need to get a response. We will break the response into two parts and combine them at the end. The response is the title and language of each pinned repository of the GitHub profile.
```python
# find_elements_by_xpath returns an array of selenium objects.
titles_element = browser.find_elements_by_xpath("//a[@class='text-bold']")

# List comprehension to get the actual repo titles and not the selenium objects.
titles = [x.text for x in titles_element]

print('TITLES:')
print(titles, '\n')

# Get all of the pinned repo languages
language_element = browser.find_elements_by_xpath("//p[@class='mb-0 f6 text-gray']")

# Same concept as the list comprehension above.
languages = [x.text for x in language_element]

print("LANGUAGES:")
print(languages, '\n')

# Pair each title with its corresponding language using the zip function
# and print each pair.
for title, language in zip(titles, languages):
    print("RepoName : Language")
    print(title + ": " + language, '\n')
```
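Because the two lists line up by position, the zipped pairs can also be collected into a dictionary, which is handy if you want to look up a language by repository name later. A small sketch, using illustrative sample values in place of the live scraped lists:

```python
# Sample data standing in for the titles and languages scraped above.
titles = ["Selenium-Webscraping-Example", "android-demo"]
languages = ["Python", "Kotlin"]

# Build a {repo_title: language} mapping from the two parallel lists.
repo_languages = dict(zip(titles, languages))
print(repo_languages)
# {'Selenium-Webscraping-Example': 'Python', 'android-demo': 'Kotlin'}
```

Note that zip stops at the shorter list, so if a pinned repository has no detected language, its title is silently dropped from the pairing.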
Run the program
Finally, execute the program by running it directly in your IDE or by using the following command:
```shell
(webscraping_example) $ python webscraping_example.py
```
You can read more on web scraping on the Web Scraping Wikipedia page.

- If you would like to try out this example, here's the link to the source code on GitHub: https://github.com/TheDancerCodes/Selenium-Webscraping-Example