This is the only step-by-step guide you will need in order to start collecting web data from target sites and saving it as CSV files in under 10 minutes.
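In this article we will discuss:
- Selenium: What it is, and how it is used
- Puppeteer vs. Selenium
- A step-by-step guide to scraping with Selenium
- Integrating proxies with Selenium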
Selenium: What it is, and how it is used
Selenium is open-source software that includes a variety of tools and libraries that enable browser automation activities, including:
- Web page-based element actions/retrieval (e.g. close, back, get_cookie, get_screenshot_as_png, get_window_size)
- Site testing
- Managing alert prompts and cookies (adding/removing)
- Form element submission
- Data collection / web scraping
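To make the first bullet concrete, here is a minimal sketch of the driver calls it names, assuming Selenium's Python bindings. The function name `snapshot_and_leave` and the cookie name passed to it are hypothetical and exist only for illustration.

```python
def snapshot_and_leave(driver, cookie_name):
    """Collect a few facts about the current page, then go back and close.

    `driver` is expected to be a Selenium WebDriver, for example the object
    returned by webdriver.Firefox(). `cookie_name` is a hypothetical cookie
    to look up.
    """
    info = {
        # get_window_size() returns a dict with 'width' and 'height'
        "window_size": driver.get_window_size(),
        # get_cookie() returns the cookie as a dict, or None if it is absent
        "cookie": driver.get_cookie(cookie_name),
        # get_screenshot_as_png() returns the screenshot as raw PNG bytes
        "screenshot_bytes": len(driver.get_screenshot_as_png()),
    }
    driver.back()   # go one step back in the browser history
    driver.close()  # close the current window
    return info
```

You would call it as `snapshot_and_leave(browser, 'some_cookie')` once a page has been loaded in the browser.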
For your convenience, I have included a link to the official Selenium 4.1.5 documentation library.
Puppeteer vs. Selenium
A step-by-step guide to scraping with Selenium
Step One: Install Selenium
For those of you who have pip (the package installer for Python) on your computers, all you need to do is open a terminal and type in:
pip install -U selenium
Otherwise, you can download the Selenium source archive from PyPI, unarchive it, and run:
python setup.py install
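To confirm that the installation worked, you can ask Python which version it sees. This is a small sketch using the standard-library `importlib.metadata` module (Python 3.8 or later); the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if Selenium is installed, otherwise None
print(installed_version("selenium"))
```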
Do note that you will need a driver so that Selenium can interface with your browser of choice, for example geckodriver for Firefox or ChromeDriver for Chrome.
Let’s use Firefox as an example browser. The script opens Firefox, goes to a web page (say, Yahoo), searches for “seleniumhq”, and then closes the browser. Here’s what that would look like in code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()

browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title

elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)

browser.quit()
Step Two: Importing supporting packages
Selenium is rarely used in isolation. It is usually paired with other packages, for example Pandas (an easy-to-use, open-source data analysis tool) and Python's built-in time module. Here is what you should type in to import them:
from selenium import webdriver
import time
import pandas as pd
Step Three: Defining variables
In this step we will define our target folder, search query, and target site. In this example we will be aiming to map different job opportunities as displayed by competing companies on LinkedIn. What you type in should look something like this:
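from selenium.webdriver.chrome.service import Service

FILE_PATH_FOLDER = 'F:....Competitive_Analysis'
search_query = 'https://www.linkedin.com/q-chief-financial-officer-jobs.html'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('C:/.../chromedriver_win32/chromedriver.exe'))
job_details = []  # will hold one list of fields per job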
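If you want to search for more than one job title, the search URL can be built rather than typed by hand. The sketch below assumes the target site follows the `q-<title>-jobs.html` pattern seen in the example URL, which is inferred from that single example and may not hold on the real site. The helper name `build_search_url` is hypothetical.

```python
import re


def build_search_url(base_url, job_title):
    """Turn a job title into a URL shaped like the example search_query."""
    # Lowercase the title and collapse every run of characters that is not
    # a letter or digit into a single hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", job_title.lower()).strip("-")
    return f"{base_url.rstrip('/')}/q-{slug}-jobs.html"


# Assumption: same site and URL pattern as the example above
search_query = build_search_url("https://www.linkedin.com", "Chief Financial Officer")
```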
Step Four: HTML tag inspection
Elements on a web page typically carry a tag, class, or attribute that identifies the information they display. The technique here is to use those identifiers to locate the elements you want when crawling the target site. You can find them by:
- Right clicking anywhere on the page, and hitting ‘inspect’
- Then either clicking the arrow that appears at the top left-hand corner or pressing Ctrl+Shift+C in order to inspect a specific element and obtain the desired HTML tag
Here’s what that looks like:
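from selenium.webdriver.common.by import By

driver.get(search_query)
time.sleep(5)  # give the page time to load
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium 4 releases no longer provide
job_list = driver.find_elements(By.XPATH, "//div[@data-tn-component='organicJob']")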
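To see how an attribute-based XPath like this one picks out elements, you can try it on a tiny, hand-written page without opening a browser. The sketch below uses Python's standard-library `xml.etree.ElementTree`, which understands a subset of XPath and only parses well-formed markup, so it illustrates the selector and does not replace Selenium. The sample markup, including the `sponsoredJob` value, is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: two matching cards and one that should be skipped
SAMPLE = """
<html><body>
  <div data-tn-component="organicJob">
    <h2 class="title"><a>Chief Financial Officer</a></h2>
    <span class="company">Example Corp</span>
  </div>
  <div data-tn-component="sponsoredJob">
    <h2 class="title"><a>Sponsored role</a></h2>
  </div>
  <div data-tn-component="organicJob">
    <h2 class="title"><a>Finance Director</a></h2>
    <span class="company">Sample Ltd</span>
  </div>
</body></html>
"""


def organic_job_titles(markup):
    """Return the title text of every div marked as an organic job."""
    root = ET.fromstring(markup.strip())
    # Same attribute test as in the Selenium call: keep only divs whose
    # data-tn-component attribute equals 'organicJob'
    cards = root.findall(".//div[@data-tn-component='organicJob']")
    # The leading '.' makes the inner query relative to each card
    return [card.find(".//h2[@class='title']/a").text for card in cards]


print(organic_job_titles(SAMPLE))
```

The middle card is skipped because its attribute value does not match, which is the same filtering the Selenium query performs on the live page.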
Step Five: Specific data point extraction
We will extract our target data points by running an XPath query against each job element, then quit the driver (which also closes the browser) once the target data has been collected.
We will target data points as follows:
- Job title
- Company name
- Job location
- Job description
- Date job was uploaded
Here’s what that looks like:
from selenium.webdriver.common.by import By

for each_job in job_list:
    # Getting job info (find_element returns a single element, so .text is available)
    job_title = each_job.find_element(By.XPATH, ".//h2[@class='title']/a")
    job_company = each_job.find_element(By.XPATH, ".//span[@class='company']")
    job_location = each_job.find_element(By.XPATH, ".//span[@class='location accessible-contrast-color-location']")
    job_summary = each_job.find_element(By.XPATH, ".//div[@class='summary']")
    job_publish_date = each_job.find_element(By.XPATH, ".//span[@class='date ']")

    # Saving job info
    job_info = [job_title.text, job_company.text, job_location.text, job_summary.text, job_publish_date.text]

    # Saving into job_details
    job_details.append(job_info)

# Close the browser once every job has been processed
driver.quit()
Please note that the target site can change these selectors at any time, so confirm that they are still correct before each run rather than assuming that they are.
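Because selectors can stop matching, it can help to read each field defensively so that one missing element does not abort the whole run. The following is a minimal sketch, not part of the original script: `first_text` is a hypothetical helper that works with any object offering Selenium's `find_elements` method.

```python
def first_text(parent, xpath, default=""):
    """Return the text of the first element matching `xpath`, or `default`."""
    # find_elements returns an empty list when nothing matches, so a selector
    # that no longer matches yields `default` instead of raising an exception.
    # The string "xpath" is the value of Selenium's By.XPATH constant.
    matches = parent.find_elements("xpath", xpath)
    return matches[0].text.strip() if matches else default
```

Inside the loop you could then write, for instance, `first_text(each_job, ".//span[@class='company']")`, and an empty column in the CSV would tell you which selector needs another look.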
Step Six: Saving the data in preparation for output
At this point you will want to build a data frame with named columns and make use of the ‘to_csv’ method in order to save all of the obtained data in CSV format as follows:
from pathlib import Path

job_details_df = pd.DataFrame(job_details, columns=['title', 'company', 'location', 'summary', 'publish_date'])
# Write into FILE_PATH_FOLDER rather than the current working directory
job_details_df.to_csv(Path(FILE_PATH_FOLDER) / 'job_details.csv', index=False)
Your CSV file will be saved to the following location: FILE_PATH_FOLDER
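The same file layout can also be produced without Pandas. The sketch below swaps in Python's standard-library `csv` module, so it is an alternative to the Pandas approach rather than the method used in this guide. The function name `save_rows` is hypothetical, and the column names are the ones listed above.

```python
import csv
from pathlib import Path

COLUMNS = ["title", "company", "location", "summary", "publish_date"]


def save_rows(rows, folder, filename="job_details.csv"):
    """Write `rows` (one list of five strings per job) to folder/filename."""
    target = Path(folder) / filename
    # newline="" lets the csv module control line endings itself
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)  # header row
        writer.writerows(rows)    # one row per job
    return target
```

Calling `save_rows(job_details, FILE_PATH_FOLDER)` would write the header followed by one line per collected job and return the path of the file.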
That’s it, you have just successfully completed your first web scraping job with Selenium.
Integrating proxies with Selenium
Integrating proxies with Selenium can help you:
- Perform data collection in a variety of geolocations
- Collect data at scale with less risk of being blocked (e.g. by rate limitations on IPs that send ‘too many’ concurrent/consecutive data requests). In this context you may also want to look into dedicated web unlocking services.
- Collect data from the viewpoint of a real user IP so that you are not served misleading information by potentially suspicious target sites
Proxy integration with Selenium can be accomplished by:
- Going to your Bright Data Dashboard and clicking ‘create a Zone’.
- Choosing ‘Network type’ and then clicking ‘save’.
- Then heading to Selenium and filling in the ‘Proxy IP:Port’ in the ‘setProxy’ function, for example zproxy.lum-superproxy.io:22225, for both HTTP and HTTPS.
- Under ‘sendKeys’, inputting your Bright Data account ID and proxy Zone name (lum-customer-CUSTOMER-zone-YOURZONE) and your Zone password, found in the Zone settings.
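In Python, the proxy address from the steps above can be handed to Chrome through a command-line argument. This is a rough sketch under several assumptions: it uses Chrome's `--proxy-server` switch via `ChromeOptions` instead of the ‘setProxy’ function named above, the helper names are made up, and it does not cover how the username and password are submitted, since that switch carries only the host and port.

```python
def proxy_username(customer_id, zone_name):
    """Build the username in the lum-customer-CUSTOMER-zone-YOURZONE format."""
    return f"lum-customer-{customer_id}-zone-{zone_name}"


def chrome_proxy_argument(host="zproxy.lum-superproxy.io", port=22225):
    """Return the Chrome argument that routes traffic through host:port."""
    return f"--proxy-server=http://{host}:{port}"


def make_chrome_options():
    """Create ChromeOptions with the proxy argument applied (needs Selenium)."""
    # Imported here so the two helpers above work without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(chrome_proxy_argument())
    return options
```

The options object would then be passed along when the driver is created, as in `webdriver.Chrome(options=make_chrome_options())`.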