By Zac Clancy for Kite.com
Table of Contents
- Introducing web scraping
- Some use cases of web scraping
- How does it work?
- Robots.txt
- A simple example
- Working with HTML
- Data processing
- Next steps
Introducing web scraping
Simply put, web scraping is one of the tools developers use to gather and analyze information from the Internet.
Some websites and platforms offer application programming interfaces (APIs) which we can use to access information in a structured way, but others might not. While APIs are certainly becoming the standard way of interacting with today’s popular platforms, we don’t have this luxury with most of the websites on the internet.
Rather than reading data from standard API responses, we’ll need to find the data ourselves by reading the website’s pages and feeds.
Some use cases of web scraping
The World Wide Web was born in 1989 and web scraping and crawling entered the conversation not long after in 1993.
Before scraping, search engines were lists of links compiled by a website’s administrator and arranged into one long list somewhere on their website. The first web scraper and crawler, the World Wide Web Wanderer, was created to follow all these indexes and links to try and determine how big the internet was.
It wasn’t long after this that developers started using crawlers and scrapers to create crawler-based search engines that didn’t require human assistance. These crawlers would simply follow the links they came across on each page and save information about the page. Since the web is a collaborative effort, a crawler could easily follow embedded links from one website to other platforms, and the process could continue indefinitely.
Nowadays, web scraping has its place in nearly every industry. In newsrooms, web scrapers are used to pull in information and trends from thousands of different internet platforms in real time.
Spending a little too much on Amazon this month? Websites exist that will let you know, and, in most cases, will do so by using web scraping to access that specific information on your behalf.
Machine learning and artificial intelligence companies are scraping billions of social media posts to better learn how we communicate online.
So how does it work?
The process a developer builds for web scraping looks a lot like the process a user takes with a browser:
- A URL is given to the program.
- The program downloads the response from the URL.
- The program processes the downloaded file depending on the data required.
- The program starts over with a new URL.
The nitty gritty comes in steps 3 and 4, in which data is processed and the program determines how to continue (or if it should at all). For Google’s crawlers, step 3 likely includes collecting all URL links on the page so that the web scraper has a list of places to begin checking next. This is recursive by design and allows Google to efficiently follow paths and discover new content.
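The four steps above can be sketched as a small loop. This is only an illustration under a few assumptions: fetch is a hypothetical callable that takes a URL and returns the page’s HTML as text, max_pages is an invented safety limit, and a queue is used in place of literal recursion. To keep the sketch free of third-party dependencies it extracts links with the standard library’s HTMLParser rather than one of the parsing libraries discussed below.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Visit pages breadth-first and return the URLs in visiting order.

    fetch is a hypothetical callable: it takes a URL and returns HTML text.
    """
    queue = deque([start_url])  # step 1: URLs waiting to be visited
    seen = {start_url}
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # step 2: download the response
        visited.append(url)

        collector = LinkCollector()
        collector.feed(html)  # step 3: process the downloaded page

        for href in collector.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # step 4: start over with a new URL

    return visited
```

In a real scraper, fetch might be a thin wrapper such as lambda url: requests.get(url).text, and the seen set is what keeps the loop from revisiting the same page forever.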
There are many heavily used, well built libraries for reading and working with the downloaded HTML response. In the Ruby ecosystem Nokogiri is the standard for parsing HTML. For Python, BeautifulSoup has been the standard for 15 years. These libraries provide simple ways for us to interact with the HTML from our own programs.
These code libraries will accept the page source as text, and a parser for handling the content of the text. They’ll return helper functions and attributes which we can use to navigate through our HTML structure in predictable ways and find the values we’re looking to extract.
Scraping projects involve a good amount of time spent analyzing a website’s HTML for classes or identifiers which we can use to find information on the page. Using the HTML below, we can begin to imagine a strategy for extracting product information from the table through the elements with the classes products and product.
<table class="products">
<tr class="product">...</tr>
<tr class="product">...</tr>
</table>
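One possible strategy, sketched with BeautifulSoup, is to select the rows through those two classes. The cell contents below are invented sample data, since the snippet above elides them.

```python
from bs4 import BeautifulSoup

# Sample markup: the classes come from the snippet above,
# the <td> cells are made up for illustration.
html = """
<table class="products">
  <tr class="product"><td>Shirt</td><td>$19.99</td></tr>
  <tr class="product"><td>Jacket</td><td>$124.99</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every .product row inside the .products table
rows = soup.select("table.products tr.product")

# Collect the text of each cell, row by row
products = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in rows
]

print(products)
```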
In the wild, HTML isn’t always as pretty and predictable. Part of the web scraping process is learning about your data and where it lives on the pages as you go along. Some websites go to great lengths to prevent web scraping, some aren’t built with scraping in mind, and others just have complicated user interfaces which our crawlers will need to navigate through.
Robots.txt
While not an enforced standard, it’s been common since the early days of web scraping to check for the existence and contents of a robots.txt file on each site before scraping its content. This file can be used to define inclusion and exclusion rules that web scrapers and crawlers should follow while crawling the site. It is always located at /robots.txt so that scrapers and crawlers can look for it in the same spot on every site. Facebook’s robots.txt file is a robust example, and GitHub’s and Twitter’s are good ones as well.
An example robots.txt file that prohibits web scraping and crawling would look like the below:
User-agent: *
Disallow: /
The User-agent: * section is for all web scrapers and crawlers. In Facebook’s, we see that they set User-agent to be more explicit and have sections for Googlebot, Applebot, and others.
The Disallow: / line informs web scrapers and crawlers who observe the robots.txt file that they aren’t permitted to visit any pages on this site. Conversely, if this line read Allow: /, web scrapers and crawlers would be allowed to visit any page on the website.
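Python’s standard library ships a parser for this file, urllib.robotparser. The sketch below feeds it the two variants described above as strings, so it runs without touching the network; the user agent name my-scraper, the helper build_parser, and the example.com URL are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


deny_all = build_parser("User-agent: *\nDisallow: /")
allow_all = build_parser("User-agent: *\nAllow: /")

# can_fetch(useragent, url) answers: may this agent visit this URL?
blocked = deny_all.can_fetch("my-scraper", "https://example.com/products")
permitted = allow_all.can_fetch("my-scraper", "https://example.com/products")

print(blocked)    # False
print(permitted)  # True
```

Against a live site, the same class can download the file itself: call set_url() with the site’s /robots.txt address, then read(), before asking can_fetch().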
The robots.txt file can also be a good place to learn information about the website’s architecture and structure. Reading where our scraping tools are allowed to go – and not allowed to go – can help inform us on sections of the website we perhaps didn’t know existed, or may not have thought to look at.
If you’re running a website or platform it’s important to know that this file isn’t always respected by every web crawler and scraper. Larger properties like Google, Facebook, and Twitter respect these guidelines with their crawlers and information scrapers, but since robots.txt is considered a best practice rather than an enforceable standard, you may see different results from different parties. It’s also important not to disclose private information in this file that you wouldn’t want to become public knowledge, such as the location of an admin panel at /admin.
A simple example
To illustrate this, we’ll use Python plus the BeautifulSoup and Requests libraries.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://google.com')
soup = BeautifulSoup(page.text, 'html.parser')
We’ll go through this line-by-line:
page = requests.get('https://google.com')
This uses the requests library to make a request to https://google.com and return the response.
soup = BeautifulSoup(page.text, 'html.parser')
The requests library assigns the text of our response to an attribute called text, which we use to give BeautifulSoup our HTML content. We also tell BeautifulSoup to use Python 3’s built-in HTML parser, html.parser.
Now that BeautifulSoup has parsed our HTML text into an object that we can interact with, we can begin to see how information may be extracted.
paragraphs = soup.find_all('p')
Using find_all we can tell BeautifulSoup to only return HTML paragraphs <p> from the document.
If we were looking for a div with a specific ID (#content) in the HTML we could do that in a few different ways:
# select_one returns the first match for a CSS selector (select returns a list)
element = soup.select_one('#content')
# or
element = soup.find('div', id='content')
# or
element = soup.find(id='content')
In the Google scenario from above, we can imagine that they have a function that does something similar to grab all the links off of the page for further processing:
links = soup.find_all('a', href=True)
The above snippet will return all of the <a> elements from the HTML which are acting as links to other pages or websites. Most large-scale web scraping implementations will use a function like this to capture local links on the page, outbound links off the page, and then determine some priority for the links’ further processing.
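Separating local links from outbound ones could look something like the sketch below. The function name classify_links is hypothetical, and it takes plain href strings, which in practice might come from [a['href'] for a in soup.find_all('a', href=True)]. It treats a link as local when its host matches the host of the page it was found on, which is a simplifying assumption.

```python
from urllib.parse import urljoin, urlparse


def classify_links(page_url, hrefs):
    """Split hrefs found on page_url into (local, outbound) absolute URLs."""
    page_host = urlparse(page_url).netloc
    local, outbound = [], []

    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)

        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar

        if parsed.netloc == page_host:
            local.append(absolute)
        else:
            outbound.append(absolute)

    return local, outbound


local, outbound = classify_links(
    "https://example.com/blog/",
    ["/about", "post-1", "https://other.example.org/x", "mailto:hi@example.com"],
)
print(local)
print(outbound)
```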
Working with HTML
The most difficult aspect of web scraping is analyzing and learning the underlying HTML of the sites you’ll be scraping. If an HTML element has a consistent ID or set of classes, then we should be able to work with it fairly easily: we can just select it using our HTML parsing library (Nokogiri, BeautifulSoup, etc.). If the element on the page doesn’t have consistent classes or identifiers, we’ll need to access it using a different selector.
Imagine our HTML page contains the following table which we’d like to extract product information from:
NAME | CATEGORY | PRICE |
Shirt | Athletic | $19.99 |
Jacket | Outdoor | $124.99 |
BeautifulSoup allows us to parse tables and other complex elements fairly simply. Let’s look at how we’d read the table’s rows in Python:
# Find all the HTML tables on the page
tables = soup.find_all('table')
# Loop through all of the tables
for table in tables:
    # Access the table's body, falling back to the table itself
    # when the markup has no <tbody>
    table_body = table.find('tbody') or table
    # Grab the rows from the table body
    rows = table_body.find_all('tr')
    # Loop through the rows
    for row in rows:
        # Extract each HTML column from the row
        columns = row.find_all('td')
        # Loop through the columns
        for column in columns:
            # Print the column value
            print(column.text)
The above code snippet would print Shirt, followed by Athletic, and then $19.99 before continuing on to the next table row. While simple, this example illustrates one of the many strategies a developer might take for retrieving data from different HTML elements on a page.
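Printing each cell is enough to see the data, but it is often handier to keep each row together. The sketch below shows one way that might be done, pairing every cell with its column header. The markup is an assumption: it is written here to match the rendered table above, with a header row of <th> cells and a <tbody>.

```python
from bs4 import BeautifulSoup

# Assumed markup for the product table shown above
html = """
<table>
  <thead><tr><th>NAME</th><th>CATEGORY</th><th>PRICE</th></tr></thead>
  <tbody>
    <tr><td>Shirt</td><td>Athletic</td><td>$19.99</td></tr>
    <tr><td>Jacket</td><td>Outdoor</td><td>$124.99</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Use the header cells as lowercase dictionary keys
headers = [th.get_text(strip=True).lower() for th in table.find_all("th")]

# Build one dictionary per body row
records = []
for row in table.find("tbody").find_all("tr"):
    values = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append(dict(zip(headers, values)))

print(records)
```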
Data processing
Researching and inspecting the websites you’ll be scraping for data is a crucial component to each project. We’ll generally have a model that we’re trying to fill with data for each page. If we were scraping restaurant websites we’d probably want to make sure we’re collecting the name, address, and the hours of operation at least, with other fields being added as we’re able to find the information. You’ll begin to notice that some websites are much easier to scrape for data than others – some are even defensive against it!
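As a rough illustration of such a model, here is a hypothetical Restaurant record built with Python’s dataclasses. The three required fields come from the example above; phone, extras, and the missing_fields helper are invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Restaurant:
    # Fields we want to collect from every page
    name: str
    address: str
    hours: str
    # Fields filled in only when a page happens to provide them
    phone: Optional[str] = None
    extras: dict = field(default_factory=dict)

    def missing_fields(self):
        """Return the names of required fields that are still empty."""
        required = ("name", "address", "hours")
        return [key for key in required if not getattr(self, key)]


# A page where the scraper could not find an address
record = Restaurant(name="Example Diner", address="", hours="9am to 5pm")
print(record.missing_fields())
```

A check like missing_fields can help flag the pages that need a different selector or a closer look at their HTML.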
Once you’ve got your data in hand there are a number of different options for handling, presenting, and accessing that data. In many cases you’ll probably want to handle the data yourself, but there’s a slew of services offered for many use cases by various platforms and companies.
- Search indexing: Looking to store the text contents of websites and easily search? Algolia and Elasticsearch are good for that.
- Text analysis: Want to extract people, places, money and other entities from the text? Maybe spaCy or Google’s Natural Language API are for you.
- Maps and location data: If you’ve collected some addresses or landmarks, you can use OpenStreetMap or MapBox to bring that location data to life.
- Push notifications: If you want to get a text message when your web crawler finds a specific result, check out Twilio or Pusher.
Next steps
In this post, we learned about the basics of web scraping and looked at some simple crawling examples that demonstrated how we can interact with HTML pages from our own code. Ruby’s Nokogiri, Python’s BeautifulSoup, and JavaScript’s Nightmare are powerful tools to begin learning web scraping with. These libraries are relatively simple to start with, but offer powerful interfaces that extend to more advanced use cases.
Moving forward from this post, try to create a simple web scraper of your own! You could potentially write a simple script that reads a tweet from a URL and prints the tweet text into your terminal. With some practice, you’ll be analyzing HTML on all the websites you visit, learning its structure, and understanding how you’d navigate its elements with a web scraper.
This article originally appeared on Kite.com (Reprinted with permission)