Scrape website tables using Pandas & Requests


[Image: manufacturing_companies]

One way to acquire data for your research or personal data projects is to extract it directly from websites, a practice known as web scraping. There are many ways to get data from a website, including some non-technical ones, but today we are going to use the Python programming language to scrape a data table from a web page.

In this brief tutorial, we will get tabular data from Wikipedia. We will use the popular Python libraries Pandas and Requests.

First, we need to install the libraries. Pandas also needs an HTML parser such as lxml to read tables from a page.

pip install pandas
pip install requests
pip install lxml
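Before going further, it can help to confirm that the packages are importable. The sketch below is one way to check; `check_packages` is a hypothetical helper name, not part of any library. `bs4` and `html5lib` are included on the assumption that Pandas can fall back to them when `lxml` is missing.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return {package_name: True/False} depending on whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


# Package names to look for; only an HTML parser plus pandas and requests
# are needed for this tutorial.
status = check_packages(["pandas", "requests", "lxml", "bs4", "html5lib"])
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If `lxml` shows as missing along with both `bs4` and `html5lib`, reading HTML tables with Pandas will likely fail until a parser is installed.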

Next, we start coding. Import the libraries we are going to use.

import pandas as pd
import requests
We are going to define the webpage (with a table) we want to scrape. The image above shows the table we are going to scrape.
webpage = 'https://en.wikipedia.org/wiki/List_of_largest_manufacturing_companies_by_revenue'
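We will use the Requests library to get the webpage, then use Pandas to read the HTML tables from the response text. Recent versions of Pandas expect the HTML text to be wrapped in StringIO rather than passed as a plain string.
from io import StringIO  # wraps the HTML text in a file-like object
page = requests.get(webpage)
manufacturing_data = pd.read_html(StringIO(page.text))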
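The two lines above assume the request succeeds. A slightly more defensive version might look like the sketch below. The function names `fetch_html` and `read_tables`, the timeout value, and the User-Agent string are all assumptions made for illustration. Some sites reject requests that do not identify themselves, which is why a header is sent here.

```python
from io import StringIO

import pandas as pd
import requests

# Hypothetical User-Agent for illustration; use one that identifies your project.
HEADERS = {"User-Agent": "table-scraping-tutorial/0.1 (you@example.com)"}


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def read_tables(url):
    """Return every HTML table on the page as a list of DataFrames."""
    return pd.read_html(StringIO(fetch_html(url)))
```

With these helpers, `manufacturing_data = read_tables(webpage)` would stand in for the two lines above, and a failed download would raise an error instead of handing an error page to Pandas.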
pd.read_html returns a list with one DataFrame per table found on the page, so next we specify that we want the first table. Zero [0] gets the first table; we would use [1] for the second table, and so on.
first_table = manufacturing_data[0]
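Picking a table by position can break if the page layout changes. As an alternative sketch, `pd.read_html` accepts a `match` argument that keeps only the tables containing the given text. The small HTML snippet and its values below are made up for illustration and stand in for a real page.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables, used instead of a live website.
html = """
<table>
  <tr><th>Rank</th><th>Company</th></tr>
  <tr><td>1</td><td>Alpha</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Count</th></tr>
  <tr><td>Exampleland</td><td>3</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))                    # one DataFrame per table
country_tables = pd.read_html(StringIO(html), match="Country")  # only tables containing "Country"

print(len(all_tables))
print(len(country_tables))
print(list(country_tables[0].columns))
```

Here `all_tables` holds both tables, while `country_tables` holds only the second one, so `country_tables[0]` is the table we asked for regardless of its position on the page.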
Let’s see what our table looks like with the print function. We will print the first 15 rows (the slice starts at row 0 and stops before row 15, so it shows rows 0 through 14).
print(first_table[0:15])
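To make the slicing behavior concrete, here is a small sketch using a made-up 20-row DataFrame in place of the scraped table. It also shows `head(15)`, which selects the same rows.

```python
import pandas as pd

# Stand-in data for illustration: 20 made-up rows instead of the scraped table.
demo = pd.DataFrame(
    {"rank": range(1, 21), "company": [f"Company {i}" for i in range(1, 21)]}
)

first_15 = demo[0:15]      # rows at positions 0 through 14; position 15 is excluded
same_rows = demo.head(15)  # equivalent, and a little more explicit

print(len(first_15))               # 15
print(first_15.index[-1])          # 14
print(first_15.equals(same_rows))  # True
```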
[Image: manufacturing companies python output]
Looks pretty good. We got the data from the webpage table in the form of a Pandas DataFrame. On the far left there is an extra column with no column name, starting at zero. This is a numeric index added by the Pandas library.

Next, we want to save our newly scraped data onto our computer in the form of a CSV spreadsheet file.

I like to specify the folder I’d like to save to.
# specify your folder location 
data_folder = 'C:/Users/Name/Folder_Location'
Then we will save the table as a CSV using the DataFrame’s to_csv method, writing the file into our data folder. Remember that extra column (index) in the DataFrame in the printout above? Let us leave it out of our file by using “index=False”.
first_table.to_csv('{}/manufacturing.csv'.format(data_folder), index=False)
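To see what `index=False` does to the saved file, the sketch below writes a small made-up table and reads it back. It uses `pathlib` and a temporary folder so it runs anywhere; both are choices made for this illustration, not part of the steps above.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Stand-in for the scraped table; the values are made up for illustration.
table = pd.DataFrame({"Rank": [1, 2], "Company": ["Alpha", "Beta"]})

# A temporary folder keeps the sketch portable; swap in your own data folder.
data_folder = Path(tempfile.mkdtemp())
csv_path = data_folder / "manufacturing.csv"

table.to_csv(csv_path, index=False)  # index=False leaves out the 0, 1, 2... column

# Read the file back to confirm what was written.
round_trip = pd.read_csv(csv_path)
print(csv_path.read_text().splitlines()[0])  # Rank,Company
print(list(round_trip.columns))              # ['Rank', 'Company']
```

Without `index=False`, the header row would begin with an empty field and each data row would begin with its index number.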

Now, we have downloaded our scraped data into a CSV spreadsheet. We have completed our mission starting with scraping a table online and finishing with the downloaded CSV data.

One thing you will want to consider before web scraping is whether the site (or information) you are scraping has any copyrighted content or other issues that may be against the terms and conditions of the site.

For more information on the legality of web scraping check out these links:

Is Web Scraping Legal?

Web scraping is legal, US appeals court reaffirms – TechCrunch

HiQ Labs v. LinkedIn Case of Web Scraping

 


Article by Zachary Storella – See more programming posts on our Python Page