
Parsing a Table from a Website in Python: A Step-by-Step Guide

We’ve previously covered the basics of scraping data from websites. But did you know that scraping can also be used to extract data in tabular form? If you spend much time on the Internet, you’ve probably noticed that many websites contain tables: flight schedules, product features, service comparisons, TV schedules, ratings, and more.

In some cases, you can easily copy a table and paste it into Excel without writing a single line of code. But keep in mind that data scientists work with much larger amounts of data, where the copy-paste method is not very efficient. So now we will show you how to parse a table from a website in Python.

STEP 1. INSTALLING LIBRARIES

First of all, we need to install the following libraries into our development environment (for example, with pip, as shown after the list):

  1. BeautifulSoup4
  2. Requests
  3. pandas
  4. lxml
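A typical way to install all four at once is with pip; the exact command may differ depending on your Python setup:

# Install the required libraries
pip install beautifulsoup4 requests pandas lxml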

If you have any problems at this stage, we recommend reading the article on web scraping in Python.

STEP 2. IMPORTING LIBRARIES

Once the necessary libraries are installed, we can open Spyder. We use Spyder because we find it convenient for projects like this, but any editor or IDE you prefer will work.

The next step in parsing a table in Python is to import the libraries:
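A minimal set of imports for this tutorial might look like this (pandas is conventionally aliased as pd):

# Import the libraries used throughout the tutorial
import requests
import pandas as pd
from bs4 import BeautifulSoup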

STEP 3. SELECTING A PAGE

In this project, we will be scraping a table of COVID-19 data from Worldometers. Like the site in the previous tutorial, this website is built with plain HTML and is considered easier for beginners to understand.

STEP 4. REQUESTING PERMISSION

Once we have selected a page to scrape, we can copy its URL and use requests to ask the server for permission to retrieve data from its site.
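For example, a request to the Worldometers coronavirus page might look like this (the variable names are our own choice):

# Request the page from the server
url = 'https://www.worldometers.info/coronavirus/'
req = requests.get(url)
print(req)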

The <Response [200]> result means that the server allows us to collect information. Next, we need to process the HTML code with lxml to make it more readable.
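A sketch of that step, assuming the response object from the previous snippet is named req:

# Parse the raw HTML with the lxml parser
soup = BeautifulSoup(req.text, 'lxml')
print(soup.prettify())  # optional: view the formatted HTML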


STEP 5. VIEWING THE CODE OF TABLE ELEMENTS

In the previous article, we learned how to view the code of each element on a website page. To get information about the code of table elements, we need to check its location first.

If you inspect the page, you will see that this table sits inside a <table> tag with id = 'main_table_countries_today'. Now we can define a variable for it; in our case, we will call the table 'table1'.
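A sketch of that definition, assuming the parsed page is stored in soup:

# Find the table by its id attribute
table1 = soup.find('table', id='main_table_countries_today')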

STEP 6. CREATING COLUMNS

After creating table1, we can look at the location of each column. If we inspect all the columns, we will notice that they share the same characteristic.

The common characteristic of the columns is that their headers are located inside <th> tags.

After finding the tags, we create a for loop to populate an empty list with our columns. Let’s define the empty list as headers.
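One way to write that loop, assuming table1 from the previous step:

# Collect the column names from the <th> tags into an empty list
headers = []
for th in table1.find_all('th'):
    headers.append(th.text)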

The list is successfully filled, and we can check it again. Look at index 13: it contains multiline text. This wrapping can be a problem when we want to build a data frame from it, so we need to convert it to single-line text.

# Convert wrapped text in column 13 into one line text
headers[13] = 'Tests/1M pop'

Result: headers[13] now contains 'Tests/1M pop' as single-line text.

STEP 7. CREATING A DATA FRAME

The next step in parsing a table using Python is to create a data frame. I suggest defining the data frame as mydata.

# Create a dataframe
mydata = pd.DataFrame(columns = headers)

STEP 8. CREATING A FOR LOOP TO FILL THE DATA FRAME

Once the data frame is ready, we can populate it with the elements of each column. Before we create the for loop, we still need to determine the row and column location of each element.
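Once the locations are known (each data row is a <tr> and each cell a <td>), a sketch of such a loop might be the following, assuming every data row has the same number of cells as there are headers:

# Fill the data frame row by row, skipping the header row
for row in table1.find_all('tr')[1:]:
    row_data = row.find_all('td')
    values = [td.text for td in row_data]
    if len(values) == len(headers):      # guard against malformed rows
        mydata.loc[len(mydata)] = values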

 

STEP 9. CLEANING THE DATA FRAME

Next, once the data frame has been successfully created, we can delete the unnecessary rows and clean it up. In our case, we will delete the rows with index 0-6 and 222-228, then reset the index and drop the unneeded column.
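A sketch of that clean-up, assuming the '#' row-number column is the one being dropped (the exact index ranges depend on the table at the time of scraping):

# Remove the summary rows at the top and bottom of the table
mydata.drop(mydata.index[0:7], inplace=True)
mydata.drop(mydata.index[-7:], inplace=True)
# Reset the index and drop the row-number column
mydata.reset_index(inplace=True, drop=True)
mydata.drop('#', axis=1, inplace=True)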
