Raw data is the foundation of successful data science work. There are many sources of data, and websites are one of them. They often serve as a secondary source of information: data aggregation sites (Worldometers), news sites (CNBC), social media (Twitter), e-commerce platforms (Shopee), and so on. These websites provide the information needed for data science projects.
But how do we collect the data? We can’t copy and paste it manually, can we? In such a situation, the solution is web scraping in Python. This programming language has a powerful parsing library, BeautifulSoup, as well as an automation tool, Selenium. Both are often used by specialists to collect data in different formats. In this section, we will first get acquainted with BeautifulSoup.
STEP 1. INSTALLING LIBRARIES
First of all, we need to install the necessary libraries, namely:
- BeautifulSoup4
- Requests
- pandas
- lxml
To install a library, you can use pip install [library name], or conda install [library name] if you have the Anaconda Prompt.
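For example, a single pip command (assuming pip points at the Python environment you are working in) installs all four at once:

```
pip install beautifulsoup4 requests pandas lxml
```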
“Requests” is our next library to install. Its job is to ask the server for permission when we want to get data from its website. Then we need pandas to create the data frame and lxml to convert the HTML into a Python-friendly format.
STEP 2. IMPORTING LIBRARIES
After installing the libraries, open your favorite development environment. We suggest Spyder 4.2.5: at some stages of this work we will encounter large volumes of output, and Spyder is more convenient for that than Jupyter Notebook.
So, Spyder is open and we can import the required libraries:
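A minimal version of the imports for the libraries installed in step 1 looks like this (lxml is used indirectly, so it needs no import of its own):

```python
from bs4 import BeautifulSoup  # beautifulsoup4 installs under the package name bs4
import requests
import pandas as pd

# lxml is not imported directly; BeautifulSoup calls it as a parser backend
```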
STEP 3. SELECTING A PAGE
In this project, we will use webscraper.io. Since this website is built in plain HTML, its code is easier to understand, even for beginners. We chose one of its test pages to scrape the data: a prototype of an online store website. We will parse data about computers and laptops, such as product name, price, description, and reviews.
STEP 4. REQUEST FOR PERMISSION
Once we select a page, we copy its URL and use Requests to ask the server for permission to retrieve data from its site.
The <Response [200]> result means that the server allows us to collect data from its website. We can check this by calling the requests.get function.
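A minimal sketch of this step, assuming the webscraper.io test e-commerce page as the URL (substitute the page you chose):

```python
import requests

# Assumed URL of the webscraper.io test e-commerce page
url = "https://webscraper.io/test-sites/e-commerce/allinone"

response = requests.get(url)
print(response)  # <Response [200]> means the server returned the page successfully
```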
When you run this code, you will get the page’s raw HTML as one jumbled block of text, which is hard to read and hard to navigate programmatically. We need to use a parser to make it more readable.
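Continuing the sketch above, BeautifulSoup with the lxml parser turns that raw text into a navigable structure:

```python
from bs4 import BeautifulSoup

# response is the object returned by requests.get in step 4
soup = BeautifulSoup(response.text, "lxml")

# prettify() prints the HTML indented, one tag per line, so it is easy to read
print(soup.prettify())
```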
STEP 5. VIEW THE ELEMENT CODE
For Python web scraping, we recommend using Google Chrome; it is very convenient and easy to use. Let’s learn how to view a web page’s code in Chrome. First, right-click the page you want to check, then click Inspect, and you will see this:
Then click Select an element in the page to inspect, and you will notice that as you move the cursor over each element on the page, the Elements panel shows its code.
For example, if we move the cursor to Test Sites, the panel shows that Test Sites sits inside an h1 tag. In Python, if you want to get the code of a page element, you can call it by its tag. A characteristic feature of tags is that they always start with the < prefix, and they are often shown in purple.
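The same lookup in code: soup.find returns the first tag with the given name (the exact heading text depends on the page you scraped, so the output shown in the comment is an assumption):

```python
# The first <h1> on the page; on the test site this is the page heading
heading = soup.find("h1")
print(heading)       # the full tag, e.g. <h1>Test Sites</h1>
print(heading.text)  # just the text between the opening and closing tags
```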