How to Use XPath and CSS to Scrape Data for Free

This article is written to help you quickly understand how to parse data using the iDatica extension. It is designed for people who are not familiar with XPath and CSS. We will cover a little theory and the basic syntax (for data parsing) that will help you understand how to collect data from the vast majority of sites.

Using XPath for Parsing

First of all, you need to understand what XPath (XML Path Language) is: a language for querying elements of XML markup. This means that by sending a request composed in a certain way, you receive the necessary data in response. A simple analogy is an address in the browser's address bar or a path to a folder in the file explorer: by typing the correct path you get to the desired site or the desired folder. With XPath it is the same: we write the path and get to the necessary data, except that, unlike the address bar, XPath is used for searching. In our case, we search through documents in HTML format, in other words, through the code on which the site is built.

If you right-click on an empty space on the page and select “site source code” or “view page code” in the context menu (the wording varies between browsers), you will end up on the page with the code from which the parser extracts data.

For example, the code might look like this:

As you can see, the code is a tree structure in which each element is marked in a certain way; our task is to indicate to the parser the path to the element we need.
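To make the tree idea concrete, here is a minimal sketch in Python that uses only the standard library module xml.etree.ElementTree. The markup is a made-up, well-formed fragment, not the code of any real site, and outline is a hypothetical helper name chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for illustration only.
PAGE = """
<html>
  <body>
    <div class="header">Shop</div>
    <div class="catalog">
      <ul>
        <li><h2>Product A</h2><span class="price">10</span></li>
        <li><h2>Product B</h2><span class="price">20</span></li>
      </ul>
    </div>
  </body>
</html>
"""

def outline(element, depth=0):
    """Return one indented line per element, mirroring the nesting of the markup."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(PAGE.strip())  # root is the <html> element
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline is one step down the tree, which is exactly what a path has to describe. Real pages are often not well-formed XML, so in practice a dedicated HTML parser would be needed instead of ElementTree.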

We will consider further actions using the example of our database catalog at this address.

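For further work we will need the developer tools built into the browser: in Chrome, open the context menu and choose “view code”; in Firefox, open the context menu and choose “explore” (in the English interface both items are labelled “Inspect”).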

So, let’s find the path to the product card name:

Right-click on the product name; a context menu will open. Select “view code”, and the required element is highlighted in the code. How can you determine the path to it? As in the case of the file explorer, go down from the top category to the “required folder”. The top directory is “html”, then “body”, then several “div” and “ul” blocks. If at some level there are several blocks with the same name, then in square brackets we write which element in order we need:

If you write this path down, you get:

You can get this path directly in the extension by clicking on the link icon and then clicking on the desired element on the page.
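The descent from “html” through “body” to the product name, with a number in square brackets wherever several blocks share a name, can be sketched in Python as follows. The markup and the path are hypothetical and do not reproduce the catalog page discussed here. The sketch uses the standard library's ElementTree, which supports only a subset of XPath and searches relative to the element it is called on, so the leading “/html” of the full path is dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for illustration only.
PAGE = """
<html>
  <body>
    <div class="header">Shop</div>
    <div class="catalog">
      <ul>
        <li><h2>Product A</h2><span class="price">10</span></li>
        <li><h2>Product B</h2><span class="price">20</span></li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# In full XPath this would be /html/body/div[2]/ul/li[1]/h2.
# div[2] picks the second <div> inside <body>; li[1] picks the first <li>.
# Positions are counted from 1, not from 0.
FULL_PATH = "body/div[2]/ul/li[1]/h2"

name = root.find(FULL_PATH)
print(name.text)  # prints: Product A
```

Changing li[1] to li[2] in this sketch selects the second product instead, which shows how fragile a long positional path is when the page layout changes.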

Working with such long paths is inconvenient, and on some sites a path obtained this way will not reach every element at once; in some cases it has to be modified after studying the features of the site structure. But a path to the data can be created much more easily and quickly. For that we need to get acquainted with the syntax of XPath.

XPath Syntax

Relative path

A double slash (“//”) means a relative path and allows you to find all variants of what you are looking for on the page, wherever they sit in the tree. Thus, since we were looking for the final h2 element, the entry “//h2” will give the same result as the long path we wrote at the beginning:

This way you can access any elements on the page.
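As a rough illustration of the double slash, the sketch below runs the short query against a made-up fragment using Python's standard library. ElementTree spells the XPath “//h2” as “.//h2” because it searches from a starting element; the markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for illustration only.
PAGE = """
<html>
  <body>
    <div class="header">Shop</div>
    <div class="catalog">
      <ul>
        <li><h2>Product A</h2><span class="price">10</span></li>
        <li><h2>Product B</h2><span class="price">20</span></li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# ".//h2" finds every <h2> at any depth below the root, in document order.
headings = root.findall(".//h2")
names = [h2.text for h2 in headings]
print(names)  # prints: ['Product A', 'Product B']

# The element reached by the long positional path is among the matches.
long_path_match = root.find("body/div[2]/ul/li[1]/h2")
print(long_path_match in headings)  # prints: True
```

Note that the short query returns every h2 on the page, so it matches the long path exactly only when the page has a single h2 or when all of them are wanted.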

Request conditions

Okay, let's move on and extract the price. In the code it is not marked by an element of its own, like the h2 header: the price is in an inline span element, but there are many span elements on the page and they are responsible for different data. How can we access the right one?

If you look at the code, you can see that many elements on the page contain attributes with values; for example, the price element “span” has a “class” attribute with the value “price”, and this is the value we can refer to. To do this, in square brackets after the element we are looking for, you need to write the search conditions for this element.
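A condition on an attribute can be sketched like this in Python, again with a made-up fragment and only the standard library. The class names “price” and “stock” are assumptions chosen for the example, not taken from a real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for illustration only.
PAGE = """
<html>
  <body>
    <ul>
      <li>
        <h2>Product A</h2>
        <span class="stock">in stock</span>
        <span class="price">10</span>
      </li>
      <li>
        <h2>Product B</h2>
        <span class="stock">sold out</span>
        <span class="price">20</span>
      </li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Without a condition, every <span> is returned, whatever data it holds.
all_spans = root.findall(".//span")
print(len(all_spans))  # prints: 4

# The condition in square brackets keeps only spans whose class is "price".
prices = [span.text for span in root.findall(".//span[@class='price']")]
print(prices)  # prints: ['10', '20']
```

The comparison in this sketch is an exact match on the whole attribute value, so an element with class="price big" would not be selected by it.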