Query, Block and Selector

So far in this guide, we have covered the basics of locators, tasks and datadefs. In this chapter we review selectors and show how to get a selector or XPath using the Chrome or Firefox browser and fine-tune it with Query Analyzer.

Selector

In Scoopi, selector is a common term that refers either to the CSS selector (JSoup) or to the XPath (HtmlUnit or Selenium driver) used to scrape data from the page.

dataDefs:
  bs:
    query:
      block: "body > table > tbody"
      selector: "tr:nth-child(6) > td:nth-child(2)"
    items:  
      - item:
          name: item
          selector: "tr:nth-child(6) > td:nth-child(1)"

    dims:  
      - item:
          name: year
          selector: "tr:nth-child(1) > td:nth-child(2)"

The above datadef defines two axes, items/item and dims/item, and three selectors. Let's see how they are used. Scoopi creates a data item object to hold the data. For each item in the items and dims arrays, it creates an axis and assigns it the selector defined in that item. It also adds a default fact axis and assigns it the selector defined in the query element. For the above datadef, the data item object looks as follows.

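| Axis Name | Item Name | Axis selector                     |
|-----------|-----------|-----------------------------------|
| dim       | year      | tr:nth-child(1) > td:nth-child(2) |
| item      | item      | tr:nth-child(6) > td:nth-child(1) |
| fact      | fact      | tr:nth-child(6) > td:nth-child(2) |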
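
The rule described above can be sketched in a few lines of Python. This is only an illustration of how axes follow from a datadef, not Scoopi's actual code: the function name build_axes is hypothetical, and the dict is a hand-written mirror of the YAML datadef.

```python
# Illustrative sketch only: mirrors the rule described in the text,
# not Scoopi's internal implementation. build_axes is a hypothetical name.

def build_axes(datadef):
    """Return (axis name, item name, selector) rows for one datadef."""
    rows = []
    # Each entry in the dims and items arrays contributes one axis.
    for key, axis_name in (("dims", "dim"), ("items", "item")):
        for entry in datadef.get(key, []):
            item = entry["item"]
            rows.append((axis_name, item["name"], item["selector"]))
    # The default fact axis takes its selector from the query element.
    rows.append(("fact", "fact", datadef["query"]["selector"]))
    return rows


# Hand-written dict equivalent of the "bs" datadef shown earlier.
bs = {
    "query": {
        "block": "body > table > tbody",
        "selector": "tr:nth-child(6) > td:nth-child(2)",
    },
    "items": [
        {"item": {"name": "item", "selector": "tr:nth-child(6) > td:nth-child(1)"}}
    ],
    "dims": [
        {"item": {"name": "year", "selector": "tr:nth-child(1) > td:nth-child(2)"}}
    ],
}

for row in build_axes(bs):
    print(row)
```

Note that query/block does not become an axis of its own; it only narrows the part of the page that the axis selectors are run against.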

The query element defines two selectors - block and selector.

bs:
    query:
      block: "body > table > tbody"
      selector: "tr:nth-child(6) > td:nth-child(2)"

As already explained, the selector defined by query/selector is assigned to the fact axis. The one defined by query/block is also a selector, but it is common to all the other selectors. It selects a block of nodes from the HTML page, and that block is cached. The HTML page is represented as a tree of DOM nodes and the cached block is a sub tree of it. To speed up parsing, the axis selectors are run against the sub tree and not the entire tree.
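
The idea of selecting a block once and then running the axis queries against that sub tree can be sketched as follows. This is a rough illustration, not how Scoopi is implemented: it uses the limited XPath support of Python's xml.etree.ElementTree in place of JSoup selectors, and the page is a trimmed, well-formed stand-in in which the data row is row 2 rather than row 6.

```python
# Illustrative sketch: ElementTree's limited XPath stands in for the
# JSoup selectors or HtmlUnit XPath that Scoopi actually uses.
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the real page (assumption).
PAGE = """
<html><body><table><tbody>
  <tr><td></td><td>Dec '16</td><td>Dec '15</td></tr>
  <tr><td>Total Share Capital</td><td>804.72</td><td>801.55</td></tr>
</tbody></table></body></html>
"""

root = ET.fromstring(PAGE)

# Step 1: run the block query once against the whole tree and keep the result.
block = root.find("body/table/tbody")

# Step 2: run each axis query against the cached sub tree only.
axis_queries = {
    "dim": "tr[1]/td[2]",
    "item": "tr[2]/td[1]",
    "fact": "tr[2]/td[2]",
}
values = {axis: block.find(path).text for axis, path in axis_queries.items()}
print(values)
```

Because every axis query starts from the block node, none of them has to walk the part of the document that lies outside the table body.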

Finding the Selector with Chrome or Firefox browser

Let's use the Quickstart example, where we query price data from the defs/examples/fin/page/acme-bs.html page, to understand how to find the selector or XPath.

The HTML snippet from the page is shown below.

<table>
  <tr>
    <td colspan="1"></td>
    <td>Dec '16</td><td>Dec '15</td><td>Dec '14</td>
  </tr>
  <tr>
    <td colspan="1">Total Share Capital</td>
    <td>804.72</td><td>801.55</td><td>795.32</td>        
  </tr>

  ....

Open the page in Chrome, select Dec '16, right-click to get the context menu and choose Inspect to open the inspection panel. In the inspection panel, right-click to open the context menu and choose the Copy option, where we can copy either the selector or the XPath. Copy the selector and paste it into a text editor, then repeat the process for Total Share Capital and its value 804.72. The selectors provided by Chrome for these items are shown below.

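| Axis | HTML item           | Selector                                                 |
|------|---------------------|----------------------------------------------------------|
| Dim  | Dec '16             | body > table > tbody > tr:nth-child(1) > td:nth-child(2) |
| Item | Total Share Capital | body > table > tbody > tr:nth-child(6) > td:nth-child(1) |
| Fact | 804.72              | body > table > tbody > tr:nth-child(6) > td:nth-child(2) |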

From this we can see that body > table > tbody is common to all three selectors. Remove the common part from the selectors and move it to the block property.
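
Splitting off the common part by hand is easy for three selectors, but the same step can be sketched in Python for longer lists. The helper below is hypothetical and not part of Scoopi; it assumes the selectors use " > " between steps, as the ones copied from Chrome do.

```python
# Hypothetical helper, not part of Scoopi: splits full selectors into a
# shared block selector and the remaining relative selectors.

def split_block(selectors, sep=" > "):
    """Return (block, relative_selectors) for a list of full selectors."""
    if not selectors:
        return "", []
    parts = [s.split(sep) for s in selectors]
    # Always leave at least one step in every relative selector.
    limit = min(len(p) for p in parts) - 1
    n = 0
    while n < limit and len({p[n] for p in parts}) == 1:
        n += 1
    block = sep.join(parts[0][:n])
    relative = [sep.join(p[n:]) for p in parts]
    return block, relative


chrome_selectors = [
    "body > table > tbody > tr:nth-child(1) > td:nth-child(2)",
    "body > table > tbody > tr:nth-child(6) > td:nth-child(1)",
    "body > table > tbody > tr:nth-child(6) > td:nth-child(2)",
]

block, relative = split_block(chrome_selectors)
print("block:", block)
for sel in relative:
    print("selector:", sel)
```

The block value goes into query/block and the relative selectors go into query/selector and the items and dims entries of the datadef.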

The procedure is the same for Firefox, but in the Copy option select CSS Selector. The selectors returned by Firefox are shown below.

| HTML item           | Selector                                                                           |
|---------------------|------------------------------------------------------------------------------------|
| Dec '16             | body > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2) |
| Total Share Capital | body > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(6) > td:nth-child(1) |
| 804.72              | body > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(6) > td:nth-child(2) |

Here we can cut the common part, body > table:nth-child(1) > tbody:nth-child(1), from the selectors and move it to query/block.

Figure: Getting the JSoup selector from Chrome.

Query Analyzer

With complex data, the selector provided by the browser may not produce the expected results. Scoopi has a small tool, Query Analyzer, that is useful for fine-tuning the selector or XPath.

To use Query Analyzer, edit conf/scoopi.properties and set scoopi.defs.dir=/defs/analyzer and scoopi.datastore.enable=false. Next, edit defs/analyzer/job.yml and set the locator url to the HTML page you want to use. By default, the analyzer analyzes selectors using JSoup. To analyze an XPath query with HtmlUnit, change the steps property in analyzeTask to htmlUnitQuery.

Now run Scoopi. It loads the page and displays a prompt.

Figure: Scoopi Query Analyzer prompt.

At the prompt, enter the selector or XPath and the analyzer displays the matching HTML elements from the page. This helps you understand and adjust selectors. After displaying the matching elements, the analyzer waits for the next input, so you can keep refining the query until the results are as expected.

When a matched element is longer than 10 lines, the analyzer truncates it and displays only the top and bottom 5 lines. To change the number of lines displayed, use option 3. Option 1 shows the page source and option 2 writes the page source to a file.
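
The truncation rule can be pictured with a small Python sketch. It only illustrates the behaviour described above and is not the analyzer's code: the function name and the "..." marker placed between the two halves are assumptions.

```python
# Illustrative sketch of the truncation rule; the "..." marker and the
# function name are assumptions, not taken from the analyzer.

def truncate_lines(text, max_lines=10, keep=5):
    """Keep only the top and bottom `keep` lines of overly long text."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    return "\n".join(lines[:keep] + ["..."] + lines[-keep:])


# A 12-line element is cut down to its first 5 and last 5 lines.
element = "\n".join(f"<tr>row {i}</tr>" for i in range(1, 13))
print(truncate_lines(element))
```

An element of 10 lines or fewer passes through unchanged.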

To access complex items, we may have to use functions to pick a specific item. Learn more about selectors in JSoup Selector Syntax and JSoup Selector API.

If you use Docker to run Scoopi, create a separate container and run it to analyze selectors.


# Create new container for Analyzer
# Regular scoopi container is not interactive and throws `No Line Found` error when used as analyzer!!!

$ docker run -it --name scoopi-analyzer -v "$PWD"/defs:/scoopi/defs -v "$PWD"/conf:/scoopi/conf -v "$PWD"/logs:/scoopi/logs -v "$PWD"/output:/scoopi/output codetab/scoopi:latest

# for subsequent runs, use 
$ docker start -ia scoopi-analyzer
  

The next chapter explains multi-dimensional data.