Index, IndexRange and BreakAfter

This chapter explains the use of index, indexRange and breakAfter to extract large sets of data.

Example 3 extracts multiple data items from the page defs/examples/fin/page/acme-bs.html, which contains a company's Balance Sheet data for the past five years in an HTML table with 27 rows and 5 columns. Part of the table is shown below.

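| Item | Dec '16 | Dec '15 | Dec '14 | Dec '13 | Dec '12 |
| --- | --- | --- | --- | --- | --- |
| Total Share Capital | 804.72 | 801.55 | 795.32 | 790.18 | 781.84 |
| Reserves | 32,071.87 | 29,881.73 | 25,414.29 | 21,444.92 | 17,957.00 |
| … 25 more rows … | | | | | |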

The dataDef that extracts data from the table is specified in defs/examples/fin/jsoup/ex-3/job.xml; it extracts three rows of data for the first year column (Dec 2016). The dataDef is shown below.

dataDefs:
  bs:
    query:
      block: "table:contains(Sources Of Funds)"
      selector: "tr:nth-child(%{item.index}) > td:nth-child(%{dim.year.index})"
    items:
      - item:
          name: item
          selector: "tr:nth-child(%{index}) > td:nth-child(1)"
          indexRange: 7-9
    dims:
      - item:
          name: year
          selector: "tr:nth-child(1) > td:nth-child(%{index})"
          index: 2

An index can be specified in two ways: index or indexRange. The dataDef above uses index in the dim axis and indexRange in the item axis.

This inconspicuous-looking dataDef can scrape a lot of data with just a few lines of definition. Let's go through it thoroughly to understand the underlying concepts.

DataDef Breakdown

This dataDef defines three selectors (query/selector, item/selector and dim/selector) plus one query/block selector. The query/selector gets the value of the fact axis, while item/selector and dim/selector get the values of their respective axes. The query/block is common to all three selectors; it selects a block of nodes from the HTML page, and that block is cached for performance. The HTML page is represented as a tree of DOM nodes and the cached block is a subtree of it. To speed up parsing, selectors are run against the subtree and not the entire tree.

The effective selectors for the above dataDef are shown in the table below.

| Axis Name | Item Name | Effective selector |
| --- | --- | --- |
| dim | year | table:contains(Sources Of Funds) tr:nth-child(1) > td:nth-child(%{index}) |
| item | item | table:contains(Sources Of Funds) tr:nth-child(%{index}) > td:nth-child(1) |
| fact | fact | table:contains(Sources Of Funds) tr:nth-child(%{item.index}) > td:nth-child(%{dim.year.index}) |

As we can see, the query/block table:contains(Sources Of Funds) is common to all selectors. Hereafter, for brevity, the block is not shown along with the selector.
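How the block and an axis selector combine into an effective selector can be sketched in a few lines of Python. This is only an illustration, not Scoopi's own code: effective_selector is a hypothetical helper, and it assumes the two parts are joined with a single space (the CSS descendant combinator), which matches the effective selectors listed above.

```python
# Illustrative only: joins the shared block with each axis selector.
BLOCK = "table:contains(Sources Of Funds)"

AXIS_SELECTORS = {
    "dim": "tr:nth-child(1) > td:nth-child(%{index})",
    "item": "tr:nth-child(%{index}) > td:nth-child(1)",
    "fact": "tr:nth-child(%{item.index}) > td:nth-child(%{dim.year.index})",
}


def effective_selector(block: str, selector: str) -> str:
    """Scope an axis selector to the block (assumed: joined by one space)."""
    return f"{block} {selector}"


effective = {
    axis: effective_selector(BLOCK, selector)
    for axis, selector in AXIS_SELECTORS.items()
}

for axis, selector in effective.items():
    print(f"{axis}: {selector}")
```

Because every axis selector is scoped to the same block, the block only needs to be located once and can then be reused for all three axes.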

Once we know which selector is used for which axis, let's see how index or indexRange is handled. At the start of the parse, Scoopi creates an item object with three axes.

| Axis Name | Item Name | Axis selector (common query/block is omitted) | index |
| --- | --- | --- | --- |
| dim | year | tr:nth-child(1) > td:nth-child(%{index}) | 2 |
| item | item | tr:nth-child(%{index}) > td:nth-child(1) | 7 |
| fact | fact | tr:nth-child(%{item.index}) > td:nth-child(%{dim.year.index}) | |

For dim (year) the index is set to 2, as defined, but for item (item) the index is set to 7, which is the start of the defined indexRange 7-9. As Scoopi processes the axes one by one, it replaces the variables. For the dim and item axes, %{index} is replaced with the axis's own index value. For the fact axis, %{item.index} is replaced with the item index, which is 7, and %{dim.year.index} is replaced with the dim (year) index, which is 2. The replaced selectors become

| Axis Name | Item Name | Axis Selector | index |
| --- | --- | --- | --- |
| dim | year | tr:nth-child(1) > td:nth-child(2) | 2 |
| item | item | tr:nth-child(7) > td:nth-child(1) | 7 |
| fact | fact | tr:nth-child(7) > td:nth-child(2) | |
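The variable replacement for the first data item can be sketched in Python as follows. This is an illustrative stand-in, not Scoopi's implementation: substitute and PLACEHOLDER are hypothetical names, and the sketch assumes every placeholder has the form %{name}.

```python
import re

# Matches placeholders such as %{index} or %{dim.year.index}.
PLACEHOLDER = re.compile(r"%\{([^}]+)\}")


def substitute(template: str, values: dict) -> str:
    """Replace each %{name} with values[name]; unknown names raise KeyError."""
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


dim_index = 2   # index defined for dim (year)
item_index = 7  # first index of the item indexRange 7-9

# The dim and item axes use their own index for %{index}.
dim_selector = substitute(
    "tr:nth-child(1) > td:nth-child(%{index})", {"index": dim_index}
)
item_selector = substitute(
    "tr:nth-child(%{index}) > td:nth-child(1)", {"index": item_index}
)

# The fact axis refers to the indexes of the other two axes.
fact_selector = substitute(
    "tr:nth-child(%{item.index}) > td:nth-child(%{dim.year.index})",
    {"item.index": item_index, "dim.year.index": dim_index},
)

print(dim_selector)
print(item_selector)
print(fact_selector)
```

The fact selector ends up pointing at the cell where the item's row and the year's column intersect.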

That explains how Scoopi processes the first data item, but as the indexRange is 7-9 it has to handle every index in the range. To do that, after processing the first data item it creates a second data item, with the item axis index set to 8, and processes it.

data item 2

dim [2] : tr:nth-child(1) > td:nth-child(2)
item [8] : tr:nth-child(8) > td:nth-child(1)
fact []:  tr:nth-child(8) > td:nth-child(2)

Next it creates a third data item, with the item axis index set to 9, and processes it.

data item 3

dim [2] : tr:nth-child(1) > td:nth-child(2)
item [9] : tr:nth-child(9) > td:nth-child(1)
fact []:  tr:nth-child(9) > td:nth-child(2)

For all three data items, the index of axis dim (year) stays constant at 2, because a plain index is defined for that axis.
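The way one data item is generated per index of the range can be sketched as below. The helper names expand_index_range and data_items are hypothetical, and the sketch assumes that indexRange is an inclusive "start-end" pair; it is not taken from Scoopi's source.

```python
def expand_index_range(index_range: str) -> list[int]:
    """Turn an inclusive range such as "7-9" into [7, 8, 9]."""
    start, end = (int(part) for part in index_range.split("-"))
    return list(range(start, end + 1))


def data_items(item_range: str, dim_index: int) -> list[dict]:
    """Build one data item per item index; the dim index stays constant."""
    items = []
    for item_index in expand_index_range(item_range):
        items.append({
            "dim": f"tr:nth-child(1) > td:nth-child({dim_index})",
            "item": f"tr:nth-child({item_index}) > td:nth-child(1)",
            "fact": f"tr:nth-child({item_index}) > td:nth-child({dim_index})",
        })
    return items


items = data_items("7-9", dim_index=2)
for number, item in enumerate(items, start=1):
    print(f"data item {number}: fact -> {item['fact']}")
```

Only the item index changes from one data item to the next, so the dim selector is identical in all three.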

Get all data from the table

To get all data for one year, change the indexRange of axis item (item) from 7-9 to 7-39 and run Scoopi. The output should have 33 rows of data for that one year, i.e. Dec 2016.

Next, in axis dim (year), replace the index property with indexRange: 2-6. Now run Scoopi and you should have about 165 rows of data in the output (33 items × 5 years), which is the entire data from Dec 2012 to Dec 2016. We leave these as exercises.
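As a rough cross-check on the size of these two exercises, the index combinations generated by the ranges can be counted. This is only a sketch: expand_index_range is a hypothetical helper, and it counts index combinations, not rows in Scoopi's actual output file, which may differ.

```python
from itertools import product


def expand_index_range(index_range: str) -> range:
    """Turn an inclusive range such as "7-39" into range(7, 40)."""
    start, end = map(int, index_range.split("-"))
    return range(start, end + 1)


item_indexes = expand_index_range("7-39")  # rows of the table
year_indexes = expand_index_range("2-6")   # year columns

# Exercise 1: item range with a single year index.
one_year = len(item_indexes)

# Exercise 2: every (item index, year index) pair.
all_years = len(list(product(item_indexes, year_indexes)))

print(one_year, all_years)
```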

BreakAfter

In the previous example, finding the indexRange for axis dim (year) was easy because the table has a limited number of columns, but for axis item (item) it was tedious because we had to count table rows.

The breakAfter feature comes in handy when there are many rows or columns, or when the data in between grows or shrinks.

The job.xml file of the next example, Example 4, is the same as in the previous example, except that in items/item it uses breakAfter along with index instead of indexRange.

dataDefs:
  bs:
    query:
      block: "table:contains(Sources Of Funds)"
      selector: "tr:nth-child(%{item.index}) > td:nth-child(%{dim.year.index})"
    items:
      - item:
          name: item
          selector: "tr:nth-child(%{index}) > td:nth-child(1)"
          index: 5
          breakAfter:
            - "Book Value (Rs)"
    dims:
      - item:
          name: year
          selector: "tr:nth-child(1) > td:nth-child(%{index})"
          indexRange: 2-6

Now the first data item begins with index 5, and additional data items are created until the selector returns Book Value (Rs). Once the selector returns Book Value (Rs), the parser terminates. If the index property is not defined, the index defaults to 1.

breakAfter is also useful when the first and last items are constant and the rows or columns in between shrink or expand. Using breakAfter, we can scrape all the data between them without bothering about the range. Note that breakAfter is an array, so we can define multiple items; iteration breaks when any one of them is matched.
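The breakAfter iteration can be sketched in Python as follows. This is a simplified model under stated assumptions, not Scoopi's code: collect_until_break is a hypothetical helper, the row labels are made up except for Book Value (Rs), the comparison is assumed to be an exact text match, and the matching row is assumed to be included before iteration stops (as the name breakAfter suggests).

```python
def collect_until_break(labels, break_after, start_index=1):
    """Walk 1-based row indexes from start_index and stop after the first
    row whose label equals any entry in break_after."""
    collected = []
    for index in range(start_index, len(labels) + 1):
        label = labels[index - 1]
        collected.append((index, label))
        if label in break_after:
            break  # the matching row is kept, then iteration ends
    return collected


# Hypothetical first-column labels; only "Book Value (Rs)" comes from the example.
labels = [
    "Item", "row 2", "row 3", "row 4", "row 5", "row 6",
    "Book Value (Rs)", "row 8",
]

# Several breakAfter entries may be given; the first one matched ends the loop.
rows = collect_until_break(
    labels, ["Book Value (Rs)", "Some Other Label"], start_index=5
)
for index, label in rows:
    print(index, label)
```

With no start_index given, the sketch begins at row 1, mirroring the default index described above.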

If we go through the output of Example 4, we see a lot of unwanted data such as sub-headings, text such as “12 months”, nulls and blanks. This is a common problem when dealing with unstructured sources. To handle it, we can define filters, which the next chapter covers.