Index, IndexRange and BreakAfter
This chapter explains how to use index, indexRange and breakAfter to extract a set of data without writing multiple queries.
Example 3 extracts multiple data items from the defs/examples/fin/page/acme-bs.html page, which contains the Balance Sheet data of Acme for the past five years in an HTML table with 27 rows and 5 columns. Part of the table is shown below.
Item | Dec ‘16 | Dec ‘15 | Dec ‘14 | Dec ‘13 | Dec ‘12 |
---|---|---|---|---|---|
Total Share Capital | 804.72 | 801.55 | 795.32 | 790.18 | 781.84 |
Reserves | 32,071.87 | 29,881.73 | 25,414.29 | 21,444.92 | 17,957.00 |
…. 25 more rows …. |
The datadef to extract data from the table is specified in defs/examples/jsoup/ex-3/job.xml. It extracts three rows of data for the first year column (Dec 2016). The dataDef is as below
bs:
  axis:
    fact:
      query:
        region: "table:contains(Sources Of Funds)"
        field: "tr:nth-child(%{row.index}) > td:nth-child(%{col.index})"
    col:
      query:
        region: "table:contains(Sources Of Funds)"
        field: "tr:nth-child(1) > td:nth-child(%{col.index})"
      members: [
        member: {name: year, index: 2},
      ]
    row:
      query:
        region: "table:contains(Sources Of Funds)"
        field: "tr:nth-child(%{row.index}) > td:nth-child(1)"
      members: [
        member: {name: item, indexRange: 7-9 },
      ]
An index can be specified in two ways: index or indexRange. The above dataDef uses index in the col axis and indexRange in the row axis.
The col axis uses the index property in its member element, with an index value of 2. The query selector td:nth-child(%{col.index}) uses the substitution variable %{col.index}, which is replaced by the index value 2, so the selector returns the content of the second <td> of the first row, Dec ‘16, for each row member.
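The substitution step can be pictured with a few lines of Python. This is only an illustrative sketch, not Scoopi's own code: the function name `substitute` and the regular expression are assumptions made for the example.

```python
import re

def substitute(selector, indexes):
    """Replace each %{axis.index} placeholder with the index of that axis.

    `indexes` maps an axis name ("col", "row") to its current index.
    Hypothetical helper for illustration; not part of Scoopi.
    """
    def repl(match):
        axis = match.group(1)  # the axis name inside %{...}, e.g. "col"
        return str(indexes[axis])
    return re.sub(r"%\{(\w+)\.index\}", repl, selector)

col_selector = "tr:nth-child(1) > td:nth-child(%{col.index})"
print(substitute(col_selector, {"col": 2}))
# tr:nth-child(1) > td:nth-child(2)
```

The same helper handles the fact selector, which carries both placeholders, when it is given both a `row` and a `col` entry.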
The row axis member uses indexRange
member: {name: item, indexRange: 7-9 }
For its selector, tr:nth-child(%{row.index}) > td:nth-child(1), the parser internally creates three members and allots them the indexes 7, 8 and 9 respectively.
The fact axis selector uses both the col and row indexes
tr:nth-child(%{row.index}) > td:nth-child(%{col.index})
The selectors for the three members, after the variables are substituted, are as follows (the index value is shown inside the [ ])
Member 1
col [2] : tr:nth-child(1) > td:nth-child(2)
row [7] : tr:nth-child(7) > td:nth-child(1)
fact [] : tr:nth-child(7) > td:nth-child(2)
Member 2
col [2] : tr:nth-child(1) > td:nth-child(2)
row [8] : tr:nth-child(8) > td:nth-child(1)
fact [] : tr:nth-child(8) > td:nth-child(2)
Member 3
col [2] : tr:nth-child(1) > td:nth-child(2)
row [9] : tr:nth-child(9) > td:nth-child(1)
fact [] : tr:nth-child(9) > td:nth-child(2)
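As a rough illustration of how an indexRange such as the 7-9 in the datadef unfolds into one member per index, the sketch below expands the range and builds the row and fact selectors for each member. The function `expand_index_range` and the dictionary layout are hypothetical, chosen only for this example, and say nothing about how Scoopi implements it.

```python
def expand_index_range(index_range):
    """Turn a range string such as "7-9" into [7, 8, 9] (both ends inclusive)."""
    first, last = (int(part) for part in index_range.split("-"))
    return list(range(first, last + 1))

COL_INDEX = 2  # index of the single col member (year)
ROW_SELECTOR = "tr:nth-child({row}) > td:nth-child(1)"
FACT_SELECTOR = "tr:nth-child({row}) > td:nth-child({col})"

# One member per index in the range, each with its own substituted selectors.
members = [
    {
        "index": row,
        "row": ROW_SELECTOR.format(row=row),
        "fact": FACT_SELECTOR.format(row=row, col=COL_INDEX),
    }
    for row in expand_index_range("7-9")
]

for member in members:
    print(member["index"], "|", member["row"], "|", member["fact"])
```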
Get all data from table
To get all the data for one year, change the indexRange of the row axis from 7-9 to 7-39 and run Scoopi. The output should have 33 rows of data for that one year, i.e. Dec 2016.
Next, in the col axis, change the index property to indexRange: 2-6. Now run Scoopi and you should have about 165 rows of data in the output (33 rows for each of the 5 years). We leave these as an exercise.
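The expected size of the output can be checked with a little arithmetic. The sketch below assumes that Scoopi emits one output row for every combination of a row member and a col member; that assumption, and the variable names, are ours.

```python
from itertools import product

row_indexes = range(7, 40)  # indexRange 7-39, both ends inclusive
col_indexes = range(2, 7)   # indexRange 2-6, both ends inclusive

# Every (row, col) pair corresponds to one fact selector.
cells = list(product(row_indexes, col_indexes))

print(len(row_indexes), "rows x", len(col_indexes), "cols =", len(cells))
# 33 rows x 5 cols = 165
```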
BreakAfter
In the previous example, finding the indexRange for the col axis was easy because the number of columns is limited, but for the row axis it was tedious.
The breakAfter feature comes in handy when there are many rows or columns, or when the data in between grows or contracts.
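The job.xml file of the next example, [Example 4](https://github.com/maithilish/scoopi-scraper/blob/master/engine/src/main/resources/defs/examples/jsoup/ex-4/job.yml), is the same as that of the previous example, except that the row axis uses breakAfter along with index instead of indexRange.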
row:
  query:
    region: "table:contains(Sources Of Funds)"
    field: "tr:nth-child(%{row.index}) > td:nth-child(1)"
  members: [
    member: {name: item, index: 5, breakAfter: ["Book Value (Rs)"] },
  ]
The member begins with index 5 and iterates until the selector returns “Book Value (Rs)”, and so outputs the full set of Balance Sheet data. If the index property is not set, the index begins at 1.
The breakAfter is also useful when the first and last items are constant and the rows or cols in between shrink or expand. Using breakAfter, we can scrape all the data between them without bothering about the index. Note that breakAfter is an array, so we can define multiple items, and the iteration breaks as soon as any one of them is matched.
If we go through the output of Example 4, we see a lot of unwanted data such as sub-headings, text such as “12 months”, nulls and blanks. This is a common problem when dealing with unstructured sources. To handle it, Scoopi allows us to define filters, which the next chapter covers.
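The iteration can be sketched in Python as a loop that walks the index upward and stops once a value matches one of the breakAfter items. This is a simplified model under stated assumptions: the function `collect_until_break` is hypothetical, the matching item is assumed to be included in the output, matching is assumed to be exact, and the sample row positions are made up rather than taken from the Acme page.

```python
def collect_until_break(values_by_index, start=1, break_after=()):
    """Collect (index, value) pairs from `start` onward.

    Stops after the first value found in `break_after` (that value is
    kept), or when there is no value at the next index.
    """
    collected = []
    index = start
    while index in values_by_index:
        value = values_by_index[index]
        collected.append((index, value))
        if value in break_after:  # any one matching item ends the iteration
            break
        index += 1
    return collected

# Made-up first-column values keyed by row index (not the real page layout).
sample = {
    5: "Total Share Capital",
    6: "Reserves",
    7: "Book Value (Rs)",
    8: "Footer text",
}

items = collect_until_break(sample, start=5, break_after=["Book Value (Rs)"])
for index, value in items:
    print(index, value)
```

Because the loop depends only on the starting index and the terminating text, rows can be added or removed in between without the definition having to change.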