This chapter explains the use of index, indexRange and dynamic query to extract set of data without using multiple queries.

Example 3 extracts data from defs/examples/page/acme-bs.html page, which contains Balance Sheet data of Acme for past five years in a HTML table with 27 rows and 5 columns. The partial contents of the table is shown below.

Dec ‘16

Dec ‘15

Dec ‘14

Dec ‘13

Dec ‘12

Equity

804.72

801.55

795.32

790.18

781.84

Reserves

32,071.87

29,881.73

25,414.29

21,444.92

17,957.00

….

The datadef to extract data from the table is specified in defs/examples/jsoup/ex-3/job.xml which extracts three rows of data for the first column (Dec 2016). DataDef is as below

<dataDef name="bs">
   <axis name="col">
       <xf:fields>
           <xf:query region="table:contains(Sources Of Funds)"
               field="tr:nth-child(1) > td:nth-child(%{col.index})" />
       </xf:fields>
       <member name="year" index="2" />
   </axis>
   <axis name="row">
       <xf:fields>
           <xf:query region="table:contains(Sources Of Funds)"
               field="tr:nth-child(%{row.index}) > td:nth-child(1)" />
       </xf:fields>
       <member name="item">
           <xf:fields>
               <xf:indexRange value="6-8" />
           </xf:fields>
       </member>
   </axis>
   <axis name="fact">
       <xf:fields>
           <xf:query region="table:contains(Sources Of Funds)"
               field="tr:nth-child(%{row.index}) > td:nth-child(%{col.index})" />
       </xf:fields>
   </axis>
</dataDef>

Index can be specified in two ways - index or indexRange. The above dataDef uses index in col axis and indexRange in row axis. It is important to note that the index is attribute of the member whereas indexRange is <xf:fields>.

 
 

The col axis uses index attribute in member element with the index value of 2. The query selector td:nth-child(%{col.index}) uses substitution variable %{col.index} which is replaced by index value 2 and content of second <td> which is nothing but data for Dec ‘16 is returned by the selector for each row.

For axis row, we use indexRange

<member name="item">
    <xf:fields>
        <xf:indexRange value="6-8" />
    </xf:fields>
</member>

and for its selector, tr:nth-child(%{row.index}) > td:nth-child(1), Gotz creates 3 members and allots index 6, 7 and 8 respectively.

The Fact axis selector uses both the col and row indexes

tr:nth-child(%{row.index}) >  td:nth-child(%{col.index})

The selector for three members after variables are substituted is as follows (index value is shown inside the [ ])

Member 1

col [2] : tr:nth-child(1) > td:nth-child(2)
row [6] : tr:nth-child(6) > td:nth-child(1)
fact [] :  tr:nth-child(6) >  td:nth-child(2)

Member 2

col [2] : tr:nth-child(1) > td:nth-child(2)
row [7] : tr:nth-child(7) > td:nth-child(1)
fact []:  tr:nth-child(7) >  td:nth-child(2)

Member 3

col [2] : tr:nth-child(1) > td:nth-child(2)
row [8] : tr:nth-child(8) > td:nth-child(1)
fact []:  tr:nth-child(8) >  td:nth-child(2)

Get all data from table

To get all data for one year, change the indexRange of row axis from 6-8 to 6-38 and run Gotz. The output data should have 33 rows of data for the one year i.e. Dec 2016.

Next, in col axis, remove index attribute from member and copy the xf:fields which contains indexRange from row axis to col axis member and change its range to 2-7. Now, run Gotz and you should have about 198 rows of data in output. We leave these as exercise.

 
 

BreakAfter

In the previous example, finding indexRange for col was easy because of limited number of columns, but for row, it was tedious.

The breakAfter feature comes handy when rows or columns are more or when in-between data grows or contracts.

The job.xml file of Example-4 is same as the third example, but for row axis, it uses breakAfter along with index instead of IndexRange.

<axis name="row">
   <xf:fields>
      <xf:query region="table:contains(Sources Of Funds)"
            field="tr:nth-child(%{row.index}) > td:nth-child(1)" />
   </xf:fields>
   <member name="item" index="5">
        <xf:fields>
            <xf:breakAfter value="Book Value" />
          </xf:fields>
    </member>
</axis>

Member begins with index 5 and loops until selector returns “Book Value” and outputs full set of Balance sheet data. In case, index attribute is not set then index begins with 1.

The breakAfter is also useful when first and last item is constant and in-between rows or cols shrinks or expands. Using breakAfter, we can consume all data between them without bothering about the index.

If we go through the output of Example-4, we see a lot of unwanted data such as sub-headings, texts such as “12 months”, nulls and blanks etc., and this is a common problem when dealing with unstructured sources. To handle this, Gotz allows us to define filters and the next chapter covers it.