This chapter shows how to extract multiple values using members and a dynamic query.

Example 2 extracts the following snapshot data from the page defs/examples/page/acme-quote.html:

  • MARKET CAP
  • EPS (TTM)
  • P/E
  • P/C
  • BOOK VALUE
  • PRICE/BOOK
  • DIV (%)
  • DIV YIELD
  • FACE VALUE
  • INDUSTRY P/E

The relevant HTML snippet from the page is:

<div id="snapshot">
    <div>
        <div>
            <div>MARKET CAP</div>
            <div>382,642.57</div>
            <div></div>
        </div>
        <div>
            <div>P/E</div>
            <div>-</div>
            <div></div>
        </div>
        <div>
            <div>BOOK VALUE</div>
            <div>27.89</div>
            <div></div>
        </div>
   ....

The dataDef used to extract data from this page is:

<dataDef name="snapshot">
    <axis name="col">
        <xf:fields>
            <xf:script script="document.getFromDate()" />
        </xf:fields>
        <member name="date" />
    </axis>
    <axis name="row">
        <xf:fields>
            <xf:query region="div#snapshot"
                field="div:matchesOwn(^%{row.match})" />
        </xf:fields>
        <member name="MC" match="MARKET CAP" />
        <member name="EPS" match="EPS \(TTM\)" />
        <member name="PE" match="P/E" />
        <member name="PC" match="P/C" />
        <member name="BV" match="BOOK VALUE" />
        <member name="PB" match="PRICE/BOOK" />
        <member name="DIV" match="DIV \(%\)" />
        <member name="DY" match="DIV YIELD" />
        <member name="FV" match="FACE VALUE" />
        <member name="IND PE" match="INDUSTRY P/E" />
    </axis>
    <axis name="fact">
        <xf:fields>
            <xf:query region="div#snapshot"
                field="div:matchesOwn(^%{row.match}) + div" />
        </xf:fields>
    </axis>
</dataDef>

Here, the row axis defines multiple member elements, each with name and match attributes.

The previous example defined price as <member name="Price" value="Price" />, with a value attribute instead of match. The value attribute assigns the value directly to the member without running any query.

The present example, however, uses the match attribute:

<member name="FV" match="FACE VALUE" />

When the match attribute is defined, its value can be accessed through the substitution variable %{<axisName>.match}. The row query uses %{row.match} as a substitution variable in its selector:

<axis name="row">
    <xf:fields>
        <xf:query region="div#snapshot"
            field="div:matchesOwn(^%{row.match})" />
    </xf:fields>

    ... Members ....

</axis>

When Gotz processes the row axis, it does the following for each member defined in the axis: it takes the raw query, replaces %{row.match} with the value of the member’s match attribute, and dispatches the resulting query to JSoup. Once JSoup returns the content of the selected element, Gotz assigns it to the member’s value field.
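
The substitution step described above can be sketched in a few lines of Python. This is only an illustration, not Gotz's actual code: the substitute function and the members list are hypothetical stand-ins, and only three of the ten members are shown.

```python
import re

def substitute(raw_query, axis_name, member):
    """Replace %{<axisName>.<attr>} placeholders with the member's attributes.

    Hypothetical helper that mimics the substitution described in the text.
    """
    pattern = re.compile(r"%\{" + re.escape(axis_name) + r"\.(\w+)\}")
    # A function replacement is inserted literally, so regex escapes
    # such as \( in the match attribute are preserved as written.
    return pattern.sub(lambda m: member[m.group(1)], raw_query)

# Raw queries copied from the row and fact axes of the dataDef.
RAW_ROW_QUERY = "div:matchesOwn(^%{row.match})"
RAW_FACT_QUERY = "div:matchesOwn(^%{row.match}) + div"

# A subset of the members defined in the row axis.
members = [
    {"name": "MC", "match": "MARKET CAP"},
    {"name": "EPS", "match": r"EPS \(TTM\)"},
    {"name": "BV", "match": "BOOK VALUE"},
]

row_selectors = {m["name"]: substitute(RAW_ROW_QUERY, "row", m) for m in members}
fact_selectors = {m["name"]: substitute(RAW_FACT_QUERY, "row", m) for m in members}

for name in row_selectors:
    print(name, "|", row_selectors[name], "|", fact_selectors[name])
```

Each member ends up with its own concrete selector, for example div:matchesOwn(^BOOK VALUE) for the member BV, even though the dataDef contains only one query per axis.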

Let’s see how match="BOOK VALUE" is handled by each axis. For the HTML snippet

<div>
  <div>BOOK VALUE</div>
  <div>27.89</div>
  <div></div>
</div>

when the row axis is processed, the selector div:matchesOwn(^BOOK VALUE) selects the element <div>BOOK VALUE</div>, because its own text matches the pattern, and returns its content, which is simply "BOOK VALUE".
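
Because the match value is used inside matchesOwn as a regular expression, the leading ^ and the escaped parentheses in the dataDef matter. The sketch below checks this with Python's re module, on the assumption that re.search behaves like the selector's find-style matching for these simple patterns. The labels and match values are copied from this chapter; the labels_matched helper is hypothetical.

```python
import re

# Labels shown on the page, in the order listed in this chapter.
LABELS = [
    "MARKET CAP", "EPS (TTM)", "P/E", "P/C", "BOOK VALUE",
    "PRICE/BOOK", "DIV (%)", "DIV YIELD", "FACE VALUE", "INDUSTRY P/E",
]

# Member name -> match attribute, as defined in the row axis.
MATCHES = {
    "MC": "MARKET CAP", "EPS": r"EPS \(TTM\)", "PE": "P/E", "PC": "P/C",
    "BV": "BOOK VALUE", "PB": "PRICE/BOOK", "DIV": r"DIV \(%\)",
    "DY": "DIV YIELD", "FV": "FACE VALUE", "IND PE": "INDUSTRY P/E",
}

def labels_matched(pattern, labels=LABELS):
    """Return the labels in which the pattern is found (hypothetical helper)."""
    regex = re.compile(pattern)
    return [label for label in labels if regex.search(label)]

# With the ^ prefix used in the queries, every member hits exactly one label.
anchored = {name: labels_matched("^" + match) for name, match in MATCHES.items()}

# Without ^, the P/E pattern would also be found in "INDUSTRY P/E".
unanchored_pe = labels_matched("P/E")

# Without escaping, (TTM) is read as a regex group and matches no label.
unescaped_eps = labels_matched("^EPS (TTM)")

print(anchored["PE"], unanchored_pe, unescaped_eps)
```

This is why the EPS and DIV members escape their parentheses, and why the query prefixes the pattern with ^.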

The fact axis uses a slightly modified selector:

<axis name="fact">
    <xf:fields>
        <xf:query region="div#snapshot"
            field="div:matchesOwn(^%{row.match}) + div" />
    </xf:fields>
</axis>

The fact selector is the same as the row selector, but with a trailing + div. The + combinator selects the div that immediately follows the matched element, which is <div>27.89</div>, and the query returns its content, i.e. 27.89.
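
To make the row and fact selections concrete, here is a rough emulation using only Python's standard library. It is not how Gotz or JSoup work internally: matches_own and next_sibling are hypothetical helpers, and the own-text check is simplified to the text that precedes an element's first child.

```python
import re
import xml.etree.ElementTree as ET

# The BOOK VALUE group from the page, wrapped as in the chapter's snippet.
SNIPPET = """
<div id="snapshot">
  <div>
    <div>
      <div>BOOK VALUE</div>
      <div>27.89</div>
      <div></div>
    </div>
  </div>
</div>
"""

def matches_own(root, tag, regex):
    """Emulate tag:matchesOwn(regex); return (parent, element) of the first hit."""
    pattern = re.compile(regex)
    for parent in root.iter():
        for child in parent:
            own_text = (child.text or "").strip()
            if child.tag == tag and pattern.search(own_text):
                return parent, child
    return None, None

def next_sibling(parent, element, tag):
    """Emulate '+ tag': the immediately following sibling, if it has that tag."""
    siblings = list(parent)
    index = siblings.index(element)
    if index + 1 < len(siblings) and siblings[index + 1].tag == tag:
        return siblings[index + 1]
    return None

root = ET.fromstring(SNIPPET)

# Row axis: div:matchesOwn(^BOOK VALUE)
parent, label = matches_own(root, "div", "^BOOK VALUE")
row_value = label.text

# Fact axis: div:matchesOwn(^BOOK VALUE) + div
fact_value = next_sibling(parent, label, "div").text

print(row_value, fact_value)
```

The wrapper divs contain only whitespace as their own text, so they never match the pattern; only the label div does, and its immediate sibling holds the fact.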


When we scrape a limited number of items, as in this example, it is convenient to define members directly with either the value or the match attribute. But when we scrape a large number of items, the dataDef becomes lengthy. To overcome that, Gotz provides two more features: indexRange and breakAfter.

The next chapter describes indexRange and how to define a dataDef with it.