Data Items and Dynamic Query

This chapter shows how to extract multiple data points by defining multiple items and using a dynamic query.

Example 2 extracts the ten data points listed below from the page defs/examples/fin/page/acme-snapshot.html.

  • MARKET CAP
  • EPS (TTM)
  • P/E
  • P/C
  • BOOK VALUE
  • PRICE/BOOK
  • DIV (%)
  • DIV YIELD
  • FACE VALUE
  • INDUSTRY P/E

The snippet of HTML from the page is

<div id="snapshot">
    <div>
        <div>
            <div>MARKET CAP</div>
            <div>382,642.57</div>
            <div></div>
        </div>
        <div>
            <div>P/E</div>
            <div>-</div>
            <div></div>
        </div>
        <div>
            <div>BOOK VALUE</div>
            <div>27.89</div>
            <div></div>
        </div>
   ....

The datadef used to extract data from this page is

defs/examples/fin/jsoup/ex-2/job.yml

dataDefs:

  snapshot:
    query:
      block: "div#snapshot"
      selector: "div:matchesOwn(^%{item.match}) + div"
    items: [
      item: { name: "MC", match: "MARKET CAP" },
      item: { name: "EPS", match: "EPS \\(TTM\\)" },
      item: { name: "PE", match: "P/E" },
      item: { name: "PC", match: "P/C" },
      item: { name: "BV", match: "BOOK VALUE" },
      item: { name: "PB", match: "PRICE/BOOK" },
      item: { name: "DIV", match: "DIV \\(%\\)" },
      item: { name: "DY", match: "DIV YIELD" },
      item: { name: "FV", match: "FACE VALUE" },
      item: { name: "IND PE", match: "INDUSTRY P/E" },
    ]  
    dims: [ 
      item: { name: "date", script: "document.getFromDate()" },
    ]

Here, the items array defines multiple item elements, each with name and match properties, and the dims array defines a single item for the date.
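The match values end up inside the matchesOwn() part of the selector, which JSoup treats as a regular expression. That is why the parentheses in EPS (TTM) and DIV (%) are escaped in the datadef and why the query anchors the pattern with ^. The Python sketch below illustrates both points with the standard re module. It only mimics the matching step, assuming find-style (unanchored) matching, and the helper matching_labels is a hypothetical name used for illustration; it is not Scoopi or JSoup code.

```python
import re

# The ten labels that appear on the snapshot page.
labels = ["MARKET CAP", "EPS (TTM)", "P/E", "P/C", "BOOK VALUE",
          "PRICE/BOOK", "DIV (%)", "DIV YIELD", "FACE VALUE", "INDUSTRY P/E"]

def matching_labels(pattern):
    """Return the labels in which the regular expression is found."""
    return [label for label in labels if re.search(pattern, label)]

# Without the anchor, P/E is also found inside INDUSTRY P/E.
print(matching_labels("P/E"))

# Anchoring with ^ restricts the match to labels that start with P/E.
print(matching_labels("^P/E"))

# Unescaped parentheses form a regex group, so the literal "(TTM)" is not matched.
print(matching_labels("^EPS (TTM)"))

# Escaped parentheses match the label literally. The YAML string "EPS \\(TTM\\)"
# yields exactly this pattern after parsing.
print(matching_labels(r"^EPS \(TTM\)"))
```

Only the anchored and escaped patterns select exactly one label each, which is what a per-item query needs.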

Dynamic Query

When scraping a single data point, say MARKET CAP, we can hard-code the selector as

query:
  block: "div#snapshot"
  selector: "div:matchesOwn(MARKET CAP) + div"

But to access, say, P/E we need another selector, div:matchesOwn(P/E) + div, and for the above example we would need 10 selectors in all. Instead, we can use Scoopi's dynamic query feature: the hard-coded query is converted into a dynamic query with a substitution variable.

query:
  block: "div#snapshot"
  selector: "div:matchesOwn(^%{item.match}) + div"

Scoopi dynamically substitutes the variable %{item.match} with the match property of the item and fires the query. With a dynamic query, we can scrape any number of items with a single query.
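The substitution amounts to a plain text replacement in the raw selector. A minimal Python sketch of the idea follows; the helper name substitute and the dictionary layout of the items are assumptions made for illustration, and Scoopi's actual implementation may differ.

```python
RAW_SELECTOR = "div:matchesOwn(^%{item.match}) + div"

# A few items from the datadef. After YAML parsing, "EPS \\(TTM\\)" becomes EPS \(TTM\).
items = [
    {"name": "MC", "match": "MARKET CAP"},
    {"name": "EPS", "match": r"EPS \(TTM\)"},
    {"name": "PE", "match": "P/E"},
]

def substitute(raw_query, item):
    """Replace every %{item.<property>} placeholder with the item's property value."""
    query = raw_query
    for prop, value in item.items():
        query = query.replace("%{item." + prop + "}", value)
    return query

# One raw query yields one concrete selector per item.
selectors = [substitute(RAW_SELECTOR, item) for item in items]
for selector in selectors:
    print(selector)
```

The raw query is written once; adding another data point only requires adding another item to the list.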

When Scoopi processes the item axis, it does the following for each item defined: it takes the raw query, replaces %{item.match} with the item’s match property and dispatches the query to JSoup. Once JSoup returns the content of the selected element, Scoopi assigns it to the item’s value field. Let’s see how Scoopi handles the item with match="BOOK VALUE".

dataDefs:

  snapshot:
    query:
      block: "div#snapshot"
      selector: "div:matchesOwn(^%{item.match}) + div"
    items: [
      item: { name: "BV", match: "BOOK VALUE" },
    ]  
    dims: [ 
      item: { name: "date", script: "document.getFromDate()" },
    ]

The HTML snippet is

<div>
  <div>BOOK VALUE</div>
  <div>27.89</div>
  <div></div>
</div>

For the above datadef, Scoopi internally creates a data item object with three axes.

Axis Name   Item Name   Query or Script
dim         date        document.getFromDate()
item        BV          no query/script; name is set as value
fact        fact        div#snapshot and div:matchesOwn(^%{item.match}) + div

Note that we defined only two axes in the datadef - dim (date) and item (BV); the third axis, fact, is added by Scoopi by default. Scoopi processes each axis in turn, starting with dim (date), to which it assigns the date returned by the script. Next, for item (BV), it assigns the item’s own name, i.e. BV, as the value because no selector or script is defined for that axis. Finally, it processes the fact axis: it takes the raw query, looks up the match field of the current item axis and substitutes it for the dynamic variable %{item.match}. The substituted selector thus becomes div:matchesOwn(^BOOK VALUE) + div, which scrapes the value 27.89 from the HTML. We can also use other dynamic variables such as %{item.index} and %{item.value}.
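The three steps can be traced with a small Python emulation that uses only the standard library. This is a sketch under stated assumptions, not Scoopi's code: select_value is a hypothetical helper that imitates the selector div:matchesOwn(pattern) + div on well-formed markup, and the string "<fromDate>" is a placeholder for whatever document.getFromDate() returns.

```python
import re
import xml.etree.ElementTree as ET

# Part of the snapshot page, closed off so that it parses as well-formed XML.
HTML = """
<div id="snapshot">
  <div>
    <div><div>MARKET CAP</div><div>382,642.57</div><div></div></div>
    <div><div>P/E</div><div>-</div><div></div></div>
    <div><div>BOOK VALUE</div><div>27.89</div><div></div></div>
  </div>
</div>
"""

def select_value(block, pattern):
    """Imitate 'div:matchesOwn(pattern) + div': return the text of the div that
    immediately follows a div whose own text matches the pattern."""
    for parent in block.iter("div"):
        children = list(parent)
        for label, value in zip(children, children[1:]):
            own_text = (label.text or "").strip()
            if re.search(pattern, own_text):
                return value.text
    return None

block = ET.fromstring(HTML.strip())   # stands in for block: "div#snapshot"
item = {"name": "BV", "match": "BOOK VALUE"}
raw_pattern = "^%{item.match}"        # regex part of the raw selector

result = {}
result["date"] = "<fromDate>"         # dim axis: placeholder for document.getFromDate()
result["item"] = item["name"]         # item axis: no query or script, so the name is the value
pattern = raw_pattern.replace("%{item.match}", item["match"])
result["fact"] = select_value(block, pattern)   # fact axis: substituted query is executed

print(pattern)
print(result)
```

The fact axis ends up holding 27.89, the text of the div that follows the BOOK VALUE label.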

The subsequent chapters cover other types of dynamic queries extensively, with more examples.

When we scrape a limited number of items, as in Example 2, it is convenient to list the items directly, each with a value or match property. But when a large number of items is involved, the dataDef becomes lengthy. To overcome that, Scoopi comes with two more features: indexRange and breakAfter.

The next chapter covers item properties - index, indexRange and breakAfter.