DataDef

Scoopi uses a dataDef to define data. A dataDef contains query, items and dims, which collectively define the data to be scraped from the HTML page.

In this chapter, we go through the Quickstart job.yml to explain dataDef. This job.yml uses a simple dataDef that scrapes one data point, the price of the company's share, from the defs/examples/fin/page/acme-snapshot.html page.

The dataDefs snippet from defs/examples/fin/jsoup/quickstart/job.yml is shown below:

dataDefs:
  price:
    query:
      block: "div#price_tick"
      selector: "*"
    items: [
      item: { name: "Price", value: "Price" },
    ]

It defines a simple dataDef named price, which has two elements: query and items.

Query

The query defines two properties, block and selector. Both are JSoup selectors (or XPath expressions when HtmlUnit is used) that query the data from the page.

price:
  query:
    block: "div#price_tick"
    selector: "*"

These two selectors, block and selector, scrape the price data from the page. A later chapter explains how to find the selector or XPath and how to break it into block and selector. For now, we concentrate on the basics of dataDef.
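
To make the idea of a block concrete, here is a minimal Python sketch that uses only the standard library. It is not how Scoopi or JSoup work internally. It assumes that block narrows the page to one region and that the selector "*" then matches every element inside that region. The HTML fragment, the BlockText class and the scrape function are hypothetical; the real acme-snapshot.html markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical fragment; the real acme-snapshot.html markup may differ.
HTML = """
<div id="header"><span>ACME Corp</span></div>
<div id="price_tick"><span>315.25</span></div>
"""


class BlockText(HTMLParser):
    """Collect text inside <div id=block_id>, at any nesting depth.

    Rough stand-in for block "div#price_tick" plus selector "*".
    Void tags such as <br> are not handled in this sketch.
    """

    def __init__(self, block_id):
        super().__init__()
        self.block_id = block_id
        self.depth = 0   # greater than 0 while inside the block
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1          # nested element inside the block
        elif tag == "div" and dict(attrs).get("id") == self.block_id:
            self.depth = 1           # entered the block

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1          # reaches 0 when the block closes

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


def scrape(html, block_id):
    """Return the text of every element found inside the block."""
    parser = BlockText(block_id)
    parser.feed(html)
    parser.close()
    return parser.texts


print(scrape(HTML, "price_tick"))  # ['315.25']
```

Text outside the block, such as the company name in the header, is ignored because it is not inside div#price_tick.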

Items and item

Once data is scraped, Scoopi has to associate it with something that identifies it; otherwise its meaning is lost. The items and item elements accomplish that.

items: [ 
  item: { name: "Price", value: "Price" },
]

The items element is an array of item elements. In the example, we define a single item named Price and set its value to the label Price. Once Scoopi scrapes data, say 315.25, from the page, it associates the data with the item named Price, and through the item we can tell that 315.25 denotes the price.
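
The association can be pictured with a short Python sketch. It only illustrates the idea of pairing a scraped value with an item name. The dict layout and the attach function are hypothetical and do not reflect Scoopi's internal data model or its output format.

```python
# Mirrors the items array from the dataDef; this layout is illustrative only.
items = [{"name": "Price", "value": "Price"}]


def attach(items, scraped_values):
    """Pair each item with the value scraped for it, keyed by item name."""
    return {item["name"]: value for item, value in zip(items, scraped_values)}


record = attach(items, ["315.25"])
print(record)  # {'Price': '315.25'}
```

Without the item, the record would hold only the bare string 315.25 and nothing would say what the number represents.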

The output of quickstart contains the stock price but doesn’t tell us the date it applies to. In the next chapter, we explain dimensions, which are used to add the date.