DataDef Dimensions

The previous quickstart example scrapes stock price data from the page and associates it with the item named Price. However, we would also like to know on what date the price was scraped. The dims property allows Scoopi to add one or more dimensions to the scraped data.

In this chapter, we go through the Example 1 job.yml to explain dimensions. The example adds a date dimension to the scraped price data.

Dimensions

The dataDefs snippet from defs/examples/fin/jsoup/ex-1/job.yml is shown below:

dataDefs:

  price:
    query:
      block: "div#price_tick"
      selector: "*"
    items: [ 
      item: { name: "Price", value: "Price" },
    ]
    dims: [ 
      item: { name: "date", script: "document.getFromDate()" },
    ]

It uses the dims array to define the date dimension. The dims array consists of one or more item elements, which are similar to the item elements of the items array.

For the Price item, we hard code the value to Price, but for the date item, we use the script property to set the value. The script is evaluated by the built-in JavaScript engine; here it calls the getFromDate() method on the document object, which returns the date and time when the document was loaded.

The dims array can define multiple dimensions; we explain this in a later chapter.

Items, Dimensions and Axis

Internally, Scoopi represents data as data items, each composed of axes. The concept is similar to the axes of a spreadsheet.

Figure: Scoopi DataDef Dimensions

For the price datadef, Scoopi defines three axes: FACT, ITEM and DATE. The data we are interested in is called the Fact, which corresponds to the value held by a cell in a spreadsheet. The other two axes, ITEM and DATE, say something about the Fact. For example, if the price of a company's stock is 121.80 on 01-01-2018, then the axis values are as below:

FACT    : 121.80
ITEM    : Price
DATE    : 01-01-2018

From the combination of the three axes we deduce that the price as on Jan 1st 2018 is 121.80.
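
To make the spreadsheet analogy concrete, the following is a minimal Python sketch of a data item with the three axes. It is only an illustration of the idea; the class name DataItem and the function describe are hypothetical and are not part of Scoopi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """Hypothetical model of one data item; not a Scoopi class."""
    fact: str  # FACT axis: the scraped value, like a spreadsheet cell
    item: str  # ITEM axis: what the value represents
    date: str  # DATE axis: the date the value applies to


def describe(data_item: DataItem) -> str:
    """Combine the three axes into a readable statement."""
    return f"{data_item.item} as on {data_item.date} is {data_item.fact}"


data_item = DataItem(fact="121.80", item="Price", date="01-01-2018")
summary = describe(data_item)
print(summary)
```

The fact alone (121.80) says nothing by itself; it is the ITEM and DATE axes that give it meaning.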

The item elements defined in either the items or the dims array are internally mapped to axes. Even though the fact axis is not defined in the datadef, Scoopi adds a default fact axis and assigns to it the selector defined in the query element. The value retrieved by the query selector is set as the value of the FACT axis.
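
The mapping described above can be sketched roughly as follows. This is an assumption-laden Python illustration, not Scoopi's actual implementation: the function build_axes, the dictionary layout, and the choice to name a dimension's axis after its upper-cased name are all hypothetical.

```python
def build_axes(datadef: dict) -> list:
    """Return one axis mapping per item in the datadef (illustrative only)."""
    data_items = []
    for item in datadef["items"]:
        axes = {
            # The FACT axis is not declared in the datadef; it is added by
            # default and takes the selector from the query element.
            "FACT": {"selector": datadef["query"]["selector"]},
            # Each entry of the items array becomes the ITEM axis.
            "ITEM": {"value": item["value"]},
        }
        # Each entry of the dims array becomes an additional axis
        # (assumption: the axis is named after the dimension).
        for dim in datadef.get("dims", []):
            axes[dim["name"].upper()] = {
                key: val for key, val in dim.items() if key != "name"
            }
        data_items.append(axes)
    return data_items


# The price datadef from the snippet above, written as a Python dict.
price_def = {
    "query": {"block": "div#price_tick", "selector": "*"},
    "items": [{"name": "Price", "value": "Price"}],
    "dims": [{"name": "date", "script": "document.getFromDate()"}],
}

data_items = build_axes(price_def)
print(list(data_items[0].keys()))
```

The sketch only shows which parts of the datadef feed which axis; running the query and the script to obtain actual values is left out.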

The concepts of dimension, axis and fact are borrowed from Multidimensional Expressions (MDX), a query language used in data warehousing to work with multidimensional data. This model stays flexible as we move on to more complex data definitions: with just a few lines we can scrape a large set of data and associate it in multiple ways.

The next chapter covers multiple items and dynamic queries.