Defs, Locators and Tasks

Scoopi uses YML definition files to extract data from HTML pages. To help you learn the YML elements used in definition files, the Scoopi distribution ships with a set of examples in the defs/examples folder.

Scoopi Definition Files

Scoopi creates the data model from YML definition files. The directory that holds the definition files is specified with the scoopi.defs.dir configuration property, which is normally set in the scoopi.properties file located in the conf folder. By default, it is set to defs/examples/fin/jsoup/quickstart, which loads the quickstart example. As we progress through the examples, edit conf/scoopi.properties and set scoopi.defs.dir to the directory of the example you want to run.

Def file

The def file contains the definitions required to run Scoopi. In the examples, the definition file is named job.yml, but it can have any name as long as the file extension is yml. In other words, Scoopi loads every file in the defs directory that has the yml extension as a definition file.

The top-level elements in job.yml are

  • locatorGroups
  • taskGroups
  • dataDefs
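
As a rough sketch, the overall shape of such a file looks like the following. Only the three top-level keys come from the list above; the comments are placeholders added for illustration, and the body of each element is intentionally left empty.

locatorGroups:
  # named groups of locators (pages to load)

taskGroups:
  # named groups of tasks to run on the loaded pages

dataDefs:
  # definitions used to parse data from the pages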

In this chapter, we go through the Quickstart job.yml and explain the locatorGroups and taskGroups elements. Refer to Scoopi Installation to learn how to run Scoopi and the examples.

LocatorGroups

LocatorGroups defines a list of locators. A locator specifies the name and URL of the HTML page to fetch from the web or the local file system.

In the example job.yml, locatorGroups is defined as

defs/examples/fin/jsoup/quickstart/job.yml

locatorGroups:

  snapshotGroup:
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
    ]

It defines a locatorGroup named snapshotGroup, which in turn defines one locator. The locator name is acme and its url points to the local HTML file acme-snapshot.html in the defs/examples/fin/page folder.

Here is one more example with two groups

locatorGroups:

  groupA:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" },
       { name: exPage, url: "http://example.org" }    
    ]

  groupB:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-bs.html" }  
    ]

It defines two locatorGroups named groupA and groupB. The first group defines two locators and the second group defines one locator. To scrape pages from a website, specify the actual address of the page, such as http://example.org.

Please note that the above examples use the JSON-style array construct (YAML flow style) with [ ] and { }, because it lets us define one locator per line. You are free to use the slightly lengthier block-style YML array construct instead, as shown below

locatorGroups:

  groupA:
    locators:
       - name: acme
         url: "/defs/examples/fin/page/acme-snapshot.html"
       - name: exPage
         url: "http://example.org"

TaskGroups

Once a locator is loaded, Scoopi has to run some task on it. The taskGroups property defines the tasks to execute for the page loaded by the locator.

The taskGroups snippet from the example job.yml is

defs/examples/fin/jsoup/quickstart/job.yml

taskGroups:

  snapshotGroup:
    priceTask:
      dataDef: price

The taskGroups element defines a task group named snapshotGroup. The task group has a task named priceTask with a property named dataDef whose value is price.

Scoopi executes this task on all locators defined for snapshotGroup in locatorGroups. The shared group name is what ties the task to its locators.
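
To see how the two elements line up, the sketch below places the quickstart locatorGroups and taskGroups side by side. It only combines the snippets shown earlier in this chapter; the comments are explanatory additions and are not part of the shipped job.yml.

locatorGroups:

  snapshotGroup:            # locator group name
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
    ]

taskGroups:

  snapshotGroup:            # same name as the locator group above
    priceTask:              # task run for each locator in the group
      dataDef: price        # dataDef used by this task

With this layout, adding a second locator to snapshotGroup should cause priceTask to run on that page as well, since the task applies to every locator in the group.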

At this point, Scoopi knows

  • which pages to download or load
  • which tasks to execute for which page
  • which dataDef to use for a task

In the next chapter, we describe dataDefs, which are used to parse data from the page.