Defs, Locators and Tasks

Scoopi uses YML definition files to extract data from HTML pages. To help you learn the YML elements used in definition files, the Scoopi distribution ships with a set of examples under the defs/examples folder.

Scoopi Definition Files

Scoopi creates its data model from YML definition files. The directory that holds the definition files is specified with the scoopi.defs.dir configuration property, which is normally set in a file located in the conf folder. By default, it is set to defs/examples/fin/jsoup/quickstart, which loads the quickstart example. As we progress through the examples, edit the configuration file in the conf folder and set scoopi.defs.dir to the directory of the example you want to run.

Def file

The def file contains the definitions required to run Scoopi. In the examples, the definition file is named job.yml, but it can have any name as long as the file extension is yml. In other words, Scoopi loads every file in the defs directory that has the yml extension as a definition file.

The top-level elements in job.yml are

  • locatorGroups
  • taskGroups
  • dataDefs
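
As a rough sketch, the overall shape of a definition file with these three top-level elements is shown below. The comments are only placeholders for the nested content, which is left out here on purpose.

    locatorGroups:
      # groups of locators (the pages to load)
    taskGroups:
      # groups of tasks to run on the loaded pages
    dataDefs:
      # definitions used to parse data from a page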

In this chapter, we go through the Quickstart job.yml and explain the locatorGroups and taskGroups elements. Refer to Scoopi Installation to learn how to run Scoopi and the examples.

Locators


The locatorGroups element defines groups of locators. A locator specifies the name and URL of the HTML page to fetch from the web or the local file system.

In the example job.yml, locatorGroups is defined as



    locatorGroups:
      snapshotGroup:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
        ]

It defines a locator group named snapshotGroup, which in turn defines one locator. The locator's name is acme and its url points to the local HTML file acme-snapshot.html, which is in the defs/examples/fin/page folder.

Here is one more example with two groups


    locatorGroups:
      groupA:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" },
          { name: exPage, url: "" }
        ]
      groupB:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-bs.html" }
        ]

It defines two locator groups named groupA and groupB. The first group defines two locators and the second group defines one locator. To scrape pages from a website, specify the actual address of the page as the url.
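
For illustration, a locator for a page on the web might be sketched as follows. The group name webGroup, the locator name examplePage and the example.org address are hypothetical placeholders and are not part of the Scoopi examples.

    locatorGroups:
      # hypothetical group with one locator that points to a web address
      webGroup:
        locators: [
          { name: examplePage, url: "https://example.org/page.html" }
        ]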

Please note that the above examples use the JSON array construct, with [ ] and { }, because it lets us define one locator per line. However, you are free to use the slightly lengthier YML array construct, as shown below


    locators:
      - name: acme
        url: "/defs/examples/fin/page/acme-snapshot.html"
      - name: exPage
        url: ""

Tasks


Once a locator is loaded, Scoopi has to run some task on it. The taskGroups element defines the tasks to execute for the page loaded by the locator.

The snippet from the example job.yml with locatorGroups and taskGroups is



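    locatorGroups:
      snapshotGroup:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
        ]

    taskGroups:
      snapshotGroup:
        priceTask:
          dataDef: price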

The taskGroups element defines a task group named snapshotGroup. The task group has a task named priceTask with a property named dataDef whose value is price.

Scoopi executes this task on all locators defined for snapshotGroup in locatorGroups.

At this point, Scoopi knows

  • which pages to download or load
  • which tasks to execute for which page
  • which dataDef to use for a task
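
To illustrate how group names tie locators to tasks, here is a sketch with two groups. It assumes, as described above, that the tasks of a task group run on the locators of the locator group with the same name. The task names snapshotTask and bsTask and the dataDef names snapshot and bs are hypothetical and chosen only for this illustration.

    locatorGroups:
      groupA:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
        ]
      groupB:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-bs.html" }
        ]

    taskGroups:
      # tasks in groupA run on the locators of groupA
      groupA:
        snapshotTask:
          dataDef: snapshot
      # tasks in groupB run on the locators of groupB
      groupB:
        bsTask:
          dataDef: bs

Under that assumption, the acme-snapshot.html page is processed with the snapshot dataDef and the acme-bs.html page with the bs dataDef.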

In the next chapter, we describe dataDefs, which are used to parse data from the page.