Defs, Locators and Tasks

Scoopi uses a set of YML definition files to extract data from HTML pages. To help you learn the YML elements used in these files, the Scoopi distribution comes with a set of examples in the defs/examples/jsoup folder. The examples are named ex-1, ex-2 and so on, in order of increasing complexity.

Scoopi Definition Files

Scoopi creates the data model from YML definition files. The directory that holds the definition files is specified with the scoopi.defs.dir configuration property, which is normally set in the scoopi.properties file located in the conf folder. By default, it is set to defs/examples/jsoup/ex-1, which loads Example 1. As we progress through the examples, edit conf/scoopi.properties and set scoopi.defs.dir to the directory of the specific example.

Def file

The def file contains the definitions required to run Scoopi. In the examples, the definition file is named job.yml, but it can have any name as long as the file extension is yml. In other words, Scoopi loads any file in the defs directory with the yml extension as a definition file.

The top-level elements in job.yml are

  • locatorGroups
  • taskGroups
  • dataDefs
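Put together, a definition file has roughly the following shape. This is only an outline, not a working definition: the group name someGroup is a placeholder, and the contents of each element are left as comments because they are covered in the sections and chapters that follow.

```yaml
# Outline of a definition file such as job.yml.
# someGroup is a placeholder name used only for illustration.
locatorGroups:
  someGroup:
    # locators: the pages to load for this group

taskGroups:
  someGroup:
    # tasks to execute for the pages of this group

dataDefs:
  # data definitions referred to by the tasks
```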

In this chapter, we go through the Example 1 job.yml and explain the locatorGroups and taskGroups elements. Refer to Scoopi Installation to learn how to run Scoopi and the examples.

LocatorGroups

LocatorGroups defines a list of locators. Each locator specifies the name and URL of an HTML page to fetch from the Internet or the local file system.

In the example job.yml, the locatorGroups is defined as

defs/examples/fin/jsoup/ex-1/job.yml

locatorGroups:

  quoteGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }  
    ]

It defines a locator group named quoteGroup, which in turn defines one locator. The locator name is acme and its url points to the local HTML file acme-snapshot.html, which is in the defs/examples/fin/page folder.

Here is one more example with two groups

locatorGroups:

  groupA:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" },
       { name: exPage, url: "http://example.org" }    
    ]

  groupB:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-bs.html" }  
    ]

It defines two locator groups named groupA and groupB. The first group defines two locators and the second group defines one locator. To scrape pages from a website, specify the actual address of the page, such as http://example.org.

Note that the above examples use the JSON-style array construct with [ ] and { }, which lets us define one locator per line. Alternatively, you are free to use the slightly lengthier YML array construct, as shown below

locatorGroups:

  groupA:
    locators:
       - name: acme
         url: "/defs/examples/fin/page/acme-snapshot.html"
       - name: exPage
         url: "http://example.org"

TaskGroups

The taskGroups element defines the tasks to execute for the pages loaded by the locators.

The snippet from example job.yml with locatorGroups and taskGroups is

defs/examples/fin/jsoup/ex-1/job.yml

locatorGroups:

  quoteGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }  
    ]

taskGroups:

  quoteGroup:
    priceTask:
      dataDef: price

The taskGroups element defines a task group named quoteGroup. The task group has a task named priceTask with a property named dataDef, whose value is price.

Scoopi executes this task for all locators defined for quoteGroup in locatorGroups. The above example defines only one locator for the group quoteGroup, so the task is executed for the HTML page loaded by that locator.
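To illustrate how a task group applies to every locator in the locator group of the same name, here is a variation of the snippet with a second locator added to quoteGroup. This is a sketch, not part of Example 1: the exPage locator is borrowed from the earlier groupA example purely for illustration, and whether the price dataDef would find anything useful on that page is not considered here.

```yaml
locatorGroups:

  quoteGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" },
       # assumption: second locator added only to show two pages in one group
       { name: exPage, url: "http://example.org" }
    ]

taskGroups:

  # same name as the locator group above
  quoteGroup:
    priceTask:
      dataDef: price
```

With this definition, priceTask would be executed once for the page loaded by acme and once for the page loaded by exPage.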

At this point, Scoopi knows

  • which pages to download or load
  • which tasks to execute for which page
  • the dataDef to use for a task

In the next chapter, we describe dataDefs, which are used to parse the data from the page.