Defs, Locators and Tasks
Scoopi uses a set of YML definition files to extract data from HTML pages.
To help you learn the YML elements used in the definition files, the Scoopi
distribution comes with a set of examples in the defs/examples/fin/jsoup folder.
The examples are named ex-1, ex-2 and so on, each more complex than the last.
Scoopi Definition Files
Scoopi creates the data model from the YML definition files. The location
of the definition files is specified with the scoopi.defs.dir configuration
property, which is normally set in the scoopi.properties file located in the
conf folder. By default, it is set to defs/examples/fin/jsoup/ex-1, which
loads Example 1. As you progress through the examples, edit
conf/scoopi.properties and set the scoopi.defs.dir property to the directory
of the specific example.
Def file
The def file holds the definitions required to run Scoopi. In the examples,
the definition file is named job.yml, but it can have any name as long as
the file extension is yml. In other words, Scoopi loads every file in the
defs directory that has the yml extension as a definition file.
The top-level elements in job.yml are
- locatorGroups
- taskGroups
- dataDefs
In this chapter, we go through the Example 1 job.yml and explain the locatorGroups and taskGroups elements. Refer to Scoopi Installation to learn how to run Scoopi and the examples.
LocatorGroups
LocatorGroups defines lists of locators. Each locator specifies the name and URL of an HTML page to fetch from the Internet or the local file system.
In the example job.yml, locatorGroups is defined as
defs/examples/fin/jsoup/ex-1/job.yml
    locatorGroups:
      quoteGroup:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
        ]
It defines a locator group named quoteGroup, which in turn defines one
locator. The locator's name is acme and its url points to the local HTML
file acme-snapshot.html in the defs/examples/fin/page folder.
Here is one more example with two groups
    locatorGroups:
      groupA:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" },
          { name: exPage, url: "http://example.org" }
        ]
      groupB:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-bs.html" }
        ]
It defines two locator groups named groupA and groupB. The first group defines two locators and the second group defines one locator. To scrape pages from a website, we need to specify the actual address of the page, such as http://example.org.
Please note that the above examples use the JSON-style array construct, with [ ] and { }, because it lets us define one locator per line. Alternatively, you are free to use the slightly lengthier YML array construct, as shown below
    locatorGroups:
      groupA:
        locators:
          - name: acme
            url: "/defs/examples/fin/page/acme-snapshot.html"
          - name: exPage
            url: "http://example.org"
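For comparison, the complete two-group example shown earlier can be written entirely with the block-style array construct. This is only a sketch: it assumes the two notations are interchangeable for every group, and it reuses the names and URLs already used in this chapter.

```yaml
# Block-style equivalent of the earlier two-group example.
# Each "-" starts one locator; name and url are its two properties.
locatorGroups:
  groupA:
    locators:
      - name: acme
        url: "/defs/examples/fin/page/acme-snapshot.html"
      - name: exPage
        url: "http://example.org"
  groupB:
    locators:
      - name: acme
        url: "/defs/examples/fin/page/acme-bs.html"
```

Both notations are standard YAML sequences of mappings, so a YAML parser reads them into the same structure. The choice between them is a matter of readability.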
TaskGroups
The taskGroups property defines the tasks to be executed for the pages loaded by the locators.
The snippet from the example job.yml with locatorGroups and taskGroups is
defs/examples/fin/jsoup/ex-1/job.yml
    locatorGroups:
      quoteGroup:
        locators: [
          { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
        ]
    taskGroups:
      quoteGroup:
        priceTask:
          dataDef: price
The taskGroups element defines a task group named quoteGroup. The task group has a task named priceTask with a property named dataDef whose value is price.
Scoopi executes this task for every locator defined for quoteGroup in locatorGroups. The example above defines only one locator for the group quoteGroup, so the task is executed for the HTML page loaded by that locator.
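To illustrate how a task is applied to every locator in its group, the sketch below adds a second locator to quoteGroup. The exPage locator and its URL are borrowed from the earlier two-group example and are purely illustrative. Whether the price dataDef can extract anything from that page is not considered here.

```yaml
# Illustrative only: quoteGroup now holds two locators.
locatorGroups:
  quoteGroup:
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" },
      { name: exPage, url: "http://example.org" }   # hypothetical second page
    ]

# The task group shares the name quoteGroup with the locator group,
# which is what ties priceTask to the two locators above.
taskGroups:
  quoteGroup:
    priceTask:
      dataDef: price
```

With a definition like this, priceTask would be expected to run twice, once for the page loaded by acme and once for the page loaded by exPage.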
At this point, Scoopi knows
- which pages to download or load
- which tasks to execute for which page
- the dataDef to use for a task
In the next chapter, we describe dataDefs, which are used to parse the data from the page.