Multiple Tasks

Scoopi can execute multiple tasks for a locator and also multiple tasks on multiple locatorGroups.

Multiple tasks and single locator group

Example 1 scrapes price data from the acme-quote.html page, while Example 2 extracts snapshot data from the same page. One option is to define the acme-quote.html locator in two locatorGroups and assign a task to each of them, but this unnecessarily downloads acme-quote.html twice. Instead, it is better to define a single locatorGroup and assign it two tasks, priceTask and snapshotTask, so that the page is downloaded only once.

Example 6 extracts both price and snapshot data from the acme-quote.html page.

To run multiple tasks, we define taskGroups as below.

defs/examples/fin/jsoup/ex-6/job.yml

locatorGroups:

  quoteGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
    ]

taskGroups:

  quoteGroup:

    priceTask:
      dataDef: price

    snapshotTask:
      dataDef: snapshot

The above snippet defines two tasks: the first, priceTask, applies the price dataDef, and the second, snapshotTask, applies the snapshot dataDef to the locators of quoteGroup.

Multiple tasks and multiple locator groups

Example 7 extends the previous one and executes multiple tasks on multiple locator groups. It scrapes price and snapshot data from the acme-quote.html page and bs data from the acme-bs.html page. The task and locator snippet from the example is as below.

defs/examples/jsoup/ex-7/job.yml

locatorGroups:

  quoteGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot.html" }
    ]

  bsGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-bs.html" }
    ]

taskGroups:

  quoteGroup:

    priceTask:
      dataDef: price

    snapshotTask:
      dataDef: snapshot

  bsGroup:

    bsTask:
      dataDef: bs

It defines two locator groups and two matching task groups: quoteGroup and bsGroup. The first task group, quoteGroup, defines two tasks, priceTask and snapshotTask, and the second task group, bsGroup, defines a single task, bsTask.

In all, Scoopi executes three tasks:

  1. priceTask parses the acme-quote.html locator with the price dataDef.
  2. snapshotTask parses the same instance of acme-quote.html again, this time with the snapshot dataDef.
  3. bsTask parses acme-bs.html with the bs dataDef.
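
The way tasks fan out over locator groups can be modelled with a short standalone Python sketch. It is only an illustration of the pairing described above and not Scoopi's implementation: the function name plan_tasks and the dictionary layout are assumptions made for this example, while the group names, task names, dataDefs and URLs are copied from the snippet above.

```python
# Illustrative model only; this is not how Scoopi is implemented.
# Group names, task names, dataDefs and URLs mirror the job.yml snippet.
locator_groups = {
    "quoteGroup": [
        {"name": "acme", "url": "/defs/examples/fin/page/acme-snapshot.html"},
    ],
    "bsGroup": [
        {"name": "acme", "url": "/defs/examples/fin/page/acme-bs.html"},
    ],
}

task_groups = {
    "quoteGroup": {
        "priceTask": {"dataDef": "price"},
        "snapshotTask": {"dataDef": "snapshot"},
    },
    "bsGroup": {
        "bsTask": {"dataDef": "bs"},
    },
}


def plan_tasks(locator_groups, task_groups):
    """Pair each locator with every task of the task group of the same name.

    Returns the list of pages to download (one per locator) and the list
    of (group, task, dataDef, url) tuples to execute.
    """
    downloads = []
    planned = []
    for group, locators in locator_groups.items():
        for locator in locators:
            # The page is recorded once per locator, however many tasks use it.
            downloads.append(locator["url"])
            for task_name, task in task_groups.get(group, {}).items():
                planned.append((group, task_name, task["dataDef"], locator["url"]))
    return downloads, planned


downloads, planned = plan_tasks(locator_groups, task_groups)
for group, task_name, data_def, url in planned:
    print(f"{group}.{task_name}: dataDef={data_def}, page={url}")
print(f"{len(downloads)} downloads, {len(planned)} tasks")
```

In this model the two locators produce two downloads and three tasks, because both tasks of quoteGroup share the single page loaded for its locator.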

The output of all three tasks goes to output/data.txt.

But there is a problem in the output data: the date for the bs data is in MMM 'YY format, while the price and snapshot dates are in ISO date format. If we try to import the output file into a database, the import fails.
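
To make the mismatch concrete, the following standalone Python sketch converts a MMM 'YY value to an ISO date. It is not a Scoopi converter. The function name month_year_to_iso and the sample values are hypothetical, and because MMM 'YY carries no day, the sketch assumes the last day of the month; a different convention may suit your data. It also assumes English month abbreviations, since %b depends on the locale.

```python
import calendar
from datetime import date, datetime


def month_year_to_iso(text: str) -> str:
    """Convert a value such as "Dec '16" to an ISO date string.

    Assumption: the day is set to the last day of the month, because the
    MMM 'YY format does not carry one.
    """
    # Accept typographic apostrophes as well as the plain ASCII one.
    normalized = text.strip().replace("\u2018", "'").replace("\u2019", "'")
    parsed = datetime.strptime(normalized, "%b '%y")
    last_day = calendar.monthrange(parsed.year, parsed.month)[1]
    return date(parsed.year, parsed.month, last_day).isoformat()


# Hypothetical sample values, not taken from the example pages.
print(month_year_to_iso("Dec '16"))  # 2016-12-31
print(month_year_to_iso("Feb '16"))  # 2016-02-29
```

Once every date in the output file uses the same ISO form, a database import can read the date column with a single format.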

We have to plug in a converter after the filter and before the data is appended to the output file, to format or change the value. In the next chapter, we explain the Scoopi workflow, the step and plugin design, and show how to override the default workflow steps.