Create Locators from Links

Defining each and every link in job.xml by hand would make the definitions lengthy. Instead, Scoopi can scrape links from any page and create locators from them dynamically. This feature allows you to scrape web pages recursively. Let’s see how to create locators from scraped links.

Example 9 scrapes the Balance Sheet and Profit & Loss links from the acme-snapshot-links.html page. The links snippet in the HTML page is shown below.

defs/examples/fin/page/acme-snapshot-links.html


<!-- links to other pages -->
<div id="page_links">
     <li><strong>Financial</strong></li>
     <li><a href="acme-bs.html">Balance Sheet</a></li>
     <li><a href="acme-pl.html">Profit & Loss</a></li>
</div>

In job.xml, instead of defining locators for bs and pl, we define just one locator for acme-snapshot-links.html and a task named linkTask that scrapes the links and converts them to locators.

locatorGroups:
  snapshotGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot-links.html" }
    ]

taskGroups:
  snapshotGroup:

    priceTask:
      dataDef: price

    linkTask:
      dataDef: links
      steps:
        jsoupDefault:
          process:
            class: "org.codetab.scoopi.step.process.LocatorCreator"
            previous: parser
            next: seeder

In the jsoupDefault steps, the parser step hands over data to the filter step, which in turn hands over the filtered data to the appender. The workflow is

seeder -> loader -> parser -> filter -> appender

But linkTask overrides this. It inserts a new process step with the step class org.codetab.scoopi.step.process.LocatorCreator, which creates a new locator from the parsed link and hands it over to the seeder step. The workflow becomes

 seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> filter -> appender
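The effect of the previous and next properties on the step chain can be sketched in a few lines of Python. This is only an illustration of the idea, not Scoopi's implementation; the function names override_chain and walk are hypothetical, and the chain is modeled as a plain dictionary mapping each step to the step it hands over to.

```python
# Default jsoupDefault chain from the text, modeled as step -> next step.
DEFAULT_CHAIN = {
    "seeder": "loader",
    "loader": "parser",
    "parser": "filter",
    "filter": "appender",
    "appender": None,  # end of the chain
}


def override_chain(chain, name, previous, next_step):
    """Return a copy of chain where `previous` hands over to the new
    step `name`, and `name` hands over to `next_step`."""
    new_chain = dict(chain)
    new_chain[previous] = name
    new_chain[name] = next_step
    return new_chain


def walk(chain, start, stop_after):
    """Follow the chain from `start` and stop once `stop_after` has run."""
    steps = []
    step = start
    while step is not None:
        steps.append(step)
        if step == stop_after:
            break
        step = chain[step]
    return steps


# linkTask: insert a process step between parser and seeder.
link_chain = override_chain(
    DEFAULT_CHAIN, "process", previous="parser", next_step="seeder"
)

# First pass runs linkTask's chain up to the locator creator; the new
# locator then goes through the default chain.
first_pass = walk(link_chain, "seeder", stop_after="process")
second_pass = walk(DEFAULT_CHAIN, "seeder", stop_after="appender")
workflow = " -> ".join(first_pass + second_pass)
print(workflow)
```

The printed chain has the same shape as the workflow above: the process step replaces filter and appender in the first pass, and the newly seeded locator runs through the unmodified chain.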

The linkTask uses a dataDef named links, which defines the link to scrape as follows

links:
    query:
      block: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
    items: [
      item: { name: "link", linkGroup: bsGroup, index: 2,
              selector: "li:nth-child(%{index}) > a attribute: href",
              prefix: [ "/defs/examples/fin/page/" ] },
    ]
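To make the selector string easier to read, the sketch below splits it into its parts. It assumes that the %{index} placeholder is replaced by the item's index value and that the trailing "attribute: href" names the attribute to read from the selected element; the helper expand_selector is hypothetical and is not part of Scoopi.

```python
def expand_selector(template, index):
    """Split a selector template into a CSS selector and an attribute name.

    Assumption: "%{index}" is substituted with the item's index, and the
    text after " attribute: " is the attribute whose value is scraped.
    """
    selector, _, attribute = template.partition(" attribute: ")
    selector = selector.replace("%{index}", str(index))
    return selector, (attribute or None)


css, attr = expand_selector("li:nth-child(%{index}) > a attribute: href", 2)
print(css)   # li:nth-child(2) > a
print(attr)  # href
```

With index 2 the selector picks the second li of the block, which in the links snippet is the Balance Sheet entry.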

The item defines the data item that holds the scraped link. The linkGroup is the name of the task group that is assigned to the newly created locator. Let’s clarify this aspect in detail. The task group of linkTask is snapshotGroup, so the parsed link initially belongs to snapshotGroup. But none of the dataDefs defined in snapshotGroup can parse the acme-bs.html page. Only the tasks in bsGroup can parse acme-bs.html, as they use the bs dataDef. So we need to assign the newly created bs locator to bsGroup, which is specified with the linkGroup property of the item. The change of group in the workflow is shown below.

 |-----            snapshotGroup                    -----|-----            bsGroup            -----|

 seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> filter -> appender
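The group switch can be pictured as a small work queue. The sketch below is a simplified model under stated assumptions, not Scoopi's scheduler: the task name bsTask, the tables TASK_GROUPS and SCRAPED_LINKS, and the reuse of the locator name acme for the new locator are all hypothetical stand-ins for what the definitions and the parser would provide.

```python
from collections import deque

# Hypothetical stand-in: tasks of each task group and the dataDef they use.
TASK_GROUPS = {
    "snapshotGroup": {"priceTask": "price", "linkTask": "links"},
    "bsGroup": {"bsTask": "bs"},
}

# Hypothetical parse result of the links dataDef for locator "acme":
# (prefixed link, linkGroup of the item).
SCRAPED_LINKS = {
    "acme": [("/defs/examples/fin/page/acme-bs.html", "bsGroup")],
}


def run(seed_locators):
    """Process locators as (name, url, group) and record which group's
    tasks handle each URL."""
    queue = deque(seed_locators)
    trace = []
    while queue:
        name, url, group = queue.popleft()
        for task, data_def in TASK_GROUPS[group].items():
            trace.append((group, task, url))
            if data_def == "links":
                # Locator creator: the new locator gets the item's
                # linkGroup, not the group of linkTask.
                for href, link_group in SCRAPED_LINKS.get(name, []):
                    queue.append((name, href, link_group))
    return trace


trace = run(
    [("acme", "/defs/examples/fin/page/acme-snapshot-links.html", "snapshotGroup")]
)
for group, task, url in trace:
    print(group, task, url)
```

The snapshot page is handled only by the snapshotGroup tasks, while the scraped acme-bs.html locator is queued under bsGroup and is therefore handled only by the task that uses the bs dataDef.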

Prefix

Web pages use absolute or relative links. The example acme-snapshot-links.html uses relative links, as shown below.

defs/examples/fin/page/acme-snapshot-links.html


<!-- links to other pages -->
<div id="page_links">
     <li><strong>Financial</strong></li>
     <li><a href="acme-bs.html">Balance Sheet</a></li>
     <li><a href="acme-pl.html">Profit & Loss</a></li>
</div>

We have to prefix the path /defs/examples/fin/page/ to the scraped link value acme-bs.html; otherwise the loader is not able to load the page. Use the prefix property of the item to add any prefix to the item value.
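As a rough illustration of what the prefix does to a scraped value, the following Python sketch collects the href values from the snippet with the standard library's html.parser and prepends the prefix. The names HrefCollector and apply_prefixes are hypothetical, and how Scoopi combines more than one entry in the prefix list is an assumption here; the example uses only the single prefix from the dataDef.

```python
from html.parser import HTMLParser

SNIPPET = """
<div id="page_links">
     <li><strong>Financial</strong></li>
     <li><a href="acme-bs.html">Balance Sheet</a></li>
     <li><a href="acme-pl.html">Profit & Loss</a></li>
</div>
"""


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def apply_prefixes(value, prefixes):
    """Prepend each prefix to the value (assumed behaviour for a list)."""
    for prefix in prefixes:
        value = prefix + value
    return value


collector = HrefCollector()
collector.feed(SNIPPET)

prefixes = ["/defs/examples/fin/page/"]
locator_urls = [apply_prefixes(href, prefixes) for href in collector.hrefs]
print(locator_urls)
```

The relative value acme-bs.html becomes /defs/examples/fin/page/acme-bs.html, which is a path the loader can resolve in the same way as the url of the acme locator.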

Example 10 combines all the definitions we have used so far (Examples 1 to 9) - links, price, snapshot, bs and pl - into a single job that outputs all the data to the data.txt file.

The next chapter shows how to flip through pages with pagination and scrape data.