Create Locators from Links

The definitions become lengthy if we define each and every link in job.yml. Instead, we can use Scoopi to scrape links from a start page and create locators dynamically. This feature allows you to scrape web pages recursively. Let's see how to create locators from scraped links.

Example 9 scrapes the Balance Sheet and Profit & Loss links from the acme-snapshot-links.html page. The links snippet in the HTML page is as below.


defs/examples/fin/page/acme-snapshot-links.html


<!-- links to other pages -->
<div id="page_links">
     <li><strong>Financial</strong></li>
     <li><a href="acme-bs.html">Balance Sheet</a></li>
     <li><a href="acme-pl.html">Profit & Loss</a></li>
</div>

In job.yml, instead of locators for bs and pl, we define just a locator for acme-snapshot-links.html and a task to scrape the links and convert them to locators.


locatorGroups:

  quoteGroup:
    locators: [
       { name: acme, url: "/defs/examples/fin/page/acme-snapshot-links.html" }
    ]

taskGroups:

  quoteGroup:

    linkTask:
      dataDef: links
      steps:
        jsoupDefault:
          process:
            class: "org.codetab.scoopi.step.convert.LocatorCreator"
            previous: parser
            next: seeder

In the default steps jsoupDefault, the parser step hands over data to the process step, which filters it and in turn hands over the filtered data to the converter. The workflow is

seeder -> loader -> parser -> process (filter) -> converter

But the task linkTask overrides the process step of jsoupDefault with a local step which executes org.codetab.scoopi.step.convert.LocatorCreator. It creates new locators from the scraped links and hands them over to the seeder step, and the workflow becomes

 seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> process (filter) -> converter
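
To illustrate what LocatorCreator produces, the locator created from the bs member would be equivalent to defining the following locator by hand. This is an illustrative sketch only; the actual locator is built internally by Scoopi from the member name, its linkGroup and the prefixed link value.

```yaml
# Illustrative only: the locator LocatorCreator would seed for member bs.
# name comes from the member name, the group from its linkGroup property,
# and the url from prefix + scraped href value.
locatorGroups:

  bsGroup:
    locators: [
       { name: bs, url: "/defs/examples/page/acme-bs.html" }
    ]
```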

The linkTask uses a dataDef named links, where we define the members that hold the scraped links, as follows.


links:
  axis:
    fact:
      query:
        region: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
        field: "li:nth-child(%{row.index}) > a[href]"
        attribute: "href"
      prefix: [ "/defs/examples/page/" ]
    col:
      query:
        script: "configs.getRunDateTime()"
      members: [
        member: {name: date},
      ]
    row:
      query:
        region: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
        field: "li:nth-child(%{row.index}) > a[href]"
      members: [
        member: {name: bs, index: 2, linkGroup: bsGroup},
        member: {name: pl, index: 3, linkGroup: plGroup},
      ]
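
The %{row.index} placeholder in the fact and row queries is substituted with the index of each row member. For the members defined above, the effective selectors work out as follows (illustrative):

```yaml
# %{row.index} substitution (illustrative):
#   member bs (index: 2) -> "li:nth-child(2) > a[href]" -> href value "acme-bs.html"
#   member pl (index: 3) -> "li:nth-child(3) > a[href]" -> href value "acme-pl.html"
```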

The row axis defines two members, bs and pl, to hold the scraped links. The linkGroup is the name of the task group that is to be set on the newly created locator. Let's clarify this aspect in detail. The task group of the task linkTask is quoteGroup, so the parsed member bs initially belongs to the task group quoteGroup. But no dataDef of the tasks in quoteGroup is able to parse the acme-bs.html page. Only tasks in bsGroup are able to parse acme-bs.html, as they use the bs dataDef. So we need to assign the newly created bs locator to bsGroup, which is specified using the linkGroup property of the member.
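
For reference, a task in bsGroup that handles the created locator might look like the sketch below. This is a hedged sketch based on the earlier bs example; the exact task name and any step overrides may differ in the full example definitions.

```yaml
# Illustrative sketch: a bsGroup task that parses acme-bs.html
# using the bs dataDef (task name bsTask is assumed here).
taskGroups:

  bsGroup:

    bsTask:
      dataDef: bs
```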

 |-----            quoteGroup                       -----|  |-----                bsGroup                  -----|

 seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> process (filter) -> converter

Prefix

Web pages use absolute or relative links. The example acme-snapshot-links.html uses relative links, as shown below.

defs/examples/fin/page/acme-snapshot-links.html


<!-- links to other pages -->
<div id="page_links">
     <li><strong>Financial</strong></li>
     <li><a href="acme-bs.html">Balance Sheet</a></li>
     <li><a href="acme-pl.html">Profit & Loss</a></li>
</div>

We have to prefix the path /defs/examples/page/ to the link value acme-bs.html; otherwise, the loader is not able to load the page. Use the prefix property of an axis to add a prefix to the axis value. In the example, as the scraped link value is in the fact axis, the fact axis is defined as


links:
  axis:
    fact:
      query:
        region: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
        field: "li:nth-child(%{row.index}) > a[href]"
        attribute: "href"
      prefix: [ "/defs/examples/page/" ]
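
With this prefix, the loader receives a full path instead of the bare href value. The resulting locator URL works out as follows (illustrative):

```yaml
# prefix + scraped href (illustrative):
#   "/defs/examples/page/" + "acme-bs.html" -> "/defs/examples/page/acme-bs.html"
```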

Example 10 combines all the definitions we have used so far (Examples 1 to 9) - links, price, snapshot, bs and pl - into a single job which outputs all the data to the data.txt file.

In the next chapter, we set up a database and use the persistence feature of Scoopi.