Create Locators from Links
The definitions would become lengthy if we defined each and every link in job.yml. Instead, we can use Scoopi to scrape links from a start page and create locators dynamically. This feature allows you to scrape web pages recursively. Let's see how to create locators from scraped links.
Link Scrape Step
Example 9 scrapes the Balance Sheet and Profit & Loss links from the acme-snapshot-links.html page. The links snippet in the HTML page is shown below.
defs/examples/fin/page/acme-snapshot-links.html
<!-- links to other pages -->
<div id="page_links">
<li><strong>Financial</strong></li>
<li><a href="acme-bs.html">Balance Sheet</a></li>
<li><a href="acme-pl.html">Profit & Loss</a></li>
</div>
In job.yml, instead of locators for bs and pl, we just define a locator for acme-snapshot-links.html and a task to scrape and convert the links.
locatorGroups:
  quoteGroup:
    locators: [
      { name: acme, url: "/defs/examples/fin/page/acme-snapshot-links.html" }
    ]

taskGroups:
  quoteGroup:
    linkTask:
      dataDef: links
      steps:
        jsoupDefault:
          process:
            class: "org.codetab.scoopi.step.convert.LocatorCreator"
            previous: parser
            next: seeder
In the default steps jsoupDefault, the parser step hands over the data to the process step for filtering, which in turn hands over the filtered data to the converter. The workflow is

seeder -> loader -> parser -> process (filter) -> converter

But the task linkTask overrides the process step of jsoupDefault with a local step that executes org.codetab.scoopi.step.convert.LocatorCreator, which creates new locators from the links and hands them over to the seeder step, and the workflow becomes

seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> process (filter) -> converter
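To make the handover concrete, here is a minimal sketch of the hand-written equivalents of the locators that LocatorCreator effectively seeds at runtime for the two links in the snippet above. This is an illustration only, not output of the tool: the names and groups come from the links dataDef described in the next section, and each URL is the scraped href combined with the prefix covered in the Prefix section.

# sketch: hand-written equivalents of the dynamically created locators
locatorGroups:
  bsGroup:
    locators: [
      { name: bs, url: "/defs/examples/page/acme-bs.html" }
    ]
  plGroup:
    locators: [
      { name: pl, url: "/defs/examples/page/acme-pl.html" }
    ]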
DataDef for Links
The linkTask uses a dataDef named links, where we define the members that hold the scraped links, as follows.
links:
  axis:
    fact:
      query:
        region: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
        field: "li:nth-child(%{row.index}) > a[href]"
        attribute: "href"
      prefix: [ "/defs/examples/page/" ]
    col:
      query:
        script: "configs.getRunDateTime()"
      members: [
        member: { name: date },
      ]
    row:
      query:
        region: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
        field: "li:nth-child(%{row.index}) > a[href]"
      members: [
        member: { name: bs, index: 2, linkGroup: bsGroup },
        member: { name: pl, index: 3, linkGroup: plGroup },
      ]
The row axis defines two members, bs and pl, to hold the scraped links. The linkGroup is the name of the task group that is set on the newly created locator. Let's clarify this aspect in detail. The task group of the task linkTask is quoteGroup, so the parsed member bs initially belongs to the task group quoteGroup. But no dataDef of the tasks in quoteGroup is able to parse the acme-bs.html page; only tasks in bsGroup are able to parse acme-bs.html, as they use the bs dataDef. So we need to assign the newly created bs locator to bsGroup, which is specified using the linkGroup property of the member.
|--------------------- quoteGroup --------------------|    |------------------------ bsGroup ------------------------|
seeder -> loader -> parser -> process (locator creator) -> seeder -> loader -> parser -> process (filter) -> converter
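For context, a minimal sketch of the task groups the new locators land in. The task names bsTask and plTask are hypothetical placeholders; only the group names and dataDef names come from the example.

# hypothetical sketch: task groups that can parse the linked pages
taskGroups:
  bsGroup:
    bsTask:           # task name is an assumption for illustration
      dataDef: bs     # dataDef that parses acme-bs.html
  plGroup:
    plTask:           # task name is an assumption for illustration
      dataDef: pl     # dataDef that parses acme-pl.html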
Prefix
Web pages use absolute or relative links. The example page acme-snapshot-links.html uses relative links, as shown below.
defs/examples/fin/page/acme-snapshot-links.html
<!-- links to other pages -->
<div id="page_links">
<li><strong>Financial</strong></li>
<li><a href="acme-bs.html">Balance Sheet</a></li>
<li><a href="acme-pl.html">Profit & Loss</a></li>
</div>
We have to prefix the path /defs/examples/page/ to the link value acme-bs.html, otherwise the loader is not able to load the page. Use the prefix property in an axis to add any prefix to the axis value. In the example, as the scraped link value is in the fact axis, it defines the fact axis as
links:
  axis:
    fact:
      query:
        region: "#page_links > table > tbody > tr > td:nth-child(4) > ul"
        field: "li:nth-child(%{row.index}) > a[href]"
        attribute: "href"
      prefix: [ "/defs/examples/page/" ]
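Assuming plain concatenation of the prefix and the scraped value, as the example suggests, the transformation looks like this:

# scraped href value    : acme-bs.html
# prefix                : /defs/examples/page/
# resulting locator URL : /defs/examples/page/acme-bs.html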
Example 10 combines all the definitions we have used so far (Examples 1 to 9) - links, price, snapshot, bs and pl - into a single job which outputs all the data to the data.txt file.
In the next chapter, we set up a database and use the persistence feature of Scoopi.