So far, we have explored locators, which define the pages to scrape, and dataDef, which defines the data to scrape from those pages. But we haven’t defined how that work is carried out. This chapter introduces Tasks and Steps: a task runs a series of steps that use a dataDef to scrape data from a locator.
Tasks and Task
The file defs/examples/jsoup/ex-5/job.xml defines one task, which applies the dataDef named bs to the locators of the same group.
In job.xml, the tasks are defined inside a <fields> element. As explained in the previous chapter, the namespace declaration xmlns="http://codetab.org/xfields" on fields lets us define the tasks and step fields without the xf: prefix.
<locators group="bs">
    <locator name="Acme" url="/defs/examples/page/acme-bs.html" />
</locators>
<fields name="locator" class="org.codetab.gotz.model.Locator"
    xmlns="http://codetab.org/xfields">
    <tasks name="bs tasks" group="bs">
        <task name="bs" dataDef="bs">
            .... steps ....
        </task>
    </tasks>
</fields>
The group attribute of tasks links the tasks to a group of locators. In the example, the tasks defined for group bs are executed for all locators of the same group.
A tasks element may contain any number of task definitions. Here we define a single task, which applies the dataDef named bs to the locators of the same group.
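To make the group linkage concrete, here is a minimal sketch of a locator group with two entries. The second locator (its name and url) is hypothetical and is not part of ex-5; it is there only to show that one tasks definition is applied to every locator that shares its group.

```xml
<!-- Sketch only: the "Example" locator and its url are hypothetical -->
<locators group="bs">
    <locator name="Acme" url="/defs/examples/page/acme-bs.html" />
    <locator name="Example" url="/defs/examples/page/example-bs.html" />
</locators>

<!-- group="bs" matches the locators group above, so the task
     is applied to both locators -->
<fields name="locator" class="org.codetab.gotz.model.Locator"
    xmlns="http://codetab.org/xfields">
    <tasks name="bs tasks" group="bs">
        <task name="bs" dataDef="bs">
            <!-- steps go here -->
        </task>
    </tasks>
</fields>
```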
Steps
Each task should define the steps that need to be executed.
A task's <steps> element defines a sequence of <step> elements. The steps snippet from job.xml is shown below.
<fields name="locator" class="org.codetab.gotz.model.Locator"
    xmlns="http://codetab.org/xfields">
    <tasks name="bs tasks" group="bs">
        <task name="bs" dataDef="bs">
            <steps name="process steps">
                <step name="seeder"
                    class="org.codetab.gotz.step.extract.LocatorSeeder">
                    <nextStep>loader</nextStep>
                </step>
                <step name="loader"
                    class="org.codetab.gotz.step.extract.URLLoader">
                    <nextStep>parser</nextStep>
                </step>
                <step name="parser"
                    class="org.codetab.gotz.step.extract.JSoupHtmlParser">
                    <nextStep>filter</nextStep>
                </step>
                <step name="filter"
                    class="org.codetab.gotz.step.convert.DataFilter">
                    <nextStep>converter</nextStep>
                </step>
                <step name="converter"
                    class="org.codetab.gotz.step.convert.DataConverter">
                    <nextStep>appender</nextStep>
                </step>
                <step name="appender"
                    class="org.codetab.gotz.step.load.DataAppender">
                    <nextStep>end</nextStep>
                </step>
            </steps>
        </task>
    </tasks>
</fields>
The steps run in the following sequence:
- seeder
- loader
- parser
- filter
- converter
- appender
Each step specifies three things:
- name of the step
- class which handles the step
- name of the next step
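The three parts can be seen on a single step. The fragment below is the loader step from job.xml, with comments added here for illustration only.

```xml
<!-- name: the label that other steps use in their nextStep -->
<step name="loader"
    class="org.codetab.gotz.step.extract.URLLoader">
    <!-- class (above): the fully qualified class that handles this step -->
    <!-- nextStep: the name of the step that runs after this one -->
    <nextStep>parser</nextStep>
</step>
```

Note that the nextStep value parser matches the name attribute of the parser step, which is how the steps are chained together.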
The built-in step classes are:
Step | Description | Package | Class |
---|---|---|---|
seeder | create and seed locators | org.codetab.gotz.step.extract | LocatorSeeder |
loader | load the HTML page that the locator points to | org.codetab.gotz.step.extract | URLLoader |
parser | parse the loaded page | org.codetab.gotz.step.extract | JSoupHtmlParser (uses JSoup) or HtmlParser (uses HtmlUnit) |
filter | filter the parsed data | org.codetab.gotz.step.convert | DataFilter |
converter | convert the data model and apply converters | org.codetab.gotz.step.convert | DataConverter |
appender | encode and append data as output | org.codetab.gotz.step.load | DataAppender |
The class name must be fully qualified, including the package; otherwise a class-not-found error is thrown.
A tasks element can define any number of tasks for a locator group; the next chapter shows how to do that.
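As an illustration of a fully qualified class name, the sketch below switches the parser step from JSoup to HtmlUnit. It assumes, as the table of built-in steps lists, that HtmlParser lives in the same org.codetab.gotz.step.extract package; the rest of the step is unchanged from job.xml.

```xml
<!-- Sketch: parser step using the HtmlUnit-based parser -->
<step name="parser"
    class="org.codetab.gotz.step.extract.HtmlParser">
    <nextStep>filter</nextStep>
</step>
```

Writing only class="HtmlParser", without the package, would be an example of the unqualified form that cannot be resolved.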