So far, we have explored locators, which define the pages to scrape, and dataDef, which defines the data to scrape from those pages. But we haven't defined how that work is to be carried out. This chapter covers Tasks and Steps: a task executes a series of steps that use a dataDef to scrape data from a locator.

 
 

Tasks and Task

The file defs/examples/jsoup/ex-5/job.xml defines one task, which applies the dataDef named bs to the locators of the same group.

In job.xml, the tasks are defined inside a <fields> element. As explained in the previous chapter, the namespace declaration xmlns="http://codetab.org/xfields" on fields allows us to define the tasks and step fields without the xf: prefix.

<locators group="bs">
    <locator name="Acme" url="/defs/examples/page/acme-bs.html" />
</locators>

<fields name="locator" class="org.codetab.gotz.model.Locator"
    xmlns="http://codetab.org/xfields">

    <tasks name="bs tasks" group="bs">
        <task name="bs" dataDef="bs">
            .... steps ....
        </task>
    </tasks>
</fields>

The group attribute of tasks links the tasks to a group of locators. In the example, the tasks defined for group bs are executed for all locators of the same group.

A tasks element may contain any number of task definitions. Here we define a single task, which applies the dataDef named bs to the locators of the same group.
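
As a rough sketch of what more than one task might look like, the fragment below adds a second task to the same tasks element, which would sit inside the <fields> element as before. The second task and its dataDef, both named pl, are hypothetical and not part of ex-5; the steps are elided as in the snippet above.

```xml
<tasks name="bs tasks" group="bs">
    <!-- task from ex-5: applies dataDef bs to the locators of group bs -->
    <task name="bs" dataDef="bs">
        .... steps ....
    </task>
    <!-- hypothetical second task: assumes a dataDef named pl is defined elsewhere -->
    <task name="pl" dataDef="pl">
        .... steps ....
    </task>
</tasks>
```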

 
 

Steps

Each task should define the steps that need to be executed.

The task's <steps> element defines a sequence of <step> elements. The steps snippet from job.xml is shown below.

<fields name="locator" class="org.codetab.gotz.model.Locator"
    xmlns="http://codetab.org/xfields">

    <tasks name="bs tasks" group="bs">
        <task name="bs" dataDef="bs">
            <steps name="process steps">
                <step name="seeder"
                    class="org.codetab.gotz.step.extract.LocatorSeeder">
                    <nextStep>loader</nextStep>
                </step>
                <step name="loader"
                    class="org.codetab.gotz.step.extract.URLLoader">
                    <nextStep>parser</nextStep>
                </step>
                <step name="parser"
                    class="org.codetab.gotz.step.extract.JSoupHtmlParser">
                    <nextStep>filter</nextStep>
                </step>
                <step name="filter"
                    class="org.codetab.gotz.step.convert.DataFilter">
                    <nextStep>converter</nextStep>
                </step>
                <step name="converter"
                    class="org.codetab.gotz.step.convert.DataConverter">
                    <nextStep>appender</nextStep>
                </step>
                <step name="appender"
                    class="org.codetab.gotz.step.load.DataAppender">
                    <nextStep>end</nextStep>
                </step>
            </steps>
        </task>
    </tasks>
</fields>

The sequence of steps is:

  1. seeder
  2. loader
  3. parser
  4. filter
  5. converter
  6. appender

Each step specifies three things:

  • name of the step
  • class which handles the step
  • name of the next step
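
To make the three parts concrete, here is the loader step from the snippet above once more, with comments marking each part. The comments are added for illustration only. Note that the last step in the chain names end as its next step; since no step with that name is defined, it appears to mark the end of the sequence.

```xml
<!-- name:     "loader" is the name other steps use in their <nextStep> -->
<!-- class:    fully qualified class that handles the step -->
<!-- nextStep: name of the step that runs after this one -->
<step name="loader"
    class="org.codetab.gotz.step.extract.URLLoader">
    <nextStep>parser</nextStep>
</step>
```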

The built-in step classes are:

Step       Desc                                     Package                        Class
seeder     create and seed locators                 org.codetab.gotz.step.extract  LocatorSeeder
loader     load HTML page pointed by locator        org.codetab.gotz.step.extract  URLLoader
parser     parse the page loaded                    org.codetab.gotz.step.extract  JSoupHtmlParser to use JSoup, or HtmlParser to use HtmlUnit
filter     filter parsed data                       org.codetab.gotz.step.convert  DataFilter
converter  convert data model and apply converters  org.codetab.gotz.step.convert  DataConverter
appender   encode and append data as output         org.codetab.gotz.step.load     DataAppender

The class name must be fully qualified, including the package; otherwise a class-not-found error is thrown.
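
For instance, going by the list of built-in classes, switching the parser step from JSoup to HtmlUnit would presumably mean changing only the class attribute. This is a sketch that assumes HtmlParser lives in the same org.codetab.gotz.step.extract package as JSoupHtmlParser, as the list suggests; the step name and next step are kept unchanged.

```xml
<!-- parser step using HtmlUnit instead of JSoup (assumed package, see above) -->
<step name="parser"
    class="org.codetab.gotz.step.extract.HtmlParser">
    <nextStep>filter</nextStep>
</step>
```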

A tasks element can define any number of task elements for a locator group; the next chapter shows how to do that.