Steps
So far, we have explored locators to define the pages to scrape, tasks to execute, and dataDef to parse data from the pages. But we haven’t explained how Scoopi executes tasks and scrapes data.
Scoopi executes tasks as a workflow, normally referred to as steps, which in turn consists of multiple individual steps. By default, Scoopi ships with two sets of default steps, jsoupDefault and htmlUnitDefault, which are defined in steps-default.yml. This file is packaged inside the Scoopi distribution jar; its contents can also be viewed in the source repository.
Steps
Let’s go through the jsoupDefault steps to understand the workflow design.
steps-default.yml
    steps:
      jsoupDefault:
        seeder:
          class: "org.codetab.scoopi.step.extract.LocatorSeeder"
          previous: start
          next: loader
        loader:
          class: "org.codetab.scoopi.step.extract.URLLoader"
          previous: seeder
          next: parser
        parser:
          class: "org.codetab.scoopi.step.parse.jsoup.Parser"
          previous: loader
          next: process
        process:
          class: "org.codetab.scoopi.step.convert.DataFilter"
          previous: parser
          next: converter
        converter:
          class: "org.codetab.scoopi.step.convert.DataConverter"
          previous: process
          next: appender
        appender:
          class: "org.codetab.scoopi.step.load.DataAppender"
          previous: converter
          next: end
          plugins: [
            plugin: {
              name: file,
              class: "org.codetab.scoopi.plugin.appender.FileAppender",
              file: "output/data.txt",
              plugins: [
                plugin: {
                  name: csv,
                  delimiter: "|",
                  class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
                }
              ]
            }
          ]
The sequence of steps is:
- seeder
- loader
- parser
- process
- converter
- appender
Each step specifies three things:
- the class to execute for the step
- the name of the previous step
- the name of the next step
For the first step, previous is set to start; for the last step, next is set to end.
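The way previous and next link the steps into a chain can be sketched in a few lines of Java. This is only an illustration of the idea, not Scoopi's own code: `StepChainSketch`, `StepDef`, `order` and `jsoupDefault` are hypothetical names, and the map simply mirrors the previous and next values from the definition above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: these are hypothetical names, not Scoopi classes.
class StepChainSketch {

    // The two links every step declares.
    static final class StepDef {
        final String previous;
        final String next;

        StepDef(String previous, String next) {
            this.previous = previous;
            this.next = next;
        }
    }

    // Returns step names in execution order: begin at the step whose
    // previous is "start" and follow next until it reaches "end".
    static List<String> order(Map<String, StepDef> steps) {
        String current = null;
        for (Map.Entry<String, StepDef> e : steps.entrySet()) {
            if ("start".equals(e.getValue().previous)) {
                current = e.getKey();
                break;
            }
        }
        List<String> ordered = new ArrayList<>();
        while (current != null && !"end".equals(current)) {
            StepDef def = steps.get(current);
            // Guard against a next that names an undefined step, or a cycle.
            if (def == null || ordered.contains(current)) {
                throw new IllegalStateException("broken step chain at: " + current);
            }
            ordered.add(current);
            current = def.next;
        }
        return ordered;
    }

    // previous/next values copied from the jsoupDefault definition; a HashMap
    // is used so the order comes from the links, not from insertion order.
    static Map<String, StepDef> jsoupDefault() {
        Map<String, StepDef> steps = new HashMap<>();
        steps.put("appender", new StepDef("converter", "end"));
        steps.put("parser", new StepDef("loader", "process"));
        steps.put("seeder", new StepDef("start", "loader"));
        steps.put("converter", new StepDef("process", "appender"));
        steps.put("loader", new StepDef("seeder", "parser"));
        steps.put("process", new StepDef("parser", "converter"));
        return steps;
    }

    public static void main(String[] args) {
        // prints [seeder, loader, parser, process, converter, appender]
        System.out.println(order(jsoupDefault()));
    }
}
```

The sketch derives the execution order purely from the links, which is why the order in which steps appear in the definition file does not need to matter in this model.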
The built-in step classes are:
Step type | Description | Class |
---|---|---|
seeder | create and seed locators | org.codetab.scoopi.step.extract.LocatorSeeder |
loader | load HTML page | org.codetab.scoopi.step.extract.URLLoader |
parser | parse using JSoup | org.codetab.scoopi.step.parse.jsoup.Parser |
parser | parse with HtmlUnit | org.codetab.scoopi.step.parse.htmlunit.Parser |
process | filter parsed data | org.codetab.scoopi.step.convert.DataFilter |
converter | convert data and apply converters | org.codetab.scoopi.step.convert.DataConverter |
appender | encode and append data as output | org.codetab.scoopi.step.load.DataAppender |
The class name must be fully qualified, including the package; otherwise a class not found error is thrown.
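The need for a fully qualified name is typical of loading classes by name in Java. The sketch below assumes a lookup along the lines of `Class.forName`; whether Scoopi resolves step classes in exactly this way is an assumption. `ClassNameSketch` and `canLoad` are hypothetical names, and a JDK class stands in for a step class so the example runs without Scoopi on the classpath.

```java
// Illustrative only: ClassNameSketch is a hypothetical name, not a Scoopi class.
class ClassNameSketch {

    // Returns true if a class with the given name can be loaded.
    static boolean canLoad(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            // A name without its package cannot be resolved.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canLoad("java.util.ArrayList")); // fully qualified: true
        System.out.println(canLoad("ArrayList"));           // package missing: false
    }
}
```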
Plugins
The last step, appender, is a bit more interesting. It uses the FileAppender plugin to append data to the output file, which in turn uses another plugin, CsvEncoder, to encode the data into a string delimited with the | character before the output is sent to the file.
    appender:
      class: "org.codetab.scoopi.step.load.DataAppender"
      previous: converter
      next: end
      plugins: [
        plugin: {
          name: file,
          class: "org.codetab.scoopi.plugin.appender.FileAppender",
          file: "output/data.txt",
          plugins: [
            plugin: {
              name: csv,
              delimiter: "|",
              class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
            }
          ]
        }
      ]
The plugin framework allows Scoopi to read configuration from the definition file and execute any plugin class without modifying the source code. As of now, Scoopi ships with the following plugins.
Plugin type | Description | Plugin class |
---|---|---|
encoder | encodes data as csv | org.codetab.scoopi.plugin.encoder.CsvEncoder |
appender | appends data to file | org.codetab.scoopi.plugin.appender.FileAppender |
appender | appends data to ListArray | org.codetab.scoopi.plugin.appender.ListAppender |
converter | change date format | org.codetab.scoopi.plugin.converter.DateFormater |
converter | roll date and change format | org.codetab.scoopi.plugin.converter.DateRoller |
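The idea behind configuration-driven plugins can be sketched as follows. This is a minimal illustration under stated assumptions, not Scoopi's implementation: the `Encoder` interface, `DelimitedEncoder`, `createEncoder` and `csvConfig` are all hypothetical, and the real plugin contracts in Scoopi may look quite different. The config map plays the role of a plugin entry with its class and delimiter keys.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not Scoopi's API.
class PluginSketch {

    // Assumed plugin contract: receive the config, then encode fields.
    interface Encoder {
        void configure(Map<String, String> config);

        String encode(List<String> fields);
    }

    // Stand-in encoder that joins fields with the configured delimiter.
    public static class DelimitedEncoder implements Encoder {
        private String delimiter = ",";

        @Override
        public void configure(Map<String, String> config) {
            delimiter = config.getOrDefault("delimiter", ",");
        }

        @Override
        public String encode(List<String> fields) {
            return String.join(delimiter, fields);
        }
    }

    // Instantiates whichever class the config names and passes it the config,
    // so a different encoder can be swapped in without touching this code.
    static Encoder createEncoder(Map<String, String> config) {
        try {
            Class<?> clz = Class.forName(config.get("class"));
            Encoder encoder = (Encoder) clz.getDeclaredConstructor().newInstance();
            encoder.configure(config);
            return encoder;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("unable to create plugin", e);
        }
    }

    // Mirrors the name, delimiter and class keys of a plugin entry.
    static Map<String, String> csvConfig() {
        Map<String, String> config = new HashMap<>();
        config.put("name", "csv");
        config.put("delimiter", "|");
        config.put("class", DelimitedEncoder.class.getName());
        return config;
    }

    public static void main(String[] args) {
        Encoder encoder = createEncoder(csvConfig());
        // prints a|b|c
        System.out.println(encoder.encode(Arrays.asList("a", "b", "c")));
    }
}
```

Because the class to run comes from the configuration, changing the class or delimiter value changes the behaviour without recompiling the caller, which is the property the plugin framework relies on.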
The next chapter shows how to override default steps and use converter plugins to format dates.