Steps

So far, we have explored locators, tasks and dataDef to scrape data from pages, but we haven’t explained how Scoopi executes tasks and scrapes data.

Scoopi executes tasks as a workflow, referred to as steps, which in turn consists of multiple individual steps. Scoopi ships with two built-in default workflows, jsoupDefault and htmlUnitDefault. They are defined in steps-default.yml, which is packaged inside the Scoopi distribution jar. The contents of steps-default.yml can also be viewed in the source.

Let’s go through jsoupDefault to understand the workflow design.

steps-default.yml

steps: 
  jsoupDefault:
    seeder:
      class: "org.codetab.scoopi.step.extract.LocatorSeeder"
      previous: start
      next: loader
    loader:
      class: "org.codetab.scoopi.step.extract.PageLoader"
      previous: seeder
      next: parser
    parser:
      class: "org.codetab.scoopi.step.parse.jsoup.Parser"
      previous: loader
      next: filter
    filter:
      class: "org.codetab.scoopi.step.process.DataFilter"
      previous: parser
      next: appender    
    appender:
      class: "org.codetab.scoopi.step.load.DataAppender"
      previous: filter
      next: end
      plugins: [
        plugin: { 
          name: dataFile, 
          class: "org.codetab.scoopi.plugin.appender.FileAppender",
          file: "output/data.txt", 
          plugins: [ 
             plugin: { 
               name: csv,
               delimiter: "|",               
               class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
             } 
          ]
        }          
      ]

The sequence of steps is

  1. seeder
  2. loader
  3. parser
  4. filter
  5. appender

Each step specifies three properties:

  • the class to execute for the step
  • the name of the previous step
  • the name of the next step

For the first step, previous is set to start, and for the last step, next is set to end.
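
The previous/next linkage can be illustrated with a short Python sketch. It is not Scoopi's implementation (Scoopi is a Java application); the STEPS dictionary and the execution_order function are hypothetical names introduced here only to show how the run order follows from the start and end markers.

```python
# Hypothetical in-memory form of the jsoupDefault definition:
# step name -> its previous/next links (class names omitted for brevity).
STEPS = {
    "seeder": {"previous": "start", "next": "loader"},
    "loader": {"previous": "seeder", "next": "parser"},
    "parser": {"previous": "loader", "next": "filter"},
    "filter": {"previous": "parser", "next": "appender"},
    "appender": {"previous": "filter", "next": "end"},
}


def execution_order(steps):
    """Return step names in run order by following previous/next links."""
    # The first step is the one whose previous is the "start" marker.
    first = [name for name, s in steps.items() if s["previous"] == "start"]
    if len(first) != 1:
        raise ValueError("expected exactly one step with previous: start")

    order = []
    current, previous = first[0], "start"
    while current != "end":
        if current not in steps:
            raise ValueError(f"unknown step: {current}")
        if current in order:
            raise ValueError(f"cycle detected at step: {current}")
        step = steps[current]
        if step["previous"] != previous:
            raise ValueError(f"step {current} has an inconsistent previous link")
        order.append(current)
        previous, current = current, step["next"]
    return order


print(execution_order(STEPS))
# prints ['seeder', 'loader', 'parser', 'filter', 'appender']
```

The sketch also rejects a chain whose links do not agree with each other, which is the kind of mistake that is easy to make when editing previous and next by hand.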

The built-in step classes are:

Step type  Description                       Class
seeder     create and seed locators          org.codetab.scoopi.step.extract.LocatorSeeder
loader     load HTML page                    org.codetab.scoopi.step.extract.URLLoader
parser     parse using JSoup                 org.codetab.scoopi.step.parse.jsoup.Parser
parser     parse with HtmlUnit               org.codetab.scoopi.step.parse.htmlunit.Parser
filter     filter parsed data                org.codetab.scoopi.step.process.DataFilter
appender   encode and append data as output  org.codetab.scoopi.step.load.DataAppender

The class name must be fully qualified, including the package; otherwise a class-not-found error is thrown.
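
Why the package part matters can be shown with a Python analogy. This is only an illustration of looking up a class by its dotted name; it does not show how Scoopi loads step classes, and load_class is a hypothetical helper that uses the standard importlib module.

```python
import importlib


def load_class(qualified_name):
    """Load a class from a dotted name such as 'package.module.ClassName'."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        # Without a package part there is nothing that says where the class lives.
        raise ImportError(f"not a fully qualified name: {qualified_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A fully qualified name resolves to the class.
print(load_class("collections.OrderedDict"))

# A bare class name cannot be resolved.
try:
    load_class("OrderedDict")
except ImportError as error:
    print("lookup failed:", error)
```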

Plugins

The last step, appender, is a bit more interesting. It uses the FileAppender plugin to append data to the output file. FileAppender in turn uses another plugin, CsvEncoder, to encode the data into a string delimited with the | character before the output is sent to the file.

appender:
      class: "org.codetab.scoopi.step.load.DataAppender"
      previous: filter
      next: end
      plugins: [
        plugin: { 
          name: dataFile, 
          class: "org.codetab.scoopi.plugin.appender.FileAppender",
          file: "output/data.txt", 
          plugins: [ 
             plugin: { 
               name: csv,
               delimiter: "|",               
               class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
             } 
          ]
        }          
      ]
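
The encode-then-append chain configured above can be sketched in Python. This is a minimal illustration of the idea, not the code of CsvEncoder or FileAppender: encode_row and append_line are hypothetical helpers, the rows are made-up sample values, and an in-memory buffer stands in for output/data.txt.

```python
import csv
import io


def encode_row(row, delimiter="|"):
    """Encode one row of values as a single delimited line (no newline)."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="").writerow(row)
    return buffer.getvalue()


def append_line(sink, line):
    """Append an already encoded line to a file-like sink."""
    sink.write(line + "\n")


# Hypothetical rows standing in for the filtered data.
rows = [["item-a", "2019", "10"], ["item-b", "2019", "20"]]

sink = io.StringIO()  # stands in for output/data.txt
for row in rows:
    # Encoding happens first; the appender only sees the finished string.
    append_line(sink, encode_row(row))

print(sink.getvalue(), end="")
# prints:
# item-a|2019|10
# item-b|2019|20
```

Keeping the encoder separate from the appender means the delimiter or the output target can be changed independently, which mirrors the nesting of the csv plugin inside the dataFile plugin.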

The plugin framework allows Scoopi to read configuration from the definition file and execute any plugin class without modifying the source code. Scoopi ships with the following plugins:

Plugin type  Description                      Plugin class
encoder      encodes data as csv              org.codetab.scoopi.plugin.encoder.CsvEncoder
appender     appends data to file             org.codetab.scoopi.plugin.appender.FileAppender
appender     appends data to ListArray        org.codetab.scoopi.plugin.appender.ListAppender
converter    change date format               org.codetab.scoopi.plugin.converter.DateFormater
converter    roll date and change format      org.codetab.scoopi.plugin.converter.DateRoller
script       run JavaScript to modify data    org.codetab.scoopi.plugin.script.DataScript

In subsequent chapters, we explain how to override the default steps or add new steps and plugins. The next chapter uses a converter plugin to format dates.