Steps

So far, we have explored locators, tasks and dataDef to scrape data from pages, but we haven’t explained how Scoopi executes tasks and scrapes the data.

Scoopi executes each task as a workflow, referred to as steps, which consists of multiple individual steps. Scoopi ships with two built-in default steps, jsoupDefault and htmlUnitDefault. They are defined in steps-default.yml, which is packaged inside the Scoopi distribution jar. The contents of steps-default.yml can be viewed in the source.

Let’s go through jsoupDefault to understand the workflow design.

steps-default.yml

steps: 
  jsoupDefault:
    seeder:
      class: "org.codetab.scoopi.step.extract.LocatorSeeder"
      previous: start
      next: loader
    loader:
      class: "org.codetab.scoopi.step.extract.PageLoader"
      previous: seeder
      next: parser
    parser:
      class: "org.codetab.scoopi.step.parse.jsoup.Parser"
      previous: loader
      next: filter
    filter:
      class: "org.codetab.scoopi.step.process.DataFilter"
      previous: parser
      next: appender    
    appender:
      class: "org.codetab.scoopi.step.load.DataAppender"
      previous: filter
      next: end
      plugins: [
        plugin: { 
          name: dataFile, 
          class: "org.codetab.scoopi.plugin.appender.FileAppender",
          file: "output/data.txt", 
          plugins: [ 
             plugin: { 
               name: csv,
               delimiter: "|",               
               class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
             } 
          ]
        }          
      ]

The sequence of steps is

  1. seeder
  2. loader
  3. parser
  4. filter
  5. appender

Each step specifies three properties

  • class that has to be executed for the step
  • name of the previous step
  • name of the next step

For the first step, previous is set to start, and for the last step, next is set to end.
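
The previous and next links are enough to recover the execution order, whatever order the steps appear in the file. The following is a minimal sketch of that idea and not Scoopi's actual implementation: StepChain, Link, order and jsoupDefault are hypothetical names introduced only for illustration, and the links are copied by hand from the jsoupDefault definition above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Scoopi: orders steps by following their links.
class StepChain {

    // The previous/next pair of one step definition.
    static final class Link {
        final String previous;
        final String next;

        Link(String previous, String next) {
            this.previous = previous;
            this.next = next;
        }
    }

    // Start at the step whose previous is "start", then follow next until "end".
    static List<String> order(Map<String, Link> steps) {
        String current = null;
        for (Map.Entry<String, Link> entry : steps.entrySet()) {
            if ("start".equals(entry.getValue().previous)) {
                current = entry.getKey();
                break;
            }
        }
        List<String> ordered = new ArrayList<>();
        while (current != null && !"end".equals(current)) {
            if (ordered.contains(current)) {
                throw new IllegalStateException("cycle at step: " + current);
            }
            Link link = steps.get(current);
            if (link == null) {
                throw new IllegalStateException("undefined step: " + current);
            }
            ordered.add(current);
            current = link.next;
        }
        return ordered;
    }

    // Links of jsoupDefault, deliberately listed out of order.
    static Map<String, Link> jsoupDefault() {
        Map<String, Link> steps = new LinkedHashMap<>();
        steps.put("appender", new Link("filter", "end"));
        steps.put("parser", new Link("loader", "filter"));
        steps.put("seeder", new Link("start", "loader"));
        steps.put("filter", new Link("parser", "appender"));
        steps.put("loader", new Link("seeder", "parser"));
        return steps;
    }

    public static void main(String[] args) {
        // prints [seeder, loader, parser, filter, appender]
        System.out.println(order(jsoupDefault()));
    }
}
```

Because the order comes from the links and not from the position in the file, a step can be placed anywhere in the definition as long as its previous and next names are consistent.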

The built-in step classes are

Step type   Description                         Class
seeder      create and seed locators            org.codetab.scoopi.step.extract.LocatorSeeder
loader      load HTML page                      org.codetab.scoopi.step.extract.URLLoader
parser      parse using JSoup                   org.codetab.scoopi.step.parse.jsoup.Parser
parser      parse with HtmlUnit                 org.codetab.scoopi.step.parse.htmlunit.Parser
filter      filter parsed data                  org.codetab.scoopi.step.process.DataFilter
appender    encode and append data as output    org.codetab.scoopi.step.load.DataAppender

The class name must be fully qualified, including the package; otherwise, a class not found error is thrown.
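
The sketch below shows why the package matters, assuming that step classes are looked up by name through Java reflection (an assumption about Scoopi's internals, not something stated here). ClassNameLookup and resolves are hypothetical names, and java.util.ArrayList stands in for a step class so that the example runs without Scoopi on the classpath.

```java
// Hypothetical sketch: resolving a class name taken from a definition file.
class ClassNameLookup {

    // Returns true when the name resolves to a class on the classpath.
    static boolean resolves(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            // A simple name without its package ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolves("java.util.ArrayList")); // prints true
        System.out.println(resolves("ArrayList"));           // prints false
    }
}
```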

Plugins

The last step, appender, is a bit more interesting. It uses the FileAppender plugin to append data to the output file. FileAppender in turn uses another plugin, CsvEncoder, to encode the data into a string delimited with the | character before the output is sent to the file.

appender:
      class: "org.codetab.scoopi.step.load.DataAppender"
      previous: filter
      next: end
      plugins: [
        plugin: { 
          name: dataFile, 
          class: "org.codetab.scoopi.plugin.appender.FileAppender",
          file: "output/data.txt", 
          plugins: [ 
             plugin: { 
               name: csv,
               delimiter: "|",               
               class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
             } 
          ]
        }          
      ]
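
The nesting in this definition, an encoder plugin inside an appender plugin, can be pictured as two small functions applied one after the other. The following is only a sketch of that arrangement under simplifying assumptions: AppenderSketch, encode and append are hypothetical names, the sample values are made up, and the real CsvEncoder and FileAppender may behave differently (for example in how they quote fields or buffer output).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for the encoder and appender plugins.
class AppenderSketch {

    // Encoder stand-in: join the fields of one row with the delimiter (no quoting).
    static String encode(List<String> fields, String delimiter) {
        return String.join(delimiter, fields);
    }

    // Appender stand-in: add one line to the end of the file, creating it if needed.
    static void append(Path file, String line) throws IOException {
        if (file.getParent() != null) {
            Files.createDirectories(file.getParent());
        }
        Files.write(file, Collections.singletonList(line), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file is used so the sketch does not touch output/data.txt.
        Path file = Files.createTempFile("data", ".txt");
        // Made-up sample rows, encoded first and then appended.
        append(file, encode(Arrays.asList("item", "date", "value"), "|"));
        append(file, encode(Arrays.asList("a", "b", "c"), "|"));
        System.out.println(Files.readAllLines(file));
    }
}
```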

The plugin framework allows Scoopi to read configuration from the definition file and execute any plugin class without modifying the source code. Scoopi ships with the following plugins.

Plugin type   Description                      Plugin class
encoder       encodes data as csv              org.codetab.scoopi.plugin.encoder.CsvEncoder
appender      appends data to file             org.codetab.scoopi.plugin.appender.FileAppender
appender      appends data to ListArray        org.codetab.scoopi.plugin.appender.ListAppender
converter     change date format               org.codetab.scoopi.plugin.converter.DateFormater
converter     roll date and change format      org.codetab.scoopi.plugin.converter.DateRoller
script        run JavaScript to modify data    org.codetab.scoopi.plugin.script.DataScript
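
To illustrate how a plugin framework can run a class that is named only in configuration, here is a minimal sketch. It assumes reflection-based loading, which is a common way to do this in Java but is not confirmed here as Scoopi's mechanism. PluginSketch, Encoder, DelimitedEncoder, load and sampleConfig are hypothetical names, and the config map mimics the class and delimiter keys of the csv plugin definition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of config-driven plugin loading; not Scoopi's API.
class PluginSketch {

    // Hypothetical plugin contract; Scoopi's real plugin interfaces may differ.
    interface Encoder {
        String encode(List<String> fields, Map<String, String> config);
    }

    // Hypothetical plugin: joins fields using the configured delimiter.
    static class DelimitedEncoder implements Encoder {
        @Override
        public String encode(List<String> fields, Map<String, String> config) {
            return String.join(config.getOrDefault("delimiter", ","), fields);
        }
    }

    // Instantiate whichever class the config names; the caller never refers to it directly.
    static Encoder load(Map<String, String> config) throws ReflectiveOperationException {
        Class<?> clz = Class.forName(config.get("class"));
        return (Encoder) clz.getDeclaredConstructor().newInstance();
    }

    // Config that mirrors the keys of the csv plugin definition.
    static Map<String, String> sampleConfig() {
        Map<String, String> config = new HashMap<>();
        config.put("name", "csv");
        config.put("delimiter", "|");
        config.put("class", DelimitedEncoder.class.getName());
        return config;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Map<String, String> config = sampleConfig();
        // prints a|b
        System.out.println(load(config).encode(Arrays.asList("a", "b"), config));
    }
}
```

Swapping in a different plugin then only means changing the class value in the config; the loading code stays the same.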

In subsequent chapters, we explain how to override the default steps or add new steps and plugins. The next chapter uses a converter plugin to format dates.