Steps

So far, we have explored locators to define the pages to scrape, tasks to execute, and dataDef to parse data from the pages. But we haven’t yet explained how Scoopi executes tasks and scrapes data.

Scoopi executes each task as a workflow, referred to as steps, which consists of multiple individual steps. By default, Scoopi ships with two sets of steps, jsoupDefault and htmlUnitDefault, defined in steps-default.yml, which is packaged inside the Scoopi distribution jar. The contents of steps-default.yml can be viewed in the source repository.

Steps

Let’s go through the jsoupDefault steps to understand the workflow design.

steps-default.yml

steps: 

  jsoupDefault:
    seeder:
      class: "org.codetab.scoopi.step.extract.LocatorSeeder"
      previous: start
      next: loader
    loader:
      class: "org.codetab.scoopi.step.extract.URLLoader"
      previous: seeder
      next: parser
    parser:
      class: "org.codetab.scoopi.step.parse.jsoup.Parser"
      previous: loader
      next: process
    process:
      class: "org.codetab.scoopi.step.convert.DataFilter"
      previous: parser
      next: converter
    converter:
      class: "org.codetab.scoopi.step.convert.DataConverter"
      previous: process
      next: appender
    appender:
      class: "org.codetab.scoopi.step.load.DataAppender"
      previous: converter
      next: end
      plugins: [
        plugin: { 
          name: file, 
          class: "org.codetab.scoopi.plugin.appender.FileAppender",
          file: "output/data.txt", 
          plugins: [ 
             plugin: { 
               name: csv,
               delimiter: "|",
               class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
             } 
          ]
        }          
      ]

The sequence of steps is:

  1. seeder
  2. loader
  3. parser
  4. process
  5. converter
  6. appender

Each step specifies three things:

  • class that has to be executed for the step
  • name of the previous step
  • name of the next step

For the first step, previous is set to start; for the last step, next is set to end.
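The previous and next links described above form a simple chain, and the order of execution can be recovered by following next from the step whose previous is start. The following is a minimal sketch of that idea, not Scoopi's actual implementation; the StepChain and Link names are hypothetical and only the step names come from steps-default.yml.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StepChain {

    /** Hypothetical holder for the previous/next names of one step. */
    static final class Link {
        final String previous;
        final String next;

        Link(String previous, String next) {
            this.previous = previous;
            this.next = next;
        }
    }

    /** Orders step names by following next links from the step whose previous is "start". */
    static List<String> order(Map<String, Link> steps) {
        // find the first step: the one whose previous is "start"
        String current = null;
        for (Map.Entry<String, Link> e : steps.entrySet()) {
            if ("start".equals(e.getValue().previous)) {
                current = e.getKey();
                break;
            }
        }
        // follow next links until "end" is reached
        List<String> ordered = new ArrayList<>();
        while (current != null && !"end".equals(current)) {
            Link link = steps.get(current);
            if (link == null || ordered.contains(current)) {
                // next points to an undefined step, or the chain loops back
                throw new IllegalStateException("broken step chain at: " + current);
            }
            ordered.add(current);
            current = link.next;
        }
        return ordered;
    }

    public static void main(String[] args) {
        // steps deliberately added out of order; the links alone define the sequence
        Map<String, Link> steps = new LinkedHashMap<>();
        steps.put("appender", new Link("converter", "end"));
        steps.put("parser", new Link("loader", "process"));
        steps.put("seeder", new Link("start", "loader"));
        steps.put("converter", new Link("process", "appender"));
        steps.put("loader", new Link("seeder", "parser"));
        steps.put("process", new Link("parser", "converter"));

        // prints [seeder, loader, parser, process, converter, appender]
        System.out.println(order(steps));
    }
}
```

Because the order comes from the links rather than from the position in the file, a step can be listed anywhere under jsoupDefault as long as its previous and next names are consistent.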

The built-in step classes are:

Step type   Description                          Class
seeder      create and seed locators             org.codetab.scoopi.step.extract.LocatorSeeder
loader      load HTML page                       org.codetab.scoopi.step.extract.URLLoader
parser      parse using JSoup                    org.codetab.scoopi.step.parse.jsoup.Parser
parser      parse with HtmlUnit                  org.codetab.scoopi.step.parse.htmlunit.Parser
process     filter parsed data                   org.codetab.scoopi.step.convert.DataFilter
converter   convert data and apply converters    org.codetab.scoopi.step.convert.DataConverter
appender    encode and append data as output     org.codetab.scoopi.step.load.DataAppender

The class name must be fully qualified, including the package; otherwise a class-not-found error is thrown.
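The reason a short class name fails can be seen with plain Java reflection. The sketch below assumes that step classes are looked up by name in a way similar to Class.forName, which the text does not confirm; it uses a JDK class instead of a Scoopi class so that it runs without Scoopi on the classpath. StepClassLookup and canLoad are hypothetical names.

```java
public class StepClassLookup {

    /** Returns true if the JVM can load a class with exactly this name. */
    static boolean canLoad(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            // no class with this exact name is visible to the class loader
            return false;
        }
    }

    public static void main(String[] args) {
        // fully qualified name: package plus class name
        System.out.println(canLoad("java.util.ArrayList")); // prints true
        // simple name only: the loader has no package to search in
        System.out.println(canLoad("ArrayList"));           // prints false
    }
}
```

In the same way, a step defined with class: "LocatorSeeder" instead of "org.codetab.scoopi.step.extract.LocatorSeeder" would be expected to fail.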

Plugins

The last step, appender, is a bit more interesting. It uses the FileAppender plugin to append data to the output file; FileAppender in turn uses another plugin, CsvEncoder, to encode the data into a string delimited by the | character before the output is sent to the file.

appender:
  class: "org.codetab.scoopi.step.load.DataAppender"
  previous: converter
  next: end
  plugins: [
    plugin: { 
      name: file, 
      class: "org.codetab.scoopi.plugin.appender.FileAppender",
      file: "output/data.txt", 
      plugins: [ 
         plugin: { 
           name: csv,
           delimiter: "|",
           class: "org.codetab.scoopi.plugin.encoder.CsvEncoder"
         } 
      ]
    }          
  ]
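The nesting in this definition, an appender that delegates to an encoder, can be sketched in a few lines of Java. This is an illustration only and does not use the real FileAppender or CsvEncoder API; AppenderSketch, encode and append are hypothetical names, the row values are made up, and the sketch writes to a StringWriter rather than to output/data.txt so that it has no side effects.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

public class AppenderSketch {

    /** Hypothetical encoder: joins the values of one row with the delimiter. */
    static String encode(List<String> row, String delimiter) {
        // no escaping is done if a value itself contains the delimiter
        return String.join(delimiter, row);
    }

    /** Hypothetical appender: encodes each row and appends it as one line. */
    static void append(Writer out, List<List<String>> rows, String delimiter)
            throws IOException {
        for (List<String> row : rows) {
            out.write(encode(row, delimiter));
            out.write(System.lineSeparator());
        }
    }

    public static void main(String[] args) throws IOException {
        // made-up sample rows, for illustration only
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("item-a", "2020-01-31", "10.5"),
                Arrays.asList("item-b", "2020-01-31", "20.0"));

        StringWriter out = new StringWriter();
        append(out, rows, "|");

        // prints:
        // item-a|2020-01-31|10.5
        // item-b|2020-01-31|20.0
        System.out.print(out);
    }
}
```

Passing a java.io.FileWriter opened in append mode instead of the StringWriter would send the same lines to a file.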

The plugin framework allows Scoopi to read configuration from the definition file and execute any plugin class without modifying the source code. As of now, Scoopi ships with the following plugins.

Plugin type   Description                   Plugin class
encoder       encodes data as csv           org.codetab.scoopi.plugin.encoder.CsvEncoder
appender      appends data to file          org.codetab.scoopi.plugin.appender.FileAppender
appender      appends data to ListArray     org.codetab.scoopi.plugin.appender.ListAppender
converter     change date format            org.codetab.scoopi.plugin.converter.DateFormater
converter     roll date and change format   org.codetab.scoopi.plugin.converter.DateRoller

The next chapter shows how to override default steps and use converter plugins to format dates.