Scoopi


October 11, 2018 Maithilish

Scoopi Web Scraper

Scoopi web scraper extracts and transforms data from HTML pages. JSoup and HtmlUnit make it quite easy to scrape web pages in Java, but things get complicated when the data comes from a large number of pages. Some of the challenges in extracting a large set of data from unstructured sources such as HTML pages are:

  • Because the data is unstructured, many queries may be needed to scrape it
  • The data may not be in the desired format, and it has to be filtered and transformed to make it usable
  • The connection may drop during a run, and all the work is lost
  • When the data comes from thousands of pages, performance matters
  • Scraper libraries require proficiency in Java or Python
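To make the "filter and transform" challenge concrete, here is a minimal sketch in plain Java of the kind of cleanup that scraped text often needs. It is not Scoopi code: the class `CleanScrapedValues`, the helper `toNumbers` and the sample values are hypothetical, chosen only to show blank and non-numeric cells being dropped and thousands separators being removed.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example class; not part of Scoopi, JSoup or HtmlUnit.
class CleanScrapedValues {

    // Keeps only the values that parse as plain decimal numbers,
    // after trimming whitespace and removing thousands separators.
    static List<Double> toNumbers(List<String> raw) {
        return raw.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())                   // filter: drop blank cells
                .map(s -> s.replace(",", ""))                // transform: "1,234.50" -> "1234.50"
                .filter(s -> s.matches("-?\\d+(\\.\\d+)?"))  // filter: drop text such as "n/a"
                .map(Double::valueOf)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up cell values, as they might come out of an HTML table.
        List<String> scraped = List.of(" 1,234.50 ", "", "n/a", "42", "-7.5");
        System.out.println(toNumbers(scraped)); // prints [1234.5, 42.0, -7.5]
    }
}
```

Writing and maintaining helpers like this for every field is the sort of work that grows quickly as the number of pages and data items increases.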

Scraping libraries do well at scraping data from a limited set of pages, but they are not meant to handle thousands of pages. Scoopi was developed with these aspects in mind. It is built upon JSoup and HtmlUnit. Some of the features of Scoopi are:

  • Scoopi doesn’t require any programming skills. The task workflow, the pages to scrape and the data structure are defined with a set of YAML definition files. It can be configured to use either JSoup or HtmlUnit as the scraper
  • Queries can be written using either Selectors with JSoup or XPath with HtmlUnit
  • Scoopi persists pages and data to a database so that it can recover from a failed state without repeating the tasks that are already completed
  • Scoopi is a multi-threaded application that processes pages in parallel for maximum throughput
  • It allows you to transform, filter and sort the data

For the complete list of features, see the Scoopi GitHub page.

In this step-by-step guide, we explain the Scoopi definition files in detail through a set of examples. For the sake of clarity, we have split the guide into fourteen-odd pages; however, the overall concept is quite simple and should not take more than a day to learn.

Scoopi Guide

Maithilish

www.codetab.org

maithilish@gmail.com

Table of Contents

Basics
  1. Install Scoopi
  2. Definition files
  3. DataDef
  4. Query, Region and Field
  5. Members and Dynamic Query

Advanced usage
  1. IndexRange and BreakAfter
  2. Filters
  3. Multiple Tasks
  4. Steps and Plugins
  5. Converter
  6. Locators from Links
  7. Persistence
  8. Appender and Encoder
  9. Split Definition Files
  10. Logs
  11. Dashboard