CodeTab Gotz ETL

Gotz ETL is a tool to extract data from HTML pages. In Java, it’s easy to scrape web pages with libraries such as JSoup and HtmlUnit, but the task becomes daunting when we try to scrape data from a huge set of pages.

Some of the challenges in extracting large sets of data from unstructured sources such as HTML pages are:

  • A single web page may hold multiple types of data, so queries should be able to handle different types of data dynamically

  • The network connection may go down in the middle of a run, and the scraper should be able to recover from the failed state

  • Some source pages change frequently and others less often; the scraper should avoid parsing pages that have not changed, otherwise a run may take a very long time to complete

  • Data in unstructured sources such as HTML pages may not be in the desired format, or some of it may be unwanted, so the scraper should be able to filter and transform the values

Scraping libraries such as JSoup and HtmlUnit do well at scraping data, but they are not meant to handle the situations listed above.
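To make the third challenge concrete, here is a minimal sketch of one way to skip pages that have not changed: remember a digest of each page's content and parse again only when the digest differs. This is an illustration under stated assumptions, not a description of how Gotz itself detects change; the class `PageChangeTracker`, its methods and the example URL are hypothetical names, and an in-memory map stands in for a real datastore.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Gotz: remembers a digest of each page's
// content so that unchanged pages can be skipped on later runs.
class PageChangeTracker {

    // URL -> hex digest of the content seen on the previous visit.
    // An in-memory map is an assumption; a real run would persist this.
    private final Map<String, String> seen = new HashMap<>();

    // Returns true when the page is new or its content differs from the
    // stored digest. The new digest is recorded either way.
    boolean needsParsing(String url, String html) {
        String digest = sha256(html);
        String previous = seen.put(url, digest);
        return !digest.equals(previous);
    }

    // Hex-encoded SHA-256 of the given text.
    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PageChangeTracker tracker = new PageChangeTracker();
        String url = "http://example.org/page"; // placeholder URL
        System.out.println(tracker.needsParsing(url, "<p>v1</p>")); // true: first visit
        System.out.println(tracker.needsParsing(url, "<p>v1</p>")); // false: unchanged
        System.out.println(tracker.needsParsing(url, "<p>v2</p>")); // true: content changed
    }
}
```

A digest comparison still requires fetching the page; it only saves the parsing and downstream work, which is usually the expensive part of a large run.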

Gotz is developed with these aspects in mind. It is built upon JSoup and HtmlUnit. The functionality Gotz offers over and above the scraping libraries is:

  • Gotz is completely model driven, like a real ETL tool. The data structure, task workflow and pages to scrape are defined with a set of XML definition files, and no coding is required

  • It can be configured to use either JSoup or HtmlUnit as the scraper

  • Queries can be written using either Selectors with JSoup or XPath with HtmlUnit

  • Gotz persists pages and data to a database so that it can recover from a failed state without repeating tasks already completed

  • For transparent persistence, Gotz uses the JDO standard and DataNucleus AccessPlatform, so you can choose your datastore from a very wide range

  • Gotz is a multithreaded application that processes pages in parallel for maximum throughput. The number of threads allotted to each task pool is configurable based on workload

  • It allows you to transform, filter and sort the data

  • It comes with built-in appenders such as FileAppender, DBAppender and ListAppender

  • GotzEngine can be embedded in other programs, which can access the scraped data through ListAppender

  • The flexible workflow allows you to change the sequence of steps

  • Gotz is extensible. Developers can extend the predefined base steps, or even create new ones with different functionality, and weave them into the workflow
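The persistence and multithreading points in the list above can be pictured with a small sketch: a fixed-size thread pool processes pages in parallel and records each finished page, so that a rerun after a failure only handles what is still outstanding. This is not Gotz code. The class `ResumableRun`, its `run` method and the page names are hypothetical, and a concurrent in-memory set stands in for the database that Gotz reaches through JDO and DataNucleus.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch, not Gotz code: parallel processing with a record of
// completed pages so that a later run can skip them.
class ResumableRun {

    // Stands in for persisted state; an assumption made for illustration.
    private final Set<String> completed = ConcurrentHashMap.newKeySet();

    // Processes every URL not yet completed, using a pool of the given size,
    // and returns how many pages were processed successfully in this run.
    int run(List<String> urls, int threads, Consumer<String> step) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        for (String url : urls) {
            if (completed.contains(url)) {
                continue; // finished in an earlier run, do not repeat
            }
            pool.submit(() -> {
                step.accept(url);      // may throw, e.g. when the connection drops
                completed.add(url);    // reached only when the step succeeded
                processed.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        ResumableRun run = new ResumableRun();
        List<String> urls = List.of("page-1", "page-2", "page-3");

        // First run: the step fails for page-2, simulating a lost connection.
        int first = run.run(urls, 2, url -> {
            if (url.equals("page-2")) {
                throw new IllegalStateException("connection lost");
            }
        });

        // Second run: only page-2 is still outstanding.
        int second = run.run(urls, 2, url -> { });

        System.out.println(first + " then " + second); // prints "2 then 1"
    }
}
```

The pool size is passed as a parameter to mirror the idea of a configurable number of threads per task pool; the value 2 used here is arbitrary.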

Gotz ETL Reference


Maithilish

www.codetab.org

maithilish@gmail.com

Table of Contents

1. Install Gotz

2. Bean and Locator

3. DataDef and Axis

4. Query, Region and Field

5. Members and Dynamic Query

6. IndexRange and BreakAfter

7. Filters

8. Namespace

9. Tasks and Steps

10. Multiple Task and Tasks

11. Converters

12. Create Locators from Links

13. Persistence

14. Appenders and Encoders

15. Split Definition Files