With persistence, Gotz offers the following benefits.

  • reduces network usage by reusing downloaded pages
  • recovers from an aborted run without redoing tasks that have already completed
  • avoids expensive parse operations by reusing persisted data
  • lets you set an expiry date for each page

If you installed Gotz from GitHub, you need to set up a database before running the examples with persistence. We explain how to install and set up HSQLDB, as it is the easiest option. If you installed Gotz from the Docker image, you can skip installing HSQLDB because the image comes with a preconfigured MariaDB instance, as explained in Install Gotz. Otherwise, download HSQLDB version 2.3 and extract it to a folder. To create the database and start the HSQLDB server, run

cd hsqldb-2.3.0/hsqldb
java -cp lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:data/gotz --dbname.0 gotz

This creates a database named gotz and places the database files in the data directory.

Use the HSQLDB client to connect to the running database. To start it, run the following from another console, in the same hsqldb-2.3.0/hsqldb directory

java -cp lib/hsqldb.jar org.hsqldb.util.DatabaseManagerSwing

To connect to the HSQLDB server, enter the connection details as shown in the figure and click OK.

Gotz Persistence HSQLDB client

Enable Persistence in Gotz

Persistence is enabled in Gotz by default, but to run the examples we have disabled it by setting the gotz.useDatastore property to false in the conf/gotz.properties file. To enable persistence, either set this property to true or comment it out with a leading #.
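The effect of setting or commenting out the property can be sketched with java.util.Properties. This is only an illustration, not Gotz's own configuration code: the class DatastoreFlag and its method are hypothetical, and the fallback to true simply mirrors the statement that persistence is enabled by default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; it only illustrates how a missing property
// can fall back to "persistence enabled".
class DatastoreFlag {

    /** Reads gotz.useDatastore from properties text, defaulting to true. */
    static boolean useDatastore(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A line starting with # is a comment, so the key is absent
        // and the default value "true" applies.
        String value = props.getProperty("gotz.useDatastore", "true");
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        System.out.println(useDatastore("gotz.useDatastore=false"));  // false
        System.out.println(useDatastore("gotz.useDatastore=true"));   // true
        System.out.println(useDatastore("#gotz.useDatastore=false")); // true
    }
}
```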

Now run Example-10; the gotz database should then contain the locator, datadef, data and document tables, populated with data.
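As an alternative to the Swing client, the row counts can be checked over JDBC. The following is a minimal sketch under several assumptions: the server runs on localhost, the HSQLDB default account (user SA, empty password) is unchanged, hsqldb.jar is on the classpath, and the tables are reachable under the unquoted names used in the text. The class GotzTableCounts is hypothetical and not part of Gotz.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Hypothetical helper that prints the number of rows in each table.
class GotzTableCounts {

    // Table names as listed in the text; the case used in the schema may differ.
    static final List<String> TABLES = List.of("locator", "datadef", "data", "document");

    /** JDBC URL of an HSQLDB server; dbName is the alias passed as --dbname.0. */
    static String jdbcUrl(String host, String dbName) {
        return "jdbc:hsqldb:hsql://" + host + "/" + dbName;
    }

    /** Builds a count query, accepting only the known table names. */
    static String countQuery(String table) {
        if (!TABLES.contains(table)) {
            throw new IllegalArgumentException("unknown table: " + table);
        }
        return "SELECT COUNT(*) FROM " + table;
    }

    public static void main(String[] args) throws SQLException {
        // Assumption: default HSQLDB credentials, user SA with an empty password.
        try (Connection con = DriverManager.getConnection(jdbcUrl("localhost", "gotz"), "SA", "");
             Statement st = con.createStatement()) {
            for (String table : TABLES) {
                try (ResultSet rs = st.executeQuery(countQuery(table))) {
                    rs.next();
                    System.out.println(table + ": " + rs.getLong(1));
                }
            }
        }
    }
}
```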

 
 

Live

When we first run Example-10 with persistence enabled, the document table, which contains the compressed contents of the pages acme-quote-links.html, acme-bs.html and acme-pl.html, ends up with three rows, and the data table, which holds the parsed data, has five rows: finLinks, price, snapshot, bs and pl. If we run Gotz again and inspect the database tables, the document table has six rows and the data table has ten. In each run, Gotz fetches fresh pages and parses them to get the data, even though persistence is enabled. This is because, by default, the live setting, which controls the expiry of a page, is set to zero days (P0D). So in each run Gotz sees the page as expired and fetches a new one. We can set live on each tasks group to alter this behavior.

Suppose we want to fetch a new quote page every day, but the BS and PL pages only once every three months, as they change less frequently. The defs/examples/jsoup/ex-11/job.xml file sets the live element for each <tasks> as follows

<tasks name="quote tasks" group="quote">
    ... task ...

    <live>P1D</live>
</tasks>

<tasks name="bs tasks" group="bs">
    ... task ...
    <live>P3M</live>
</tasks>

<tasks name="pl tasks" group="pl">
    ... task ...
    <live>P3M</live>
</tasks>

P1D and P3M are ISO 8601 duration representations: one day and three months, respectively.
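To see what these values mean in terms of dates, the standard java.time.Period class can parse them. This is a sketch of the arithmetic only; it does not claim to be how Gotz evaluates live internally, and the class LiveExpiry with its methods is hypothetical.

```java
import java.time.LocalDate;
import java.time.Period;

// Hypothetical helper that turns a live value into an expiry date.
class LiveExpiry {

    /** Date on which a page fetched on fetchDate stops being valid. */
    static LocalDate expiryDate(LocalDate fetchDate, String live) {
        return fetchDate.plus(Period.parse(live));
    }

    /** A page is treated as expired once today reaches the expiry date. */
    static boolean isExpired(LocalDate fetchDate, String live, LocalDate today) {
        return !today.isBefore(expiryDate(fetchDate, live));
    }

    public static void main(String[] args) {
        LocalDate fetched = LocalDate.of(2024, 1, 15);
        System.out.println(expiryDate(fetched, "P1D"));         // 2024-01-16
        System.out.println(expiryDate(fetched, "P3M"));         // 2024-04-15
        System.out.println(isExpired(fetched, "P0D", fetched)); // true
    }
}
```

With P0D the expiry date equals the fetch date, so under this reading the page is already expired in the very next run.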

Now fresh pages are fetched and persisted only in the first run; subsequent runs reuse the persisted pages and data until the pages expire, which speeds up those runs considerably.
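The difference between the default P0D and a longer live value can be simulated with a small in-memory stand-in for the document table. This is a simplified model for illustration, not Gotz's persistence code; the class DocumentTableSketch and its row layout are assumptions.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the document table.
class DocumentTableSketch {

    // One stored copy of a page and the date on which it expires.
    private static final class Row {
        final String page;
        final LocalDate expires;

        Row(String page, LocalDate expires) {
            this.page = page;
            this.expires = expires;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    /** Simulates one run for one page; returns true if a fresh copy was fetched. */
    boolean run(String page, String live, LocalDate today) {
        for (Row row : rows) {
            if (row.page.equals(page) && today.isBefore(row.expires)) {
                return false; // an unexpired copy exists, so it is reused
            }
        }
        // No valid copy: fetch again and store a new row.
        rows.add(new Row(page, today.plus(Period.parse(live))));
        return true;
    }

    int rowCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2024, 1, 15);

        DocumentTableSketch zeroDays = new DocumentTableSketch();
        zeroDays.run("acme-bs.html", "P0D", today);
        zeroDays.run("acme-bs.html", "P0D", today);
        System.out.println(zeroDays.rowCount()); // 2: every run stores a new copy

        DocumentTableSketch threeMonths = new DocumentTableSketch();
        threeMonths.run("acme-bs.html", "P3M", today);
        threeMonths.run("acme-bs.html", "P3M", today.plusDays(30));
        System.out.println(threeMonths.rowCount()); // 1: the second run reuses the copy
    }
}
```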

Persist Setting

We can further control whether locators and data are persisted using the persist element.

For example, there is no need to store link data in the database, as it has little relevance other than to create new locators. So, to disable persistence of links, we set persist/data to false in the fin links task.

<tasks name="quote tasks" group="quote">
    <task name="fin links" dataDef="finLinks">
        <steps ref="commonSteps" >
             ... local step ...
        </steps>
        <persist>
            <data>false</data>
        </persist>
    </task>
</tasks>

Similarly, we can set persist/locator on a tasks element, which disables persistence of that group's locators and their pages. Note that persist/data is set in the task element, whereas persist/locator is set in the tasks element.
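How such a flag might be read from a task definition can be sketched with the standard javax.xml.xpath API. The class PersistFlag is hypothetical and this is not Gotz's parser; treating a missing persist/data element as true is an assumption, based on data being persisted unless the flag is set to false.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical reader for the persist/data flag of a single <task> element.
class PersistFlag {

    /** Returns false only when persist/data is present and not "true". */
    static boolean persistData(String taskXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate returns an empty string when no node matches.
            String value = xpath
                    .evaluate("/task/persist/data", new InputSource(new StringReader(taskXml)))
                    .trim();
            // Assumption: a missing element means the data is persisted.
            return value.isEmpty() || Boolean.parseBoolean(value);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid task XML", e);
        }
    }

    public static void main(String[] args) {
        String finLinks = "<task name=\"fin links\" dataDef=\"finLinks\">"
                + "<persist><data>false</data></persist></task>";
        String price = "<task name=\"price\" dataDef=\"price\"/>";
        System.out.println(persistData(finLinks)); // false
        System.out.println(persistData(price));    // true
    }
}
```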

 
 

Use Gotz with other Databases

Gotz uses JDO and DataNucleus AccessPlatform, which work with a variety of databases. Gotz is tested with MySQL, MariaDB and MongoDB, and the configuration for each of them is in the conf/jdoconfig.properties file.

In the next chapter, we look at appenders and persist the output data to a database.