Persistence
With persistence enabled, Scoopi offers the following benefits:
- reduces network usage, as it can reuse the downloaded pages
- recovers from an aborted run without redoing the tasks already completed
- avoids expensive parse operations, as it can reuse the persisted data
- lets you set an expiry period for each page
Setup Database
If you are running Scoopi from the Docker image, no database setup is required, as the Docker image contains a pre-configured MariaDB container. Refer to Scoopi with MariaDB to run Scoopi with MariaDB using docker-compose.
If you are using the release package from GitHub to run Scoopi, you have to set up a MariaDB or HSQLDB database manually. See HSQLDB Setup to set up and configure an HSQLDB database.
Enable Persistence in Scoopi
By default, persistence is disabled in Scoopi through the configuration setting scoopi.useDatastore=false in the conf/scoopi.properties file. To enable persistence, either set the property to scoopi.useDatastore=true or comment it out with a leading hash (#).
Now run Example 10; the scoopi database should then contain the locator, datadef, data and document tables, populated with data.
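The effect of commenting out the property can be illustrated with a minimal Java sketch. It is not Scoopi's own code: the class DatastoreFlag and its method are hypothetical, and the fallback to true when the property is absent is an assumption inferred from the statement that commenting the setting out enables persistence. The only library behavior relied on is that java.util.Properties ignores lines starting with a hash.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, for illustration only; not part of Scoopi.
final class DatastoreFlag {

    // Reads scoopi.useDatastore from properties text.
    // Assumption: when the property is absent, persistence is treated as enabled.
    static boolean useDatastore(String propertiesText) {
        Properties props = new Properties();
        try {
            // Lines beginning with '#' are comments and are skipped by load().
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Boolean.parseBoolean(props.getProperty("scoopi.useDatastore", "true"));
    }

    public static void main(String[] args) {
        System.out.println(useDatastore("scoopi.useDatastore=false\n"));  // false
        System.out.println(useDatastore("scoopi.useDatastore=true\n"));   // true
        System.out.println(useDatastore("#scoopi.useDatastore=false\n")); // true
    }
}
```

The third call shows the commented-out case: the line is ignored, so the assumed default applies.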
Live
When we first run Example 10 with persistence enabled, the document table, which holds the compressed contents of the pages acme-quote-links.html, acme-bs.html and acme-pl.html, ends up with three rows, and the data table, which holds the parsed data, has five rows: finLinks, price, snapshot, bs and pl.
If we run Scoopi again and query the tables, the document table has six rows and the data table has ten. On each run Scoopi fetches fresh pages and parses them to create data, even though persistence is enabled. This is because the live setting, which controls page expiry, defaults to zero days (P0D), so on every run Scoopi treats the page as expired and fetches a new one. Use the live property in the task group to alter this default behavior.
Suppose we want to fetch a new quote page only once a week. Example 11 sets the live property to P1W for quoteGroup, as shown below.
taskGroups:
  quoteGroup:
    priceTask:
      dataDef: price
    snapshotTask:
      dataDef: snapshot
    live: P1W
P0D and P1W are ISO 8601 representations of a duration: zero days and one week respectively.
Now fresh pages are fetched and persisted only on the first run; on subsequent runs the persisted pages and data are reused until the pages expire. This speeds up the run considerably, since neither the download nor the parse is repeated.
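To make the expiry rule concrete, here is a minimal Java sketch of how a live value such as P0D or P1W could be turned into an expiry decision. It is an illustration under stated assumptions, not Scoopi's implementation: the class LiveExpiry and the method isExpired are hypothetical, expiry is computed at whole-day granularity, and java.time.Period is used, which accepts only date-based ISO 8601 values.

```java
import java.time.LocalDate;
import java.time.Period;

// Hypothetical helper, for illustration only; not part of Scoopi.
final class LiveExpiry {

    // Assumption: a page is expired once today is on or after (fetch date + live).
    static boolean isExpired(LocalDate fetchedOn, String live, LocalDate today) {
        Period period = Period.parse(live);            // e.g. "P0D" or "P1W"
        LocalDate expiresOn = fetchedOn.plus(period);
        return !today.isBefore(expiresOn);
    }

    public static void main(String[] args) {
        LocalDate fetched = LocalDate.of(2024, 1, 1);

        // P0D: the page expires the moment it is fetched, so every run refetches.
        System.out.println(isExpired(fetched, "P0D", fetched));                   // true

        // P1W: three days later the persisted page is still usable.
        System.out.println(isExpired(fetched, "P1W", LocalDate.of(2024, 1, 4)));  // false

        // P1W: one week later the page is expired and would be fetched again.
        System.out.println(isExpired(fetched, "P1W", LocalDate.of(2024, 1, 8)));  // true
    }
}
```

With P0D the expiry date equals the fetch date, which matches the default behavior described above where each run sees the page as expired.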
Persist Control
We can further control the persistence of locators and data using the persist properties.
In conf/scoopi.properties we can set the following properties:
scoopi.useDatastore=true|false
scoopi.persist.locator=true|false
scoopi.persist.data=true|false
If scoopi.useDatastore is false, there is no need to start the database, as nothing is stored. If scoopi.useDatastore is true, then:
- if scoopi.persist.locator is false, no locators or their documents are stored
- if scoopi.persist.data is false, no data is stored
If scoopi.persist.data is true, we can additionally control whether to persist the data of each task using the persist/data property in the task definition, as Example 12 does:
taskGroups:
  quoteGroup:
    priceTask:
      dataDef: price
      persist:
        data: true
    snapshotTask:
      dataDef: snapshot
      persist:
        data: false
The data parsed by priceTask is stored, but the data parsed by snapshotTask is not, as persist/data is false for that task.
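The way the global and task-level settings combine can be summarized in a short Java sketch. This is only a reading of the rules described in this section, not Scoopi's actual code: the class PersistDecision and its method are hypothetical, and the defaults used when a property or the task-level setting is absent (true in each case) are assumptions.

```java
import java.util.Properties;

// Hypothetical helper, for illustration only; not part of Scoopi.
final class PersistDecision {

    // conf: contents of conf/scoopi.properties.
    // taskPersistData: the task's persist/data value, or null when the task does not set it.
    static boolean shouldPersistData(Properties conf, Boolean taskPersistData) {
        // Assumption: both global properties default to true when absent.
        boolean useDatastore =
                Boolean.parseBoolean(conf.getProperty("scoopi.useDatastore", "true"));
        boolean persistData =
                Boolean.parseBoolean(conf.getProperty("scoopi.persist.data", "true"));

        // Global switches win: nothing is stored if either is false.
        if (!useDatastore || !persistData) {
            return false;
        }
        // Assumption: a task without persist/data follows the global setting.
        return taskPersistData == null || taskPersistData;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("scoopi.useDatastore", "true");
        conf.setProperty("scoopi.persist.data", "true");

        System.out.println(shouldPersistData(conf, true));   // priceTask: true
        System.out.println(shouldPersistData(conf, false));  // snapshotTask: false

        conf.setProperty("scoopi.persist.data", "false");
        System.out.println(shouldPersistData(conf, true));   // globally off: false
    }
}
```

The first two calls mirror priceTask and snapshotTask from Example 12; the last shows that a task-level true cannot override a global scoopi.persist.data=false.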
Use Scoopi with other Databases
Scoopi uses JDO and DataNucleus AccessPlatform, which works with a variety of databases. Scoopi is tested with MySQL, MariaDB and MongoDB, and the configuration for each of them is in the conf/jdoconfig.properties file.
In the next chapter we look at appenders.