Scoopi Installation and Quick Start
The easiest way to use Scoopi is to pull the Docker image from DockerHub, which comes with a pre-configured MariaDB, and run it straight away. If you are not using Docker, download the release from GitHub. We explain both options here.
Install Scoopi from Docker Image
Scoopi releases are available as a Docker image from DockerHub. To run the image you need Docker installed on the system; to run it with a database, you also need Docker Compose. The Scoopi Docker image is about 120MB to download and the MariaDB image about 130MB.
The following command downloads the Scoopi image, creates a container named scoopi and runs it.
docker run --name scoopi codetab/scoopi
It executes example 1 and outputs one line of data to output/data.txt.
However, we can neither view the output file nor modify the conf files, as they are inside the container. To overcome this, we need to externalize these folders with the following commands.
mkdir scoopi
cd scoopi
docker cp scoopi:/scoopi/conf .
docker cp scoopi:/scoopi/output .
docker cp scoopi:/scoopi/docker .
docker cp scoopi:/scoopi/defs .
docker cp scoopi:/scoopi/logs .
Here, we make a folder named scoopi and then copy the conf, output, docker, defs and logs folders from the container into it. This allows us to modify the conf and defs files, and to view the output file, without logging into the container. We can now remove the container, as we are going to recreate it with a new set of parameters.
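The five copy commands can also be written as a loop. The following is a minimal sketch, not part of Scoopi: externalize is a hypothetical helper name, and DRY_RUN is an illustrative variable that prints the docker commands instead of running them.

```shell
# externalize CONTAINER DIR...
# Copies each /scoopi/DIR from CONTAINER into the current directory.
# Hypothetical helper; set DRY_RUN=1 to print the commands instead of
# running them.
externalize() {
    container=$1
    shift
    for dir in "$@"; do
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set
        ${DRY_RUN:+echo} docker cp "$container:/scoopi/$dir" . || return 1
    done
}

# Usage, from inside the new scoopi folder:
# externalize scoopi conf output docker defs logs
```

The loop stops at the first failed copy, so a missing folder in the container is reported instead of being silently skipped.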
docker rm scoopi
Let’s run example 10 to output more data. To do that, edit the conf/scoopi.properties file, change the defs directory property to scoopi.defs.dir=/defs/examples/jsoup/ex-10 and run Scoopi.
docker run --name scoopi --rm -p 9010:9010 -v "$PWD"/defs:/scoopi/defs -v "$PWD"/conf:/scoopi/conf -v "$PWD"/output:/scoopi/output codetab/scoopi
The above command mounts the externalized folders using the -v option. When the container runs, it uses the definitions from jsoup/ex-10, and on completion we should have a new data.txt file in the output folder with 281 lines of data.
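The edit to conf/scoopi.properties that precedes the run can be scripted instead of made by hand. The sketch below assumes a plain key=value properties file; set_property is a hypothetical helper, not something Scoopi ships, and it assumes values contain no | or & characters.

```shell
# set_property FILE KEY VALUE
# Replaces the value of KEY in a key=value properties FILE, or appends
# KEY=VALUE when the key is absent. Hypothetical helper; assumes VALUE
# contains no '|' or '&' characters.
set_property() {
    file=$1 key=$2 value=$3
    # Escape dots so that "scoopi.defs.dir" matches literally
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^${key_re}=" "$file"; then
        # Use '|' as the sed delimiter because values contain '/'
        sed "s|^${key_re}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s\n' "${key}=${value}" >> "$file"
    fi
}

# Usage, from the externalized scoopi folder:
# set_property conf/scoopi.properties scoopi.defs.dir /defs/examples/jsoup/ex-10
```

Writing to a temporary file and moving it into place avoids relying on sed -i, whose syntax differs between GNU and BSD sed.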
Scoopi comes with an Angular dashboard that displays internal metrics of the app. It can be accessed at http://localhost:9010 while Scoopi is running.
Scoopi with MariaDB
To use MariaDB as the datastore, we need Docker Compose, which runs Scoopi and MariaDB in separate containers. First, move docker-compose.yml to the scoopi folder.
cd scoopi
mv docker/docker-compose.yml .
Next, edit conf/scoopi.properties and change the property scoopi.useDatastore=false to scoopi.useDatastore=true. Once the configuration is ready, start the database.
docker-compose up scoopi-db
Docker downloads the latest MariaDB image and runs it as a container. On first run, it also creates and initializes the database, users and privileges. It creates a new folder named data, which contains the MariaDB data files.
If the MySQL client is installed on your system, you can log into the database with
mysql -pbar -u foo -h 127.0.0.1 -P 3306 scoopi
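Because the database takes some time to initialize on first run, a connection attempt made too early can fail. One way to handle this is to poll until the client connects. The sketch below is a generic retry loop; retry is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary choices.

```shell
# retry ATTEMPTS DELAY COMMAND...
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY seconds between attempts. Returns 0 on the first success,
# 1 if every attempt fails. Hypothetical helper.
retry() {
    attempts=$1 delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Sleep between attempts, but not after the last one
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Usage: wait up to about a minute for the database to accept logins.
# retry 30 2 mysql -pbar -u foo -h 127.0.0.1 -P 3306 -e 'SELECT 1' scoopi
```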
Now stop the database container with Ctrl+C. With that, the one-time setup is complete; from now on, we start Scoopi with MariaDB using the following command.
docker-compose up --abort-on-container-exit
The above command brings up MariaDB and then Scoopi in separate containers. Scoopi stores locators, documents and parse data in the database.
Install Scoopi from GitHub
Alternatively, install Scoopi either by downloading the release package, which contains all dependencies, or by building the source code with Maven. To run Scoopi with datastore support, we have to manually install and configure a database such as MariaDB or HSQLDB. We explain the HSQLDB installation in a later chapter on persistence.
Download and install the Release package
Download the latest release zip file, scoopi-x.x.x-production.zip, from GitHub Scoopi Releases and extract it to some location.
Download and build the Source
Alternatively, you can download the Scoopi source code zip from GitHub. To build it, extract it somewhere and from the project root folder run
mvn package -DskipTests
Maven compiles the source, downloads the dependencies and packages the app as scoopi-x.x.x-production.zip in the target folder. Extract target/scoopi-x.x.x-production.zip to some location.
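Since the version number in the file name changes with each release, a glob can locate the package before extracting it. This is a minimal sketch: find_release_zip is a hypothetical helper and the destination folder in the usage line is only an example.

```shell
# find_release_zip DIR
# Prints the path of the first scoopi-*-production.zip found in DIR.
# Returns 1 when no such file exists. Hypothetical helper.
find_release_zip() {
    for f in "$1"/scoopi-*-production.zip; do
        # When nothing matches, the unexpanded pattern fails this test
        [ -e "$f" ] && { printf '%s\n' "$f"; return 0; }
    done
    return 1
}

# Usage, from the project root after the Maven build
# ("$HOME/apps" is an example destination):
# zip=$(find_release_zip target) && unzip "$zip" -d "$HOME/apps"
```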
Download and install JRE 8 or above
Scoopi requires JRE 8 or above to run. It is tested with both OpenJDK and Oracle Java SE.
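To confirm that the installed Java meets this requirement, the version string can be checked from the shell. The sketch below relies on the usual convention that Java 8 and earlier report versions as 1.x while later releases report the major number first; java_major is a hypothetical helper.

```shell
# java_major VERSION_STRING
# Prints the major Java version for a version string, for example
# "1.8.0_292" gives 8 and "11.0.2" gives 11. Hypothetical helper.
java_major() {
    case $1 in
        1.*) v=${1#1.}; printf '%s\n' "${v%%.*}" ;;
        *)   printf '%s\n' "${1%%[.+-]*}" ;;
    esac
}

# Check the java found on PATH, if any
if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr; take the quoted version string
    ver=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
    major=$(java_major "$ver")
    if [ "${major:-0}" -ge 8 ] 2>/dev/null; then
        echo "Java $major found"
    else
        echo "Java 8 or above is required" >&2
    fi
else
    echo "java not found on PATH" >&2
fi
```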
Quick start
Go to the extracted folder of scoopi-x.x.x-production.zip. The directory structure is as below.
scoopi-x.x.x/
├── conf
│   ├── scoopi.properties
│   ├── jdoconfig.properties
│   ├── log4j.properties
│   └── logback.xml
├── defs
│   └── examples
│       ├── jsoup
│       ├── htmlunit
│       └── page
├── scoopi.bat
├── scoopi.sh
└── lib
    ├── scoopi-x.x.x.jar
    └── ....
The application jar file scoopi-x.x.x.jar is in the lib folder along with the other dependencies. The conf folder holds the configuration files, and the main configuration file is conf/scoopi.properties. By default, the following two properties are defined.
scoopi.defs.dir=/defs/examples/jsoup/ex-1
scoopi.useDatastore=false
The property scoopi.defs.dir points to example 1, which is loaded when we run Scoopi. The other property, scoopi.useDatastore, is set to false, which allows us to run Scoopi without setting up a database. In a later chapter, we show how to set up a database and use it to persist Scoopi objects. Until then, leave it set to false.
Let’s run Scoopi and check the installation.
cd scoopi-x.x.x
./scoopi.sh    # on Windows, run scoopi.bat
It starts ScoopiEngine, loads the files defined in the defs/examples/jsoup/ex-1 folder and outputs data to the output/data.txt file.
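A quick way to check the installation is to confirm that the output file exists and to count its lines. The following is a minimal sketch; report_output is a hypothetical helper, not part of Scoopi.

```shell
# report_output FILE
# Prints the number of lines in FILE. Fails with a message on stderr
# when FILE is missing or empty. Hypothetical helper.
report_output() {
    if [ ! -s "$1" ]; then
        echo "no data in $1" >&2
        return 1
    fi
    # Reading from stdin makes wc print only the count; tr strips the
    # padding that some wc implementations add
    wc -l < "$1" | tr -d ' '
}

# Usage, from the scoopi-x.x.x folder after a run:
# report_output output/data.txt
```

Going by the figures quoted in the Docker section, example 1 should report one line and example 10 should report 281.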
As we progress through the guide, we cover the examples one by one. To load another example, modify the scoopi.defs.dir property in conf/scoopi.properties and run Scoopi.
Examples
Scoopi comes with a set of example definition files: ex-1 to ex-14. The examples use HTML pages from the examples/page folder and extract data from them. The pages contain financial data of a company, such as the Balance Sheet, Profit and Loss Account and Share price. Each example builds on the previous one, so that the major portion of the definitions remains the same throughout this guide, which makes the concepts easier to follow.
Examples come in two flavors: JSoup, which uses selectors to query data, and HtmlUnit, which uses XPath. This guide focuses on the JSoup examples, as JSoup is easy to use and light on memory. The HtmlUnit examples are the same as the JSoup ones but use XPath for queries.
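When switching between examples, the value for scoopi.defs.dir can be derived from the flavor and the example number. The sketch below is illustrative only: example_defs_dir is a hypothetical helper, and the htmlunit path is an assumption that mirrors the jsoup layout shown in this guide.

```shell
# example_defs_dir FLAVOR NUMBER
# Prints the scoopi.defs.dir value for an example, for instance
# "example_defs_dir jsoup 10" prints /defs/examples/jsoup/ex-10.
# Hypothetical helper; the htmlunit layout is assumed to mirror jsoup.
example_defs_dir() {
    case $1 in
        jsoup|htmlunit) ;;
        *) echo "unknown flavor: $1" >&2; return 1 ;;
    esac
    case $2 in
        ''|*[!0-9]*) echo "not a number: $2" >&2; return 1 ;;
    esac
    # The guide ships examples ex-1 to ex-14
    if [ "$2" -lt 1 ] || [ "$2" -gt 14 ]; then
        echo "example must be between 1 and 14" >&2
        return 1
    fi
    printf '/defs/examples/%s/ex-%s\n' "$1" "$2"
}

# Usage:
# echo "scoopi.defs.dir=$(example_defs_dir jsoup 10)"
```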
While running the examples you can disable persistence by setting scoopi.useDatastore=false in the conf/scoopi.properties file. There is no harm in running the examples with useDatastore set to true, but ensure that the database is up and running; otherwise the examples throw a database not found error.
In the next chapter, we start with Example 1.