Gotz uses a set of definition files to extract data from HTML pages. Because HTML pages are unstructured, the definitions needed to extract and handle data can become quite complex. To help you learn the XML elements used in the definition files, the Gotz distribution comes with a set of examples in the defs/examples folder. The examples are named ex-1, ex-2 and so on, each more complex than the last.

Gotz Definition Files

Gotz creates the data model based on XML definition files. For now, we define two files: bean.xml and job.xml.

Bean file

The basic definition file in Gotz is bean.xml, which wires together all the classes required by GotzEngine. We can specify the bean file using the gotz.beanFile configuration property, which is normally set in the gotz.properties file located in the conf folder. By default, it is set to defs/examples/jsoup/ex-1/bean.xml, which loads example 1.

The bean.xml in defs/examples/jsoup/ex-1 is shown below.

defs/examples/jsoup/ex-1/bean.xml

<gotz xmlns="http://codetab.org/gotz">
    <bean name="task" xmlFile="job.xml" />
</gotz>

The <bean> element specifies the XML file to load. In this example, it tells Gotz to load job.xml, which is located in the same folder as bean.xml.

Job file

The job file defines the model objects required to run Gotz. The broad outline of the job file is:

  • <gotz>
    • <locators>
    • <dataDefs>
    • <fields>
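
As a rough sketch, the outline above corresponds to a skeleton like the following. The namespace is assumed to be the same one shown in bean.xml, and the contents of each element are left out here because they are covered in the sections and chapters that follow.

```xml
<gotz xmlns="http://codetab.org/gotz">
    <locators>
        <!-- locators: the pages to fetch or load -->
    </locators>
    <dataDefs>
        <!-- data definitions, described in the next chapter -->
    </dataDefs>
    <fields>
        <!-- fields, explained in later chapters -->
    </fields>
</gotz>
```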

The root element <gotz>, its XML namespace and <fields> are explained in later chapters. For now, we focus on <locators> and <dataDefs>.

Locators

In Gotz, a locator specifies the HTML page to fetch from the Internet or to load from the local file system.

The <locators> element specifies a list of locators and the group they belong to.

In the example job.xml, the locators are defined as follows.

defs/examples/jsoup/ex-1/job.xml

<locators group="quote">
   <locator name="acme" url="/defs/examples/page/acme-quote.html" />
</locators>

Here we define a single locator named acme. The url attribute points to the local HTML file acme-quote.html, which is in the defs/examples/page folder.

To scrape pages from a website, we specify the actual address of the page, such as http://example.org, in the url attribute.

The group attribute groups a set of locators. Here the locator acme belongs to the group quote.
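
To illustrate grouping, the sketch below places two locators in the quote group: the local acme page from the example, and a second locator that points to a web address. The second locator and its name, example, are hypothetical and are not part of the ex-1 files.

```xml
<locators group="quote">
   <!-- local file, as in ex-1 -->
   <locator name="acme" url="/defs/examples/page/acme-quote.html" />
   <!-- hypothetical locator fetched from the web -->
   <locator name="example" url="http://example.org" />
</locators>
```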

In the next chapter, we describe DataDef and its components.