Pagination

Many pages provide pagination buttons or links to navigate through a series of pages. We can use the link locators described in the previous chapter to scrape paginated pages.

The Books example folder contains three HTML pages, each with a pagination link at the bottom of the page.

defs/examples/book/page/page-1.html


<div>
    <ul class="pager">
       <li class="current">Page 1 of 50</li>
       <li class="next"><a href="page-2.html">next</a></li>
    </ul>
</div>

Books Example 4 goes through each page and scrapes the book items.

locatorGroups:
  bookGroup:
    locators: [
       { name: "books", url: "/defs/examples/book/page/page-1.html" }
    ]

taskGroups:
  bookGroup:
    bookTask:
      dataDef: bookData

    linkTask:
      dataDef: bookLink
      steps:
        jsoupDefault:
          process:
            class: "org.codetab.scoopi.step.process.LocatorCreator"
            previous: parser
            next: seeder

The locator defines only the first page. The bookTask parses book items from the page, while the linkTask scrapes the next-page link and creates a locator that points to the next page. The process continues until the last page.
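The loop formed by these two tasks can be pictured with a small Python sketch. It only illustrates the control flow and is not Scoopi's implementation: the `PAGES` dictionary, the `paginate` function and the book titles are hypothetical stand-ins for parsed pages.

```python
# Hypothetical stand-in for three parsed pages: each has some book items
# and an optional "next" link, as the pager block would provide.
PAGES = {
    "page-1.html": {"books": ["Book A", "Book B"], "next": "page-2.html"},
    "page-2.html": {"books": ["Book C"], "next": "page-3.html"},
    "page-3.html": {"books": ["Book D"], "next": None},
}


def paginate(first_page, pages):
    """Visit pages starting at first_page, following each next link."""
    visited = []
    books = []
    queue = [first_page]  # plays the role of the single initial locator
    while queue:
        url = queue.pop(0)
        page = pages[url]
        visited.append(url)
        books.extend(page["books"])   # role of bookTask: scrape the items
        if page["next"] is not None:  # role of linkTask: queue the next page
            queue.append(page["next"])
    return visited, books


visited, books = paginate("page-1.html", PAGES)
print(visited)
print(books)
```

Only the first page is named up front; every later page enters the queue because the previous page linked to it, and the loop ends when a page has no next link.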

The datadef to scrape the link is as below.

dataDefs:
  bookLink:
    query:
      block: "li[class='next']"
    items: [
      item: { name: "link",  selector: "a attribute: href",
        linkGroup: bookGroup, prefix: [ "/defs/examples/book/page/" ] },
    ]

It creates a new locator pointing to the scraped link and assigns it to bookGroup, so that the bookGroup tasks, bookTask (which scrapes book items) and linkTask (which scrapes the next-page link), are executed on the new page.
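As a rough illustration of how the new locator URL comes about, the Python sketch below pulls the `href` out of the `li` element with class `next` and prepends the prefix. It assumes the prefix is simply concatenated with the scraped value; `NextLinkParser` and `next_locator_url` are hypothetical names, and Scoopi itself does this work with jsoup rather than with this code.

```python
from html.parser import HTMLParser

PAGER_HTML = """
<div>
    <ul class="pager">
       <li class="current">Page 1 of 50</li>
       <li class="next"><a href="page-2.html">next</a></li>
    </ul>
</div>
"""


class NextLinkParser(HTMLParser):
    """Collect the href of the first <a> inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self.in_next = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "next":
            self.in_next = True
        elif tag == "a" and self.in_next and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next = False


def next_locator_url(html, prefix):
    """Return prefix + scraped href, or None when there is no next link."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.href is None:
        return None
    return prefix + parser.href  # assumption: prefix is simply prepended


url = next_locator_url(PAGER_HTML, "/defs/examples/book/page/")
print(url)
```

A page without a `next` item yields `None`, which corresponds to the last page, where no further locator is created.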

In the previous chapter, we scraped the link in snapshotGroup and assigned it to bsGroup for processing. In pagination, however, there is no shift in task group: we scrape the link in bookGroup and assign it back to the same group.

Breaking the pagination

Books Example 5 breaks pagination using linkBreakOn as shown below.

bookLink:
    query:
      block: "li[class='next']"
    items: [
      item: { name: "link",  selector: "a attribute: href",
        linkGroup: bookGroup, prefix: [ "/defs/examples/book/page/" ],
        linkBreakOn: [ "/defs/examples/book/page/page-3.html" ] },
    ]

With linkBreakOn, it paginates through the first and second pages and breaks when the scraped link value is page-3.html. The break value must be the full link, including the prefix, if any.
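The need for the full link can be seen in the following Python sketch of the break check. It assumes the prefixed link is compared for exact equality against the break values, which is an assumption about the behaviour rather than a description of Scoopi's code; `follow_links` and `NEXT_HREF` are hypothetical.

```python
PREFIX = "/defs/examples/book/page/"

# Hypothetical map from a page URL to the relative href of its next link.
NEXT_HREF = {
    PREFIX + "page-1.html": "page-2.html",
    PREFIX + "page-2.html": "page-3.html",
    PREFIX + "page-3.html": None,
}


def follow_links(first_url, next_href, prefix, break_on):
    """Follow next links from first_url until one matches break_on."""
    visited = []
    url = first_url
    while url is not None:
        visited.append(url)
        href = next_href.get(url)
        if href is None:
            break  # last page: nothing more to follow
        candidate = prefix + href  # full link, prefix included
        if candidate in break_on:
            break  # break before a locator for this link is created
        url = candidate
    return visited


with_full = follow_links(PREFIX + "page-1.html", NEXT_HREF, PREFIX,
                         [PREFIX + "page-3.html"])
with_bare = follow_links(PREFIX + "page-1.html", NEXT_HREF, PREFIX,
                         ["page-3.html"])
print(len(with_full), len(with_bare))
```

With the full link as the break value, only the first two pages are visited. With the bare `page-3.html`, nothing matches the prefixed link, so all three pages are visited.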

The next chapter covers encoders and appenders to output data.