JavaScript and WebDriver

So far in this guide, the examples scrape data from regular HTML pages. However, many web pages generate and render their content with JavaScript or Ajax. In this and the next chapter, we explore ways to scrape such pages seamlessly.

Dynamically generated pages

The website toscrape.com has many endpoints that show the quotes in different ways, some of them generated through DOM manipulation.

Open the JavaScript - JavaScript generated contents page and view its page source. It contains a script that loads a set of quotes, held in an array, into the page's DOM. While this page executes its script on page load, other pages execute scripts on user input such as button clicks, selection of a select field or form input. JSoup simply can't scrape such pages as it lacks a JavaScript processor, while HtmlUnit can scrape them when its JavaScript feature is enabled.
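
For instance, a minimal HtmlUnit sketch with its JavaScript feature enabled might look like the following (HtmlUnit 2.x package names; the wait timeout is an arbitrary choice, not a value from Scoopi):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitJsDemo {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // enable the JavaScript feature so page scripts run
            client.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = client.getPage("http://quotes.toscrape.com/js/");
            // give background scripts up to 5 seconds to finish
            client.waitForBackgroundJavaScript(5000);
            // asXml() returns the DOM after the scripts have run
            System.out.println(page.asXml().contains("quote"));
        }
    }
}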

To make things even more complex, the website hosts Scroll - infinite scrolling pagination, where Ajax requests are made as we scroll down to fetch quotes and load them into the DOM. Even HtmlUnit fails to handle such complex browser behaviour.

Selenium WebDriver

Scoopi can scrape any page without a hitch as long as the page source contains all the elements as rendered by the browser. All we need to do is catch hold of the fully loaded page and pass its page source to the parser step. However, that task is not trivial, and only a full-fledged browser can do it.

Selenium WebDriver is the leading browser automation tool. It controls and interacts with browsers such as Google Chrome, Firefox and Safari, and can simulate user interactions and load pages exactly as the browser does. Scoopi has a robust and extensible workflow, script engine and plugin framework, which makes it quite easy to plug in Selenium WebDriver to handle any sort of page.
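
As a standalone illustration (not Scoopi code), a bare-bones Selenium WebDriver session that loads the page, waits for the script-generated content and grabs the rendered page source might look like this; it assumes the Selenium 4 API and the geckodriver set-up described in the next section:

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WebDriverDemo {
    public static void main(String[] args) {
        // assumption: geckodriver lives in $HOME/.gecko, as set up below
        System.setProperty("webdriver.gecko.driver",
                System.getProperty("user.home") + "/.gecko/geckodriver");
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://quotes.toscrape.com/js/");
            // wait until the script-generated quotes appear in the DOM
            new WebDriverWait(driver, Duration.ofSeconds(10)).until(
                    ExpectedConditions.presenceOfElementLocated(
                            By.cssSelector("div.quote")));
            // the page source now holds the fully rendered elements,
            // ready to hand over to a parser
            System.out.println(driver.getPageSource().length());
        } finally {
            driver.quit();
        }
    }
}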

In earlier examples, the loader step fetches web pages through the org.codetab.scoopi.step.extract.PageLoader step class and passes their contents to the parser step. The PageLoader class can handle only regular pages; to process JavaScript and Ajax pages, the loader step has to use the org.codetab.scoopi.step.extract.DomLoader class, which internally uses Selenium WebDriver.

Install Firefox Driver

Selenium WebDriver is an interface that allows programs to interact with browsers such as Firefox or Chrome. To use it, we need to install GeckoDriver, which is a proxy for clients to interact with Gecko-based browsers such as Firefox.

Download the appropriate GeckoDriver for your system from the GitHub GeckoDriver Releases page and extract it. Next, create a directory named .gecko in the user home directory and copy the geckodriver executable binary to it. In Linux, it is $HOME/.gecko and in Windows C:\Users\${logged_in_user_name}\.gecko.

# linux

mkdir $HOME/.gecko
cp geckodriver $HOME/.gecko

# windows

mkdir C:\Users\${logged_in_user_name}\.gecko
copy geckodriver.exe C:\Users\${logged_in_user_name}\.gecko

In case the driver binary is placed in any other folder, set scoopi.webDriver.driverPath in the conf/scoopi.properties file.

# linux

scoopi.webDriver.driverPath=/some-folder/geckodriver

# windows

scoopi.webDriver.driverPath=C:\some-folder\geckodriver.exe
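
Scoopi wires the driver up internally from this property. Purely for illustration, here is a minimal sketch of how a driver path can be handed to Selenium directly through GeckoDriverService; the path is an assumption, matching the Linux example above:

import java.io.File;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.GeckoDriverService;

public class DriverPathDemo {
    public static void main(String[] args) {
        // point Selenium at a geckodriver binary in a non-default folder
        GeckoDriverService service = new GeckoDriverService.Builder()
                .usingDriverExecutable(new File("/some-folder/geckodriver"))
                .build();
        WebDriver driver = new FirefoxDriver(service);
        driver.quit();
    }
}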

JavaScript generated contents

The Quote Example 1 scrapes quotations from JavaScript generated contents. This page holds a set of quotes in an array and adds them to the page DOM on page load in the browser. It also has a pagination link to navigate to the next page, which again uses JavaScript to generate its contents.

The locator and task snippets from this example are shown below.

defs/examples/quote/jsoup/ex-1/job.yml

locatorGroups:
  quoteGroup:
    locators: [
       { name: quotes, url: "http://quotes.toscrape.com/js/" }  
    ]

taskGroups:
  quoteGroup:
    quoteTask:
      dataDef: quote
      steps: 
        jsoupDefault:
          loader:
            class: "org.codetab.scoopi.step.extract.DomLoader"
            previous: seeder 
            next: parser

The quoteTask uses the jsoupDefault steps and overrides its loader step to use the DomLoader class.

The dataDefs from the example are shown below.

defs/examples/quote/jsoup/ex-1/job.yml

dataDefs:
  quoteLink:
    query:
      block: "li.next"              
    items: [ 
      item: { name: "link",  selector: "a:nth-child(1) attribute: href", linkGroup: quoteGroup, 
              prefix: [ "http://quotes.toscrape.com" ],  
              linkBreakOn: [ "http://quotes.toscrape.com/js/page/4/" ] },
    ]

  quote:
    query:
      block: "body > div div.quote:nth-child(%{item.index})"
      selector: "span:nth-child(1)"            
    items: [
      item: { name: "quote", indexRange: 4-13, value: "quote"},      
    ]
    dims: [
      item: { name: "by", selector: "span:nth-child(2) > small" },
      item: { name: "tags", selector: "div a" },     
    ]

There is nothing new in the dataDefs; they are the same as the ones in earlier examples. The file defines two dataDefs - quote, to scrape quotes from the page, and quoteLink, to paginate to the next page. The pagination breaks when it encounters page 4, as linkBreakOn is set to the URL of page 4.
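
To make the pagination logic concrete, here is a rough standalone equivalent of what quoteLink and linkBreakOn automate, written in plain Selenium rather than Scoopi (class name and structure are illustrative only):

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class PaginationDemo {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            String breakOn = "http://quotes.toscrape.com/js/page/4/";
            driver.get("http://quotes.toscrape.com/js/");
            while (true) {
                // the quoteLink block: li.next, first anchor's href
                List<WebElement> next =
                        driver.findElements(By.cssSelector("li.next a"));
                if (next.isEmpty()) {
                    break; // no next link - last page reached
                }
                String href = next.get(0).getAttribute("href");
                if (breakOn.equals(href)) {
                    break; // linkBreakOn: stop when the page 4 url turns up
                }
                driver.get(href); // follow the pagination link
            }
        } finally {
            driver.quit();
        }
    }
}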

In the next chapter, we explain the use of scripts to emulate browser interactions.