I’ve written before about using CasperJS for doing headless testing but it’s also very useful as a web scraper. Before we start, a couple of caveats – firstly, be sure that you have permission to scrape and use the content you’re after; secondly, be a good citizen and space your requests out so as to not overload the server; thirdly, if the site you’re scraping doesn’t use much/any JavaScript for navigation you’re likely to get a faster result by using tools which just grab the HTML of a page such as WWW::Mechanize, Mechanize, HtmlAgilityPack, Beautiful Soup or HtmlUnit.
I find the most compelling use case for CasperJS scraping is when a site relies on a lot of JavaScript to navigate through the content; a recent project was a perfect example as it uses AngularJS, loads all the content asynchronously and uses infinite scrolling instead of pagination. I’ll walk you through some of the challenges I overcame during this project.
If you’re not familiar with AngularJS, it’s a JavaScript framework that allows you to build single-page web applications. It handles a lot of the plumbing, data access and binding for you. What that means is that, in a lot of cases, what comes back from the server is just an empty HTML skeleton with lots of JavaScript to fetch the data and display it. This can be great when you want to build dynamic applications but it makes scraping the content a lot harder as using a traditional HTML-only scraper won’t give you any content.
By using CasperJS I was able to run a real browser (albeit one which doesn’t display anything) to execute the JavaScript, render the page and build the DOM for me. The main thing I had to tackle was telling CasperJS to wait for a specific element that I knew was part of the finalised content before grabbing what I was after.
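// thenOpen (unlike open) accepts a step callback that runs once the page has loaded
casper.thenOpen(url, function() {
    // Wait for an element that only exists once the content has rendered
    this.waitUntilVisible('h2.head', function() {
        var categories = [];
        casper.each(this.getElementsInfo('a.label'), function(casper, element, j) {
            var category = element["text"];
            categories.push(category);
        });
    });
});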
The waitUntilVisible method will, as the name suggests, wait until a particular element becomes visible before processing the next step. This is useful with asynchronous loading pages as it lets us hook our scraping onto the loading of particular content within the page.
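The wait can also be given a timeout and a handler to run if the element never appears, since waitUntilVisible accepts both as optional arguments. What follows is only a minimal sketch: the helper name waitForContent and the 10 second default are my own assumptions, not part of the original script.

```javascript
// Hypothetical helper: wait for `selector` to become visible, then run
// `onReady`; report a timeout instead of failing silently.
function waitForContent(casper, selector, onReady, timeoutMs) {
    casper.waitUntilVisible(
        selector,
        onReady,
        function onTimeout() {
            casper.echo('Timed out waiting for ' + selector);
        },
        timeoutMs || 10000 // assumed default, in milliseconds
    );
}
```

Inside a step this would be called as waitForContent(casper, 'h2.head', function() { /* scrape here */ });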
Infinite scrolling is a navigation mechanism that is being adopted by numerous sites in place of pagination. If you use Twitter or Facebook you will have seen it – scroll to the bottom of the page quickly enough and you’ll catch the little spinner pixies doing their dance while the next batch of content is loaded. While this can make for a fluid user experience, it’s another potential pitfall for the web scraper. This one took me quite a while to nail down.
What I needed was to scroll down to the bottom of the page, see if the spinner showed up (meaning there’s more content still to come), wait until the new content had loaded and then keep scrolling until no more new content was shown. After much trial and error I found out about the scrollPosition property of the underlying PhantomJS WebPage object which will properly scroll the page to the position required.
var urls = [];

function tryAndScroll(casper) {
    casper.waitFor(function() {
        // Scroll a further 4000px down the page
        this.page.scrollPosition = { top: this.page.scrollPosition["top"] + 4000, left: 0 };
        return true;
    }, function() {
        // If the spinner is showing, more results are on their way
        var info = this.getElementInfo('p[loading-spinner="!loading"]');
        if (info["visible"] == true) {
            this.waitWhileVisible('p[loading-spinner="!loading"]', function () {
                this.emit('results.loaded');
            }, function () {
                this.echo('next results not loaded');
            }, 5000);
        }
    }, function() {
        this.echo("Scrolling failed. Sorry.").exit();
    }, 500);
}

// Each time a batch of results has loaded, try scrolling again
casper.on('results.loaded', function () {
    tryAndScroll(this);
});

// thenOpen (unlike open) accepts a step callback that runs once the page has loaded
casper.thenOpen('http://my.site/search-results', function() {
    this.waitUntilVisible('a.result-large', function() {
        tryAndScroll(this);
    });
});

// Once scrolling has finished, collect the link of every result on the page
casper.then(function() {
    casper.each(this.getElementsInfo('a.result-large'), function(casper, element, j) {
        var url = element["attributes"]["href"];
        urls.push(url);
    });
});
So what we’re doing here is using the CasperJS event handler to detect when a new batch of content has been loaded and to keep on scrolling until the spinner is no longer shown. If the spinner is shown, we wait until it disappears again using the waitWhileVisible method.
Now that I had the data I wanted, the next step was to write it to a file. This one is fairly simple. As I grabbed each detail url and extracted what I needed, I also constructed a CSV string in memory. At the end I wrote it all to a file using the PhantomJS FileSystem API. As this was a bit of a quick and dirty hack I just strung together the values I needed with a comma in between; there may be more robust solutions out there which handle stuff like commas within a data field but I didn’t need that.
casper.then(function() {
    // 'w' overwrites any existing file; data is the CSV string built up earlier
    require('fs').write("outputdata.csv", data, 'w');
});
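If your data can contain commas, quotes or line breaks, a small quoting helper is probably enough to keep the CSV well formed. This is a sketch rather than what I used on the project: the function names csvField and csvRow and the sample values are made up for illustration, and the quoting follows the usual convention of wrapping a field in double quotes and doubling any quotes inside it.

```javascript
// Quote a single value if it contains a comma, double quote or line break.
function csvField(value) {
    var text = String(value);
    if (/[",\r\n]/.test(text)) {
        return '"' + text.replace(/"/g, '""') + '"';
    }
    return text;
}

// Join one record's values into a single CSV line.
function csvRow(values) {
    return values.map(csvField).join(',');
}

// Sample data (hypothetical): a header line followed by one record.
var data = [
    csvRow(['title', 'url']),
    csvRow(['Hello, "world"', 'http://my.site/item/1'])
].join('\n') + '\n';
```

The resulting data string can then be passed to the fs write call exactly as before.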
Came across this lately; it seems pretty neat as well. Great for complex JS sites, but not fully headless though. http://www.slimerjs.org/index.html
Thanks 🙂
Was unable to get your script working, but CasperJS now provides casper.scrollToBottom(). Sadly, that breaks on Twitter when using jQuery…
casper.create({ clientScripts: ['jquery.js'] }); Haven’t found a way to infinite scroll Twitter and load content from casper.evaluate; if you have any tips let me know.
I found this link to be similar to what you need as a solution: http://stackoverflow.com/questions/17521065/casperjs-can-not-trigger-twitter-infinite-scroll
Scroll to the bottom of the page to see how to go about it. Thanks.
I don’t see why you need to bother with the JS. SPAs are even simpler to scrape than regular pages: you only need to analyze the Ajax calls to the REST server (Chrome DevTools is the perfect tool for the job) and replicate those calls in your scraper. It is faster, more reliable and simpler to process, since you receive JSON or XML formatted data.
Is there any reason to use CasperJS in your case ?
Please, could you do a demo of this to explain to a novice? I don’t understand your technical terms and you seem to hold the key with your comments, thanks.
I think there is confusion here between AngularJS and Ajax. AngularJS doesn’t necessarily make Ajax calls (although it can, using the $http service).
Great insight but what if you want to wait until an element is removed from page instead of visible?
To wait, you could implement a wait function for a specified amount of time. Load the page in question in a browser, and check its JavaScript file if need be, to see how long the element takes to be removed from the DOM, then wait for that long plus 2 seconds.
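As an alternative to a fixed delay, CasperJS also has waitWhileSelector, which waits until no element matches a selector any more. A rough sketch, assuming a made-up helper name waitForRemoval and a 5 second default timeout:

```javascript
// Hypothetical helper: run `onGone` once no element matches `selector`.
function waitForRemoval(casper, selector, onGone, timeoutMs) {
    casper.waitWhileSelector(
        selector,
        onGone,
        function onTimeout() {
            casper.echo(selector + ' is still in the DOM');
        },
        timeoutMs || 5000 // assumed default, in milliseconds
    );
}
```

If the element is only hidden rather than removed, waitWhileVisible (used in the post above) is likely the better fit.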