JS-driven webpages and SEO solution


PHP Headless browser PhantomJS Apache/NGINX

One day, while reading about the magic of the AJAX architecture, I asked myself why search engine crawlers are not able to parse (and thus index) JavaScript-driven websites. Today you'll read about a solution that removes this limitation!

WARNING: this tutorial assumes you can install custom packages (PhantomJS) and execute shell commands from a PHP webpage.

The idea behind this solution is to let a headless browser read the web page and return the executed result directly to the crawler. This is necessary because crawlers have little to no support for AJAX / JavaScript in general, and this hasn't changed for some time now. This is an implementation of one of the solutions proposed by Google itself.

There are two kinds of JavaScript-driven websites: the ones that never change page (i.e. they change the URI after the hashbang => example.com/#!my_page) and the ones that simply use AJAX calls to fetch the data to be displayed or sorted. For the first case, both Google and Bing replace the hashbang part with something like ?_escaped_fragment_=my_page, so that it's easy to tell server-side that it's a bot asking for the page.
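
To make the first case concrete, such a site typically reads the fragment after the hashbang and loads the matching content via AJAX. Here's a minimal sketch (loadPage() is a hypothetical helper that fetches and renders the requested section):

:::javascript
// e.g. example.com/#!my_page -> load the "my_page" section without leaving the page
function handleHashbang() {
    var fragment = window.location.hash.replace(/^#!/, ''); // "my_page"
    loadPage(fragment); // hypothetical: fetch and render the section via AJAX
}
window.onhashchange = handleHashbang;
handleHashbang(); // also handle the fragment present on the first load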

Also, Google advises adding a meta tag to tell the crawler it's actually a dynamic page and not a static one (so that it will come back more often even though the page seems to be static).

There are two possible solutions:

  1. to detect that the page is being viewed by a crawling bot and serve a static page written for it
  2. to let an intelligent browser parse the page for the bot and serve the result directly

The former solution is affordable on any host and easier to set up, but it requires nearly double the work (think about writing the website twice).

The latter solution is the best time- and maintenance-wise: you don't have to write the code twice (thus saving time and headaches when changing something) and you don't need extra intelligent redirect code when people land on the escaped-fragment version of your site, but you need to be able to execute shell commands. This solution is also more CPU expensive (though negligibly so, considering how rarely bots visit your website compared to normal users).


Attention: this work is under the Creative Commons Attribution 3.0 Unported licence


js-driven_webpages_crawling.zip (scripts and a simple test page).


Installing software

There are several headless browsers (here you can find a list with some of them). I personally like PhantomJS because it's easy to install and configure, and because it's based upon WebKit (besides being 100% compatible with JavaScript, of course!)... last but not least, it's free software!

The installation may be done via a Linux package manager (like Arch Linux's pacman) or via the installer. In any case, this is the official download page. What matters is that it can be executed as a system command, so Windows users may want to (1) add the PhantomJS folder to the %PATH% environment variable, or (2) put the executable directly in the System32 folder. Alternatively you'll need to specify the absolute path later in the PHP file.

Besides PhantomJS, you'll obviously need a web server (Nginx, Apache, Lighttpd or whatever) able to run PHP pages.

Configuring the web pages

This is pretty easy: you'll need to add a custom data-status attribute to the BODY tag. This indicates whether the page is ready to be snapshotted. Remember to change its value to ready (via JavaScript) once the page is actually ready, or crawling bots will always take the full timeout to load the page!

:::html
<body data-status='loading'>

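For example, here is a minimal sketch of the client-side switch, assuming a hypothetical /data.json endpoint and renderPage() helper (adapt it to your own AJAX code):

:::javascript
// once the AJAX call completes and the content is rendered,
// mark the page as ready for snapshotting
var xhr = new XMLHttpRequest();
xhr.open('GET', '/data.json', true);
xhr.onload = function () {
    renderPage(xhr.responseText);                        // hypothetical: fill the DOM with the data
    document.body.setAttribute('data-status', 'ready');  // the headless browser waits for this
};
xhr.send();
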
The PhantomJS configuration file

This configuration file tells the headless browser what to do. The idea is to pass it the website URL to process and return the result as soon as the page is ready to be read.

:::javascript
// wait_for.js
var system = require('system');

if (system.args.length != 2)
    phantom.exit(1);

var url = system.args[1];
var page = require('webpage').create();
// custom user agent: whitelist it in the server configuration to prevent proxy loops
page.settings.userAgent = "PhantomJS simple processor/1.0";

// print the current page content and quit
function printResult(page) {
    console.log(page.content);
    phantom.exit();
}

function waitFor(page, test, selector, timeout) {
    var now = (new Date()).getTime();

    // DOM nodes cannot be returned from evaluate(), so return booleans instead
    var testExists = page.evaluate(function (test) {
        return document.querySelector(test) !== null;
    }, test);

    // the "test case" selector will never match: print the page as-is
    if (!testExists) {
        printResult(page);
        return;
    }

    var ready = page.evaluate(function (selector) {
        return document.querySelector(selector) !== null;
    }, selector);

    if (ready || timeout <= 0)
        printResult(page);
    else {
        // retry in 50ms, decreasing the remaining timeout
        timeout = timeout - ((new Date()).getTime() - now) - 50;
        window.setTimeout(function () { waitFor(page, test, selector, timeout); }, 50);
    }
}

page.open(url, function (status) {
    // error loading the page
    if (status != "success")
        phantom.exit(1);

    // wait for the document to be ready and then print the result
    waitFor(page, "body[data-status]", "body[data-status='ready']", 5000); // 5 second timeout
});

The script first checks the "test case" selector, so as not to wait for something that will never exist, then checks for the existence of the "selector" element, which in this case is the BODY with data-status set to "ready". If this check fails and the timeout has not expired, it retries 50 ms later; otherwise it prints out the page, ready or not.

The PhantomJS calling page

This page will serve the PhantomJS-processed pages to the crawling bots.

:::php
<?php
    // phantomjs.php
    $siteUrl = "http://mywebsite.com"; // replace with your website URL
    $siteUri = isset($_GET['page']) ? $_GET['page'] : '/';

    $pjs = "phantomjs";
    // Windows users
    // $pjs = "c:\\phantomjs_folder\\phantomjs.exe";
    $config = "/home/user/public_html/mywebsite/wait_for.js";

    // escape the arguments so the request cannot inject extra shell commands,
    // and only process URIs belonging to our own site ($siteUrl is the fixed prefix)
    echo shell_exec($pjs . " " . escapeshellarg($config) . " " . escapeshellarg($siteUrl . $siteUri));
?>

shell_exec is a really dangerous function to call without thinking carefully about what gets executed. This is why we don't pass it user input directly: the requested URI is escaped and appended to our own site URL. We don't want to be a "PhantomJS proxy" for whoever wants to abuse it, so only our own site's URIs get processed.

Configuring the web server

You have to redirect all the crawling bot traffic to the PHP PhantomJS proxy page. You may decide to route even static pages through it, or come up with a smarter or ad-hoc solution. What matters is that you call the proxy like this: phantomjs.php?page=$uri ($uri is nginx-specific). This way the proxy is able to restore the original request.

Pay attention not to create proxy loops: this is why we set up a custom user agent in the configuration file, so that it's easy to prevent PhantomJS itself from being detected as a crawling bot. The following is a simple nginx example configuration:

:::nginx
server {
    server_name example.com;
    root "/srv/www/public_html/example.com/";

    if ($http_user_agent ~* spider|crawl|slurp|bot|feedburner ) {
        rewrite ^(.*)$ /phantomjs.php?page=$1;
    }

    location ~ \.php$ {
        try_files       $uri =404;
        fastcgi_pass    php-farm; # your PHP-FPM unix socket or whatever
        include         nginx.fastcgi.conf; # your FastCGI configuration
    }
}

Alternatively you can insert the following META tag in your webpage to make crawlers add ?_escaped_fragment_=, so that the requested page will be http://example.com/page?_escaped_fragment_= regardless of the hashbang's presence. This way you can perform a regex check on the URI instead of the user agent.

:::html
<meta name="fragment" content="!">

In any case, notice that the user agent check is safer, as not all crawlers are guaranteed to perform this URI change.

Testing the solution

In a UNIX environment it's pretty simple, just run this command:

:::bash
curl -A "Googlebot/2.1" http://yourwebsite.com/your_js_page.[php|html|whatever] > output.html

The resulting output.html page should contain what the crawling bots will see when they browse your website. If the output is still in the "raw" state, it means that the crawling bot detection did not work, or that the 5-second timeout is not sufficient.

- 6th September 2013
