stargeek
PHP news website logo.
home    PHP scripts    articles    seo tools    links    search    contact    shop    realtors


Analyzing Site Usage Statistics with PHP.


Most of the popular site usage statistics analyzation software fall flat when attempting to perform some of the more complicated calculations and reporting. Many times we find ourselves wishing we could record and monitor some statistical information, and then not only have it displayed in a useful and meaningfully way, but also in a way that makes sense to us. Specifically my problems with the common webalizer apache log file parser, are its rough-around-the-edges support for referral logs and its handling of google search terms, as well as the complete absence of visually presented trend, or average, lines through time, and the fact that it is imposible to use webalyzer to view real-time information. On the other side there is the javascript based solutions, similar in implementation to "free counters" these setups require extra code be inserted into all of your pages and run on the client side (the number of things that could go wrong with this is a scary thought, given that most of the information can be gathered, safely, on the server-side).

So we created a custom system that records usage data by page and can analyze it in a number of useful ways, including support for easy to create plugin style pages. We will break down our setup into its 3 component layers:

  • the data layer

    • Mysql Database
    • Cookies and Sessions
  • the request layer

    • What do we put in the Database?
    • The SQL query
  • the display layer

    • Display Method 1: print_r()
    • Scaling raw data for display:
    • Display Method 2: Streched Images
    • Display Method 3: Dynamic Graphs with GD
      • Trend Lines
      • High Water Marks
    • Parsing Google Search Terms

The Data Layer:

There are two main data stores we will be using, one is a mysql table:

CREATE TABLE times ( page text, time text, referrer text NOT NULL, browser text NOT NULL, ip text NOT NULL, timeStamp text NOT NULL, id int(10) unsigned NOT NULL auto_increment, PRIMARY KEY (id) ) TYPE=MyISAM;

and the other is cookies and the session_save_path(). By setting the session_save_path() to a directory we have full access to (especially important on shared hosting setups) we can then perform our calculations on it.


<?php
session_save_path
($path);
?>

We count the number of files in the $path directory with a simple shell command (thanks lv). This code will simply output the number of currently open sessions to the browser, however it is possible to use one of the other shell execution functions to return the data in a variable for display or tracking elsewhere.


<?php
passthru
('ls $path |wc -l');
?>

It is also helpful many times to manually set a timeout session destroy length, here we limit the amount of idle time a session will remain open and "Active" during, after which the server-side session file and the client-side cookie are destroyed.


<?php
session_cache_expire
(20);
?>

The Request/Logging layer:

Once our datastores are setup we need to begin collecting useful information when a user requests a page. A few operations will be performed at this level, including inserting a line into our database, starting or continuing a session, and setting or modifying a user's cookie. To keep this code organized and optimized is important since it will be executed on the majority of page views, to accomplish this we create a single file, called "log.php" , which we can include on every page we'd like to log. One of the biggest hindrances to keeping this file small and fast is the fact that we need a database connection to perform our logging, and creating a new resource when there already may be one open is an expensive process, instead, since in all of our code we use the same variable name for our database connections we can do something like this to save some system resources:


<?php
if(!$linkID)
{
    
//connection code
}
?>    

What do we put in the database?

The PHP variables we will be using to gather data about the user come from the $_SERVER super-global array:

$_SERVER[REQUEST_URI]  is the actual GET request made by the viewers agent
$_SERVER[HTTP_REFERER] the referrer field, or page the viewer came from
$_SERVER[HTTP_USER_AGENT] the viewers browser (or rss reader)
$_SERVER[REMOTE_ADDR] the viewers IP address

We also generate a unix-timestamp server side and a unique id number to order the rows in the table, when we select rows from this table we can order by either of these fields with similar results, or we can select timestamp values within a specified range to get a snapshot of a particular day, or week.

The SQL query

Here is the SQL query we use to record our information:

insert into times (page, time, referrer, browser, ip, timeStamp) values ('$_SERVER[REQUEST_URI]', '$total', '$_SERVER[HTTP_REFERER]', '$_SERVER[HTTP_USER_AGENT]', '$_SERVER[REMOTE_ADDR]', '$now' )";

This file is also the place where we can include our session-based code, including setting the save path and timeout as well as starting a new or continuing an existing session.

The report/display layer

There are many ways to display your usage statistics information visually, and each method has its own fortes and shortcomings, GD library can be used to create great looking, professional quality dynamic images, php's print_r function can be used to quickly display arrays of stored and sorted data for a more granular look, and img tags with specified widths scaled to represent various values provide a sort of intermediary solution.

Display Method 1: print_r()

Using print_r is the easiest and fastest method to display your report data, we are going to use it to display the googlebot crawl data, both by hits per day and by hits per page. Grouping our data in these two methods allows us to keep track of not only when googlebot is crawling our site and how deeps it is going, but also what pages it views the most, and when our new pages are spidered.


<?php
    $result
= mysql_query("select timeStamp, page from times where browser =     'Googlebot/2.1 (+http://www.googlebot.com/bot.html)'",  $linkID);
    
    while (
$line = mysql_fetch_array($result))
    {
        list(
$date, $time) = explode(' ', $line[0]);
        
$page = $line['page'];
        
$crawls[$date]++;
        
$crawl_pages[$page]++;        
    }
    echo
"<pre>";
    
arsort($crawls);
    
print_r($crawls);
    
arsort($crawl_pages);
    
print_r($crawl_pages);
?>

Scaling raw data for display:

The next two methods we will be presenting both require an understand of some simple graphing and statistical math (specifically making sure that the graphs are scaled to best optimize the amount of data and its clarity in the reports). Here is a simple formula for calculating a $scale variable based on (max($hits) where $hits is an array of hits per day) the largest value that we need to graph and the number of pixels (or other units) that we have available to our graph.


<?php
$scale
= $pixels/max($hits);
?>

We can then use this $scale variable, which is generally less than 1, to generate scaled-down representations of our $hits data.


<?php
foreach($hits as $value)
{
    
$scaled[] = $value*$scale;
}
?>

Display Method 2: Streched Images

The simplest use of this scaled data to create a graph is to use a small 1px wide image and stretch it to our adjusted values, in this example 'bar.gif' is a 5 pixel high black image.


<?php
foreach ($scaled as $value)
{    
   
$html .= "<img src='bar.gif' width='$value'>$value<br/>";
}
?>

Display Method 3: Dynamic Graphs with GD

The most impressive looking method of displaying our statistical data covered in this article, is to use php's GD library to dynamically create .png graphs. This method, however, can also be the most complicated solution and take the most time to develop and implement and it requires the gd library php extension be installed on the server. Short of writing an entire tutorial on how to use GD to create images and graphs, there are a few points we are going to explore as they relate to analyzing site usage statistics. If you need a general introduction to using gd and how to create basic graphs with it, here are a few links:

http://www.phpbuilder.com/columns/allan20000830.php3
http://php.weblogs.com/GD

The two peices of data that we find most interesting and helpful when using gd to create a usage statistics graph are trend lines and high water marks.

Trend Lines

Trend lines are dereived from a set of points on the graph marking day by day averages. A line is drawn connecting one day's point to the next and a more stable and less volatile measure of data like raw hits (which is frequently fluctuating) is created. Using these lines you can better gauge in what direction your stats are going over a long term, rather than a single day's performance.

In this example we are using an un-scaled $hits array to calculate our average hits per day value. We then use the $scale value to scale the average values to fit into our graph. Because the points on in a gd generated image place the coordinates 0,0 in the upper left corner, we must subtract our y variable from the total height of the graph to effectively invert the points and display the largest values at the top of the graph.


<?php
foreach($hits as $hit)
{
$total=$total+$hit;
if (
$x)
{
    
$ave = ($alast + $hit)/$x;
    
$xpos = $x*$per_day_pixels;    
    
$scaled_ave = $height-($ave*$scale);
    
$scaled_last = $height-($last*$scale);
}

imagefilledrectangle($image, $xpos-3, $scaled_ave-3, $xpos+3, $scaled_ave+3, $navy);
if (
$last)
{
    
imageline($image, ($xpos-$width), $scaled_last, $xpos, $scaled_ave, $navy);
}
$last = $ave;
$x++;
}
?>

High Water Marks

To gauge your site's peak performance levels on any given statistic, it is useful to graph a high water mark line. This is simple a horizontal line spanning the width of the graph, drawn at the y coordinate equal to the highest value of the statistic we are measuring.


<?php
$highest
= $height-(max($hits)*$scale);
imageline($image, 0, $highest, $width, $highest, $red);
?>

Using these two lines in addition to the more obvious plots of raw hits, unique hits and google generated hits, we are able to come to a better understanding of our web site's performance over time.

Parsing Google Search Terms

One of my favorite pieces of information that can be gathered from statistical analyisis is that of google search terms. By parsing the $_SERVER['HTTP_REFERRER'] value a user carried when arriving via a google search, we can extract the terms they used to find our site. This sort of information can be an invaluable source of input on not only what search terms you should focus on SEO'ing for, but also what your viewers are looking for. This data can be broken down into either individual words, sorted and displayed, or the phrases can be left whole. We prefer to use the whole phrases, since most popular one word search terms are highly competitive and very vague. In this example $line is the row of the our log table that we are currently analyzing.


<?php
    
if(stristr($line['referrer'], 'google.com'))
        {
                
$ref = strtolower($line['referrer']);                
                
$ref = str_replace('http://www.google.com/search?', '', $ref);
                
$ref = str_replace('hl=en&lr=&ie=utf-8&oe=utf-8&', '', $ref);
                
$ref = str_replace('hl=en', '', $ref);
                
$ref = str_replace('ie=utf-8', '', $ref);
                
$ref = str_replace('oe=utf-8', '', $ref);
                
$ref = str_replace('sourceid=navclient', '', $ref);
                
$ref = str_replace('btng=google search', '', $ref);
                
$ref = str_replace('ie=iso-8859-1', '', $ref);
                
$ref = str_replace('ie=windows-1251', '', $ref);            
                
$ref = str_replace('safe=off', '', $ref);
                
$ref = str_replace('newwindow=1', '', $ref);
                
$ref = str_replace('num=20', '', $ref);
                
$ref = str_replace('hl=nl', '', $ref);
                
$ref = str_replace('btng=google search', '', $ref);
                
$ref = str_replace('inlang=ru', '', $ref);
                
$ref = str_replace('hl=ru', '', $ref);
                
$ref = str_replace('lr=', '', $ref);                
            
                
$x = explode('&', $ref);
                foreach (
$x as $var)
                {
                    if(
stristr($var, 'q=') and (!stristr($ref, '?')))
                    {
                        
$ref = urldecode (str_replace('q=', '', $var));
                        
$ref = str_replace('&', '', $ref);
                        
$wordFreq[$ref]++;
                    }                                                                                    
                }        
                
        }
?>

This will give us an array, $wordFreq, that we can display either with a simple print_r or by using stretched images, in either case you will probably want to perform an arsort() on it first.

Putting it all together:

In creating a report page for our site usage analyzation system, we break up the data presented on the various pages. This is helpful due to the strain looping through and processing a large number of mysql rows can cause, especially when the database begins to get large. On our main report page we present, total hits per month, day and week, along with similar statistics for unique hits. We use stretched image graphs to present referrer, hits by page and google search query data, and one gd image to display our raw, average, unique and google driven over time information. Other data is contained in much smaller and faster files, one file for the number of users currently on the site and one for google crawl statistics.

Email the author.