[ Home | What We Do | Our Clients | Press & Events | Library | Contact Us ]


Log Files for Business Managers:
A Tutorial Guide to What Matters About Log Files

Fastwater Rapids vol. 1.10, 2Dec98

by Bill Zoellick
 

Web server log files are central to any effort to analyze a web business.  Log files, or equivalent information gathered directly from the network, are a key source of the data used in the web business analysis process.  Despite this central role, however, many people concerned with making web business decisions -- marketers, product managers, business unit managers, and others -- have only a vague idea of what is really in a log file.  This makes it hard for them to understand the full range of possibilities opened by log file data.  It also gets in the way of knowing what is NOT possible.

We have discovered that there is a good explanation for the lack of log file literacy:  we have been unable to find any good, non-technical, easily available descriptions of log file data and how it is used.

This week's Rapids addresses this situation.   What follows is a basic introduction to log files.  Here is an outline of what we cover ...

  1. What is in a Log File?
  2. How do You Know Where Visitors Come From?
  3. How do You Get Useful Information From Log Files?
  4. Why do the Numbers From Different Log File Analysis Packages Disagree?
  5. Do Log Files Scale Up For Larger Sites?
  6. The Bigger Picture: Log Files and Other Data
  7. Getting the Most from Log Files
Our goal is to make you "log file literate," so that you can talk to the technical people in your company and be able to say just what you want, and to know that what you want is doable and reasonable.  Getting this job done means writing an article that is more like a short chapter in a book.  We suggest that you skim it, pick up the main points, spend some time on the Getting the Most from Log Files section, and then file this article for future reference.  You will need it the next time you sit down to think seriously about gathering and analyzing visitor data on your site.

What is in a Log File?

The best way to find out is by looking at one.  We will look at a small bit of the log file from our own Fastwater site for Friday, November 27 -- the day after Thanksgiving.  We'll break it apart bit by bit, so that you can get a feel for how the entries in the file fit together.

We start with a person whose initial entry on to the Fastwater site was on the page that lists our Network Economy Practices (NEPs) case studies.  The log file contains a line of text that captures this visit:

207.94.105.77 - - [27/Nov/1998:12:09:53 -0700] "GET /papers/NEPs/ HTTP/1.1" 200 3528

Let's decode this one line -- looking at other lines will be simpler if you know what everything means.  This entry, like every other in the file, consists of 7 distinct fields.  In this case, the information they contain is as follows:
 

Contents | Field Name | What it is
207.94.105.77 | remotehost | The IP address of the host computer from which the request came.
- (first hyphen) | rfc931 | The remote login name of the user -- this is typically not available, which is what the hyphen tells us in this case.
- (second hyphen) | authuser | User name of an authenticated user -- Most web servers provide a simple, basic method for user login. When this method is used, the user name is recorded in this field. However, due to the method's limitations, most sites do not make extensive use of it, and the user names stored in this field are not generally useful.
[27/Nov/1998:12:09:53 -0700] | date | Date and time of the request, measured to seconds.
"GET /papers/NEPs/ HTTP/1.1" | request | The HTTP request, as it came from the client.  In this case we can see that they requested the main, index page within our NEPs collection.
200 | status | The HTTP status returned to the client.  "200" means the request was filled successfully.
3528 | bytes | The number of bytes returned.
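For readers who want to hand this to their technical staff, the seven fields can be pulled apart with a few lines of code.  What follows is a minimal sketch in Python, not the method used by any particular analysis package; the names CLF_PATTERN and parse_line are our own, chosen for illustration.

```python
import re
from datetime import datetime

# One entry in the "common" log format: seven fields, in this order.
CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+)\s+(?P<rfc931>\S+)\s+(?P<authuser>\S+)\s+'
    r'\[(?P<date>[^\]]+)\]\s+"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<bytes>\d+|-)'
)

def parse_line(line):
    """Split one log entry into its seven named fields (illustrative helper)."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None  # not a well-formed entry
    fields = match.groupdict()
    # Turn the timestamp text into a timezone-aware datetime.
    fields["date"] = datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # A hyphen in the bytes field means that no content was returned.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

entry = parse_line(
    '207.94.105.77 - - [27/Nov/1998:12:09:53 -0700] '
    '"GET /papers/NEPs/ HTTP/1.1" 200 3528'
)
print(entry["remotehost"], entry["request"], entry["status"], entry["bytes"])
```

Lines that do not fit the pattern come back as None, so a real tool would want to count and report them rather than drop them silently.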

This request for the NEPs index was followed immediately by three more requests (notice the times) ...

207.94.105.77 - - [27/Nov/1998:12:09:54 -0700] "GET /style/display.css HTTP/1.1"  200 2803
207.94.105.77 - - [27/Nov/1998:12:09:55 -0700] "GET /images/fw_logo_small.gif HTTP/1.1" 200 1488
207.94.105.77 - - [27/Nov/1998:12:09:55 -0700] "GET /images/bullet7.gif HTTP/1.1" 200 140

Our visitor did not intentionally ask for these files.  Instead, the requests happened "automatically" as a consequence of the request for the index, since these are files that are referenced in the HTML for the index page.  The first is the Fastwater style sheet used for "display" pages, the second is our logo, and the third is a little bullet that we use in our navigation bar at the top of the page.  (There is one on this page that you are reading.)

The fact that the log file contains EVERY request from the client, including requests for style sheets, images, and so on, is important.   It is both good news and bad news.
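Because the log mixes page requests with the automatic requests for style sheets and images, analysis usually starts by setting the automatic ones aside.  Here is a rough sketch in Python, assuming that the file extension is a good enough signal (it is not always).  The extension list and the function name are our own illustrative choices.

```python
import os

# Extensions that we assume belong to embedded files rather than to pages.
EMBEDDED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js"}

def is_page_request(request):
    """Guess whether a request such as 'GET /x.htm HTTP/1.1' is a page view."""
    parts = request.split()
    if len(parts) < 2:
        return False
    path = parts[1].split("?", 1)[0]  # drop any query string
    extension = os.path.splitext(path)[1].lower()
    return extension not in EMBEDDED_EXTENSIONS

requests = [
    "GET /papers/NEPs/ HTTP/1.1",
    "GET /style/display.css HTTP/1.1",
    "GET /images/fw_logo_small.gif HTTP/1.1",
    "GET /images/bullet7.gif HTTP/1.1",
]
page_views = [r for r in requests if is_page_request(r)]
print(page_views)
```

Of the four requests from our visitor, only the request for the NEPs index survives the filter.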

One of the most interesting things about log files is that they can sometimes allow you to string together the different parts of one person's visit.  We'll talk about the limitations of this guesswork later.  In the Fastwater log file for November 27, it is a pretty good bet that the person visiting the Network Economy Practices index went on, next, to jump up to our home page.  (The "GET /" request in the line below is a request for the top page in our site hierarchy.)

207.94.105.77 - - [27/Nov/1998:12:10:16 -0700] "GET / HTTP/1.1" 200 10120

We can also see that they did not spend a great deal of time looking at the NEPs index, since only about 20 seconds passed between the delivery of the last part of the index page and the request for the home page.

After the request for our home page, the log file contains 17 lines documenting requests for the different images on that page.  For the most part, these requests are not very useful to us, since they happen automatically.  Consequently, we won't look at them in detail.  We should note, however, that if you were diagnosing performance, rather than studying visitor behavior, the log file would tell you that it took a little over 15 seconds for this visitor to get everything from our home page.

He or she then spent something like another 15 seconds looking over the home page, and then decided to click on the "JOIN" image to jump to our membership sign up page.  This is getting interesting ...do we have a new customer here?

207.94.105.77 - - [27/Nov/1998:12:10:45 -0700] "GET /membership.htm HTTP/1.1" 200 6465

The visitor spent about 35 seconds studying the membership offer, and then followed one of the links from that page to go back to the NEPs index -- getting another look at the list of available case studies.

207.94.105.77 - - [27/Nov/1998:12:11:21 -0700] "GET /members/NEPs/index.shtml HTTP/1.1" 200 3875

He or she studied this page again for perhaps thirty seconds.  The next log file entry shows a request for the index of business and issues analysis articles contained in our Fastwater Rapids publications:

207.94.105.77 - - [27/Nov/1998:12:12:04 -0700] "GET /members/rapids/index.shtml HTTP/1.1" 200 12647

It is likely that the visitor used the "back" button to return to our membership page before moving to the list of Rapids articles -- the log file would not show that because the membership page was already "cached" by the visitor's browser.  So ... log file entries do not necessarily show you the whole sequence of moves by a visitor.

This visitor spent a good while, more than a minute and a half, reading through the summaries of Rapids issues before deciding to have a look at issue 6, which looks at how Bay Networks is using their website to manage relationships with their distribution channel.

207.94.105.77 - - [27/Nov/1998:12:13:43 -0700] "GET /members/rapids/v1-6_dist-channel.shtml HTTP/1.1" 200 14329

Sad to say for Fastwater, this session did NOT end up with a visit back to our membership page, followed by a visit to our secure server where this person could stop being "visitor" and graduate to "member."  Maybe next time ... but will we know whether it was the same person?  We'll look at that issue in a moment.
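The timing arithmetic in this walk-through is easy to automate.  The sketch below, in Python, measures the seconds from each page request to the next, using the six page requests quoted above.  Note that it measures from page request to page request, so its figures include download time and run a little longer than our estimates of reading time.  The function name is ours.

```python
from datetime import datetime

# The six page requests from the visit (images and style sheet left out).
page_requests = [
    ("27/Nov/1998:12:09:53 -0700", "/papers/NEPs/"),
    ("27/Nov/1998:12:10:16 -0700", "/"),
    ("27/Nov/1998:12:10:45 -0700", "/membership.htm"),
    ("27/Nov/1998:12:11:21 -0700", "/members/NEPs/index.shtml"),
    ("27/Nov/1998:12:12:04 -0700", "/members/rapids/index.shtml"),
    ("27/Nov/1998:12:13:43 -0700", "/members/rapids/v1-6_dist-channel.shtml"),
]

def seconds_between_requests(requests):
    """Return (page, seconds until the next page request) for all but the last."""
    times = [
        datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z") for stamp, _ in requests
    ]
    return [
        (requests[i][1], int((times[i + 1] - times[i]).total_seconds()))
        for i in range(len(requests) - 1)
    ]

for path, seconds in seconds_between_requests(page_requests):
    print(f"{path}: {seconds} seconds until the next page request")
```

There is no figure for the last page, because the log file has no record of when the visitor stopped reading it.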

Discovering Where Visitors Come From

"Where do visitors to our site come from?"  That is usually one of the questions that anyone with a web business has.  Another is, "How do I get more of them?"  Which is usually followed by: "Where do I find visitors who turn into buyers?"  Our recent case study of Hoovers is a good look at how one company is dealing with such questions.  Log files -- or, more properly speaking, extensions to log files -- are where the data comes from that allows you to begin constructing answers.

The log file entries we have looked at so far only tell us what pages are being requested once a visitor is on our site; they do not show us where the visitor came from.  Most web servers support some kind of extended log format that provides you with "referer" information, showing you the URL the visitor was on when he or she requested your page.  Different brands of server software handle this in different ways; the Apache servers used for the Fastwater website create a separate file of "referer" information.

Let's return to our visitor from remote host 207.94.105.77 who came into our site on the NEPs index page.  It is an interesting case because they clearly did not find us simply by typing in "http://www.fastwater.com/" -- their entry point was a specific page buried deeper in our website.  They were almost certainly following a link from someplace.

The referer log shows the following entry for this visitor's entry on to our site:

http://www.kmworld.com/newestlibrary/1998/november98/theknowlmktplc.cfm -> /papers/NEPs/index.html

The log tells us that this particular visitor came to our site from an article published on the KM World site.  If we follow the link to the KM World page, we find that it is an article that I wrote that contained a reference to the Bay Networks case study.  It makes sense, then, that this visitor finally ended up reading the Rapids article on Bay Networks (though, interestingly, they did not actually read the case study).

Referer logs can tell you more than just where a visitor came from.  If the visitor is coming to you from a search portal site, there is a very good chance that the "referer" URL will give you the search terms they used.  That is because search sites typically encode the search request in the URL sent to the search engine.  That encoded URL is the address of the page of "hits" generated in response to the search request, and so it is what shows up as the referer when the visitor clicks through to your site.

For example, here is yet another visitor to the Bay Networks case study on the Fastwater site -- this time the visitor reached us through a search on AltaVista.

http://www.altavista.com/cgi-bin/query?pg=q&kl=XX&q=%22personalization+and+customization%22 -> /papers/NEPs/BayNetworks.htm

As you can see, the query was for "personalization and customization".  If you are a curious sort, you can actually paste the entire string starting from "http" and ending in "22" into your web browser and recreate the query that this anonymous visitor created before coming to our site.
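Decoding the search phrase does not have to be done by eye.  The following Python sketch uses the standard urllib.parse module to pull it out of the referer URL.  The function name is ours, and the parameter name "q" is simply what appears in this AltaVista URL; other search sites may use other names.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms(referer_url, parameter="q"):
    """Return the decoded search phrase from a referer URL, or None."""
    query = urlsplit(referer_url).query
    values = parse_qs(query).get(parameter)
    return values[0] if values else None

referer = (
    "http://www.altavista.com/cgi-bin/query"
    "?pg=q&kl=XX&q=%22personalization+and+customization%22"
)
print(search_terms(referer))
```

The "%22" codes decode to quotation marks and the plus signs to spaces, which is how we know the visitor searched for the exact phrase.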

The referer information can also be useful for doing the kind of investigation of how users move through the site that we were doing earlier, through inference, with the visitor from host 207.94.105.77.  This is because the log not only shows "referring pages" from outside your site, but also within your site.  So, even if you cannot tie page movements to a particular user (for reasons that we will look at in a moment), you can almost always build a picture of which pages most often lead to which other pages.  This can be quite useful in analyzing general traffic flow on your site.
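As a rough illustration of this kind of traffic flow analysis, the Python sketch below counts how often one page on a site leads to another.  It assumes the "source -> target" layout of the referer entries shown above.  The entries for movement within the site are made up for the example, and the function name is ours.

```python
from collections import Counter
from urllib.parse import urlsplit

SITE_HOST = "www.fastwater.com"  # host name as given in this article

def internal_transitions(referer_lines, site_host=SITE_HOST):
    """Count (from_page, to_page) pairs whose referring page is on our own site."""
    counts = Counter()
    for line in referer_lines:
        source, separator, target = line.partition(" -> ")
        if not separator:
            continue  # skip lines that are not in "source -> target" form
        parts = urlsplit(source.strip())
        if parts.netloc == site_host:
            counts[(parts.path or "/", target.strip())] += 1
    return counts

# Illustrative entries only; the three internal ones are invented.
sample = [
    "http://www.kmworld.com/newestlibrary/1998/november98/theknowlmktplc.cfm"
    " -> /papers/NEPs/index.html",
    "http://www.fastwater.com/ -> /membership.htm",
    "http://www.fastwater.com/ -> /membership.htm",
    "http://www.fastwater.com/membership.htm -> /members/NEPs/index.shtml",
]
for (source, target), count in internal_transitions(sample).most_common():
    print(f"{source} -> {target}: {count}")
```

Referrals from outside the site, such as the KM World entry, are left out of this count; they answer the separate question of where visitors come from.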

There is other information besides referer logs that can be collected as an extension to the usual, "common" log file format information.  Most servers, for example, can tell you the brand and version of the web browser used to access the site.  In addition to the obvious uses that this information has in making sure that the browser can support the pages and code used on your site, it can also help with the tracking of a particular visitor as he or she moves around your site,  a task which, as you are about to see, is not always as easy as our earlier example suggests.

How Do You Get Useful Information From Log Files?

Now you have seen everything that is in log files.  That's it.  There is no more.

We have seen that this information can be used to produce a couple of different kinds of basic information about web site activity.

Isn't there more?  Yes, but only if you are willing to make a few assumptions ...

Sessions and Visits

Our walk through the Fastwater log file also showed us that it is sometimes possible to move beyond the information about individual page requests to a picture of what a particular visitor does in a single session, or "visit," to your site.  In our example we saw that we could make reasonable guesses about the sequence of pages the visitor viewed and about how long he or she spent on each one.

Aggregated over many visitors, this is potentially very valuable information.  As the Hoovers case study illustrates, the ability to follow a visit from initial reference to conclusion allows you to discover which referring sites produce buyers as opposed to "lookers."  It can also enable you to develop characteristic behavior profiles to distinguish between buyers and lookers.

But all of this assumes that you can string together the different log file entries to create a "visit."  If you think about how we analyzed our example from the Fastwater site, you can see that we made two important assumptions to come up with the "visit."

Making these assumptions is necessary because, to use computerspeak, the HTTP protocol used between web browsers and servers is "stateless."  Put more intelligibly, as far as the server is concerned, every request from a client to a server is completely unrelated to every other request.  There is no notion of "session" or "visit" built into HTTP -- so we have to manufacture "visits" by making assumptions.

How good are these assumptions?  Unfortunately, they are often not very good at all, especially for big sites handling many visitors simultaneously.  The biggest problem is with our first assumption that entries tied to a particular remote host represent the activities of one person.  This is a pretty good assumption if the visitor is from a small company with its own web server.  It is also a good bet for visitors using an Internet Service Provider (ISP) that assigns a unique, dynamic IP address to a user for as long as they are continuously connected to the web.  (They might have a different address the next time they log on.)  But it is a very poor assumption if the visitor is coming in from AOL, or from a big corporate site such as IBM.  In these cases the users are coming through a "proxy server" that gives them all the same IP address.

So, if you have 20 people visiting your site from AOL at a given time, their log file entries will all show the same identifier for the remote host.  And if they are all moving around the site in different ways, with paths crossing back and forth, there is no sure way to untangle the log file entries to create distinct paths for the different visitors -- or to even tell how many distinct visitors there were.  Knowing what browser they are using can help a little (maybe the different visitors from AOL are using different makes and versions of browsers -- hey, you could be lucky ...), but clearly is not enough information to really solve the problem.
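To see how little the browser information helps, consider the following Python sketch.  It counts distinct remote hosts, and then distinct combinations of remote host and browser.  The entries, including the proxy address and the browser strings, are made up for illustration, and the function name is ours.

```python
def visitor_estimates(entries):
    """Return (distinct hosts, distinct host-and-browser pairs)."""
    hosts = {host for host, browser in entries}
    pairs = set(entries)
    return len(hosts), len(pairs)

# Made-up entries: every request arrives through the same proxy address.
entries = [
    ("192.0.2.10", "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"),
    ("192.0.2.10", "Mozilla/4.5 [en] (Win98; I)"),
    ("192.0.2.10", "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"),
    ("192.0.2.10", "Mozilla/3.01 (Macintosh; I; PPC)"),
]
by_host, by_host_and_browser = visitor_estimates(entries)
print(by_host, by_host_and_browser)
```

Counting by host alone suggests one visitor; adding the browser raises the count to three.  Both numbers are only lower bounds, since any number of people behind the proxy could be running the same browser.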

For some busy consumer sites we have talked to in our case study work, we find that between 80% and 90% of the log file entries are indistinguishable one from another, and not useful for "path analysis" of the sequence of pages viewed by visitors.

Connecting Across Visits

If it is this hard to tie together log file entries for a single visit, you can probably guess that it becomes even more difficult to create a picture of users over successive visits, over time.  You not only have the difficulty caused by proxy servers, where many different users all get the same remote host ID, but you also encounter problems with other visitors coming in from ISPs that assign dynamic IP addresses.  These users have a unique ID that lasts for as long as they are connected to the web -- but the ID is very probably different if they dial back in and return the following day.

Bottom line:  identifying users across different visits, relying only on the data in a log file, is a long shot for most visitors, and is especially so for "consumer" as opposed to "business" visitors.

Cookies and User Registration

These problems with tracking a user within a single visit, much less across visits, are why many sites want to place a "cookie" on the visitor's machine.  We will save a full discussion of cookies for another article -- for our purposes here you simply need to know that a cookie is a small piece of data that your server asks the visitor's browser to store and to send back with later requests.

Cookies can help you keep track of a particular visitor during a visit, and even across visits if you set cookies that last a while.  They can really add structure to the log file data, making it much more useful.  However, many users have a dim view of cookies, in part because some servers send cookies that can be returned to other servers, enabling a broader view of a user's movement to different sites on the net.  This kind of "tracking" cookie is very different from a cookie that only lasts for one visit.  Unfortunately, many visitors view cookies as all alike, all good or all bad, and so setting even a temporary cookie will be viewed as intrusive by at least a few visitors.
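For the technically inclined, here is a small Python sketch of the difference between a cookie that lasts for one visit and one that lasts across visits.  It uses the standard http.cookies module to build the header a server would send.  The cookie name visitor_id and the function name are our own inventions, and a real site would need to decide how such identifiers are generated and stored.

```python
import uuid
from http.cookies import SimpleCookie

def visitor_cookie(persistent_days=None):
    """Build a Set-Cookie header carrying a random visitor identifier."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = uuid.uuid4().hex  # hypothetical cookie name
    cookie["visitor_id"]["path"] = "/"
    if persistent_days is not None:
        # A lifetime makes the cookie survive after the browser is closed.
        cookie["visitor_id"]["max-age"] = persistent_days * 24 * 60 * 60
    return cookie

print(visitor_cookie().output())    # lasts only for the current visit
print(visitor_cookie(30).output())  # lasts for about a month
```

Without a lifetime, the browser discards the cookie when it is closed, which gives you a "one visit" identifier.  With a lifetime, the same identifier comes back on later visits.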

The obvious, cleanest way to track a visitor's identity is to ask them for it.  How to do this, and when to do this, is a good topic for yet another article.  For our purposes here it is sufficient to say that use of cookies or user registration is necessary if you really want to track visitors and customers across visits to your site.  The log file data, by itself, is simply not detailed enough to tell you much, particularly over time.  You need to create supplementary data, and then you need a way to tie it back to the log file data.

Why Do the Numbers Disagree?

We have talked to a number of businesses that use a couple of different log file analysis tools, from different vendors.  They find, routinely, that the different packages produce different results for important statistics such as the number of visits, times of visits, sources of visitors, and so on.  The differences sometimes lead to the question: "Why should I believe any of these numbers?  They all disagree!"

It's not really that bad.  The preceding discussion of the assumptions required to turn raw, disconnected log file entries into visits shows you where the problem is:  tying log file entries together means designing a heuristic that decides when records with the same remote host ID represent one person and when they represent more than one person.  You can build these heuristics in different ways -- for example, setting different time limits to determine when two log file records with the same remote host ID are probably from one visitor, or possibly from two visitors.  Different vendors develop different heuristics.  The heuristics are their trade secrets, and how they differentiate their products.
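A toy version of such a heuristic makes the point.  The Python sketch below counts a new visit whenever the gap between two requests from the same remote host exceeds a time limit.  It is our own simplified illustration, not any vendor's actual method, and the log entries are made up.

```python
from datetime import datetime, timedelta

def count_visits(entries, timeout_minutes):
    """Count visits; a gap longer than the timeout starts a new visit.

    entries is a list of (remotehost, datetime) pairs.
    """
    timeout = timedelta(minutes=timeout_minutes)
    last_seen = {}
    visits = 0
    for host, when in sorted(entries, key=lambda entry: entry[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[host] = when
    return visits

def at(hour, minute):
    """Shorthand for a time on November 27, 1998."""
    return datetime(1998, 11, 27, hour, minute)

# Made-up entries: one remote host, requests spread over an afternoon.
entries = [
    ("207.94.105.77", at(12, 9)),
    ("207.94.105.77", at(12, 13)),
    ("207.94.105.77", at(12, 35)),
    ("207.94.105.77", at(13, 20)),
]
for limit in (15, 30, 60):
    print(limit, "minute limit:", count_visits(entries, limit), "visits")
```

The same four log entries yield three visits, two visits, or one visit, depending only on the time limit chosen.  None of the three answers is wrong; they simply rest on different assumptions.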

All of this means several things for you, as the buyer and user of such tools:

Do Log Files Scale Up for Large Sites?

If you go shopping for log file analysis tools, and talk to key vendors such as Accrue, Andromedia, Marketwave, Microsoft, and net.Genesis, you will hear a lot of talk about scalability.  This talk will be intermixed with references to databases, data warehouses, network sniffers, and other matters.  The claims and counter claims pile up pretty fast and can be confusing unless you already have a good understanding of the basic issues.

You already know, from the preceding discussion of log file structure and analysis, most of the important facts relating to scalability:

The combination of voluminous data and busy sites can produce some real difficulties.  In our case study research we have encountered companies that cannot handle more than a week's worth of log file data at any one time due to limitations of the database they are using to handle the log file analysis -- pretty disappointing for any company trying to get a picture of its web business.

The problems can go beyond the volume of data for big sites.  The reason is that very active sites will often use different servers, each with its own copy of the web site information, to balance the visitor load across different machines.  What this means, in practice, is that your request for a particular page might be served by one machine, an image on that page by another machine, and your next request for a page by a third machine.   Each of these machines is keeping its own log files.  How can you possibly get a single view of your business when the information is scattered across different log files?

One solution is to write software that "stitches" the log files back together to make a single file.  The idea is that you "merge" the three files, looking at the times of the entries, to create a single, master file.  The problem with doing this on a busy site is that log file entries only record times accurate to whole seconds.  There are many transactions happening across the servers during each second.  Consequently, there is no clearly correct way to do the merging of the files and to get sequences absolutely correct, even if you assume that the clocks on the different servers are perfectly synchronized.
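The merging problem can be seen in a few lines of Python.  The sketch below merges two logs by timestamp using the standard heapq.merge function.  The split of entries across two servers is hypothetical, and the function names are ours.

```python
import heapq
from datetime import datetime

def timestamp(line):
    """Pull the whole-second timestamp out of a common log format entry."""
    start = line.index("[") + 1
    end = line.index("]")
    return datetime.strptime(line[start:end], "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*logs):
    """Merge logs that are each already in time order into one list."""
    return list(heapq.merge(*logs, key=timestamp))

# Illustrative entries, as if two load-balanced servers had handled one visit.
server_a = [
    '207.94.105.77 - - [27/Nov/1998:12:10:16 -0700] "GET / HTTP/1.1" 200 10120',
    '207.94.105.77 - - [27/Nov/1998:12:10:45 -0700] "GET /membership.htm HTTP/1.1" 200 6465',
]
server_b = [
    '207.94.105.77 - - [27/Nov/1998:12:10:16 -0700] "GET /images/bullet7.gif HTTP/1.1" 200 140',
]

for line in merge_logs(server_a, server_b):
    print(line)
```

Two of the entries carry the same timestamp, so their order in the merged file depends only on which log is listed first.  Calling merge_logs(server_b, server_a) puts the image request ahead of the page request, and nothing in the data says which order is right.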

An alternate solution offered by a number of vendors is to forget about the log files entirely and to place a "sniffer" on the network, downstream from all the servers, that looks at identifying information associated with each packet passing between the combined servers and the various browser clients making requests.  This solution also has the advantage of freeing your web server hardware from the task of continuously writing to the log files, potentially improving their performance as web servers.

There is much more to say about the problems encountered in collecting and analyzing data on large sites, and about the different solutions offered by different vendors.  We will try to cover most of these topics in subsequent issues of Fastwater Rapids.  For the moment, we simply want to make the point that the potential for running into scaling problems in using log files on big, active sites is real.  This means two things to you as a buyer if you have a big site:

The Bigger Picture: Log Files and Other Data

Typically, business questions go beyond the information derived from log file analysis.  Log files, by themselves, can tell you how many people visited which pages, and the sites that these people were on before coming to your site.  That can be interesting, but it is rarely the whole story.  Broader business questions require that you tie your log file data into other kinds of data, such as sales databases, customer databases, or broader population survey data.  In other words, for most business questions, the log files are not sufficient -- you want to tie them into other systems.

This isn't surprising, when you think about it.  The log files are simply a record of web site activity, with weak connections to real prospects and customers.  So they need to be supplemented with other database information, about prospects, customers, and operations, so that you can relate your web site activity to the rest of your business.

What is surprising is that, today, most log file analysis tools do not showcase such integration.  Even if you ask about it, it is rare that the vendor shows you an interface that allows your own development team to write some Perl code to tie systems together.  Typically, making the connection between log file analysis and the rest of your business is handled as a separate integration effort, sometimes with substantial vendor involvement.  It is almost as if the vendors thought that the log files would be interesting in themselves, and have only added connections to other data sources as an afterthought.

We are sure this primitive state of affairs will change, and quickly, but you should be sure to define your integration requirements up front, and to think of your log file analysis as part of a bigger picture.  Most vendor demos won't help you with this.   You will typically have to develop the bigger picture on your  own.  Which is a good segue to our last issue ...

Getting the Most from Log Files

Let's begin by summarizing what we have said, so far,  about log files and log file analysis: This list moves from "promise" to "problem."  In our case study research we see web businesses following the same path -- from the promise of real insight to the problem of how to get any of it working.

The single most important piece of advice that we can give to a web business embarking on use of its log files is to decide, up front, what business questions need to be answered.  Make a list of the three or four key metrics that you need to monitor to ensure that your business is developing according to plan.  Then build a system that gets you that information -- and only that information.

What about "what if" questions?  What about "new insights?"  Remember two things:

  1. Log files are voluminous.  If you don't have a lot of traffic, you can probably afford to keep everything around for "what if" questions.  But if you have a busy web site, your goal should be to reduce your routine data collection and analysis to what you need to monitor your business efficiently.
  2. If you have designed your core metrics right, they should show you if the business is acting in a surprising way -- whether it is good news or bad news.  When your core indicator metrics show that something interesting or important might be going on, then devise an experiment.  Start a special, limited time effort to collect and analyze log file data to shed light on your new questions.  Yes, do your "drill-down" and "what-if" analysis -- but do it only when your core, indicator metrics show you that you have something to look at.
Maybe the biggest surprise, after all, is not that business managers are unfamiliar with the details of log files.  Perhaps the real surprise, instead, is that they collect log file data without a clear idea of what they want to do with it.  It is not the log file data that is valuable, but the business questions that allow you to get good use out of the log file data.  Keep that thought firmly in mind, mix it in with the basic knowledge of what log files are all about that you have gathered here, and you will be among the small but growing number of businesses that are getting real benefit from the information they collect on their web sites.

  Do you have questions about log files that we did not answer here? Are you having difficulty getting what you want from your site data? Feel free to contact Bill Zoellick or other Fastwater partners.
 

