NAME

httpsum - Summarize Apache log files


SYNOPSIS

httpsum -d /path/to/logfiles [OPTIONS] [LOG_FILES...]


DESCRIPTION

httpsum analyzes log files and gives you just the results you care about. Many log file analyzers exist, but few let you filter and aggregate down to the page hits for just the pages you care about. httpsum addresses that by printing a summary table of the results after the configured filters and transformations have been applied.


OPTIONS

-d PATH

Specifies the directory to look in for logfiles.

-S suffix

Specifies the filename suffix that log files must have. This is useful for selecting the log entries for a specific date when the logs are rotated daily. For example:

  httpsum -S .`date -d yesterday +%Y-%m-%d`

-c XML_CONFIG_FILE

This is the fundamental configuration file used to decide which sites are analyzed, how the logs are interpreted, and so on. See the XML_CONFIG_FILE section below for complete details on the format of this file.

If not specified then the default $HOME/.httpsum/config.xml file will be used.

-s SITE

Assume all the logs read are for a single SITE. Normally the XML_CONFIG_FILE can identify multiple sites to report information about, but this option allows the output to be limited to a single site.

-I INCLUDE_PATH

Specifies an optional include path to use when using the <include...> directive of the XML_CONFIG_FILE.

--debug

Prints extra-verbose debugging output about exactly what httpsum is doing.

-D

Dumps the per-site configuration that is being used to analyze the log files for each site. The output shows the global and site-specific options as finally combined.


XML_CONFIG_FILE

The XML_CONFIG_FILE is a configuration file that dictates how reporting should be done and for what sites. If no file is specified via the command line then httpsum will look in ~/.httpsum.xml.

The contents of the file will take the following high-level format:

  <httpsum>
    <global>
      <!-- Global options that apply to all sites -->
    </global>
    <sites>
      <!-- Sites to analyze and site-specific options -->
    </sites>
  </httpsum>

Any directive below can appear in either the site-specific section or the global section. Global options apply to every site, but site-specific options apply only to that individual site.
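
As a rough illustration of the two scopes, the sketch below puts one directive in the global section and another in a single site. The site names are placeholders, and the patterns are only examples.

  <httpsum>
    <global>
      <!-- applies to every site listed below -->
      <ignorehost>127.0.0.1</ignorehost>
    </global>
    <sites>
      <site name="www.example.com">
        <!-- applies to www.example.com only -->
        <ignorefile>\.css$</ignorefile>
      </site>
      <site name="blog.example.com">
        <!-- no site-specific options; only the global ones apply -->
      </site>
    </sites>
  </httpsum>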

<file>FILES</file>

Specifies the files to read, possibly with wildcard matching. When the directive is placed in the global section, the special %{site} keyword can be used to insert the site name into the file pattern.

Example:

  <file>/var/log/httpd/%{site}/access.log.*</file>

Note that this directive is ignored if log files are specified on the command line instead.

<ignorehost>HOST</ignorehost>

Ignores requests from hosts with a particular address. For example,

  <ignorehost>127.0.0.1</ignorehost>

will skip log file lines generated by requests from the localhost.

<ignorereferrer>REFERER</ignorereferrer>

To ignore accesses that were referred from a particular location, use this directive. This is handy for analyzing only incoming requests that came from a remote or bookmarked location, for example. Ignoring the site name itself ensures that only the first incoming request of each visit is examined.
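
A minimal sketch of the self-referral case described above, using the site from the EXAMPLE section. It assumes the value is matched against the referer string of each request; whether the match is a plain substring or a regular expression is not stated here, so check the behavior against your own logs.

  <site name="capturedonearth.com">
    <ignorereferrer>capturedonearth.com</ignorereferrer>
  </site>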

<ignorefile>FILE</ignorefile>

Ignores requests to FILE (really a path component). This is useful for ignoring common files that provide no useful data, like CSS files or image directories. The FILE specifier is actually a regular expression, so expressions like "\.css$" and "^/2010/.*/foo$" are valid.
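
As a sketch, the two expressions above could be used as follows. The third pattern is an illustrative assumption for an image directory and is not taken from any distributed file.

  <ignorefile>\.css$</ignorefile>
  <ignorefile>^/2010/.*/foo$</ignorefile>
  <ignorefile>^/images/</ignorefile>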

<agent name="MATCH_REGEXP" bot="1|0">NAME</agent>

If a given agent name matches the MATCH_REGEXP regular expression, it is translated into NAME when analyzed. This is most useful when the bot attribute is set to 1: hits from the web-crawling bot are then not counted as normal hits and are simply summarized in the bot-specific output section.
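
For instance, the following sketch groups any agent string containing "Googlebot" under the name "Google" and counts it as a bot. The pattern and the replacement name are illustrative; the agents.xml include file distributed with httpsum may already define its own entries for common crawlers.

  <agent name=".*Googlebot.*" bot="1">Google</agent>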

<transformfile name="REGEXP">REPLACEMENT</transformfile>

Replaces a URL with an alternative version. This is designed to make longer URL strings easier to read.

For example, a complex gallery2 URL doesn't make much sense at a quick glance. The following line transforms such a URL into a much easier to read "Image: NUMBER" line:

    <transformfile name=".*core.Download.*g2_itemId=(\d+).*">Image: $1</transformfile>

<transformreferer name="REGEXP">REPLACEMENT</transformreferer>

Similar to transformfile, but applies to referer strings.

For example,

    <transformreferer name=".*facebook.*share.*">facebook: share</transformreferer>

This translates any "share" item from a Facebook referer into a simple "facebook: share" string, so that you simply receive a count of how many times this page was hit by someone "sharing" it on Facebook.

<include src="FILE" />

This includes the contents of another file into the one currently being processed. FILE may be a file name or a path name. If it can't be found directly, the following search paths are checked:

  .
  -I switch paths if given
  $HOME/.httpsum
  /usr/share/httpsum/include-modules
  /usr/local/share/httpsum/include-modules

Some include files that may be of interest are distributed with the httpsum application; see further below for details.

Note that this is an easy-to-use include statement that is not fully XML-legal. If you wish to use an XML-legal syntax, please see the use of XML "entities" in the XML language documentation.
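
A sketch of the XML-legal alternative, assuming the XML parser used by httpsum resolves external entities (this is not confirmed here). The entity name "agents" is arbitrary, and the system identifier is resolved by the XML parser itself, so the include search paths listed above would not apply.

  <?xml version="1.0"?>
  <!DOCTYPE httpsum [
    <!ENTITY agents SYSTEM "agents.xml">
  ]>
  <httpsum>
    <global>
      &agents;
    </global>
    <sites>
      <!-- Sites to analyze and site-specific options -->
    </sites>
  </httpsum>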


HTTPSUM DISTRIBUTED INCLUDE FILES

The following include files are distributed with httpsum:

agents.xml

A file containing many of the common bot/web-crawlers. It is highly recommended you include this file in the <global> section of your configuration.

transforms.xml

A list of transformations converting complex URLs into easy-to-read outputs. For example, search engines are converted from their full URL to strings like "engine: word1+word2...".

type-wordpress.xml

Contains useful exclude patterns for use with WordPress sites.

type-gallery2.xml

Contains useful exclude patterns for use with gallery2 sites. Also transforms certain URL patterns into easy-to-read results like "Image: NUMBER".


EXAMPLE

Consider the following configuration file:

  <httpsum>
    <global>
      <include src="agents.xml" />
    </global>
    <sites>
      <site name="capturedonearth.com">
        <include src="type-gallery2.xml" />
      </site>
    </sites>
  </httpsum>

Running httpsum with this configuration on the log files from the http://capturedonearth.com/ website produces output like the following:

 # httpsum -c config.xml capturedonearth.com/access.log.2010-05-1*
 ----- capturedonearth.com -----
 Bot hits:
      8 Ask Jeeves
      3 SurveyBot
    548 dotnetdotcom
     21 Twiceler
   1663 MSN
   ...
 Hits:
      1 Item: 6379
      1 core: DownloadItem - 6831
        1 Item: 81
      1 core: DownloadItem - 6647
        1 Item: 8545
      ...
     28 Item: 8545
        1 http://twitter.com/
        1 Item: 8549
        1 Item: 8545
        1 http://twitter.com/NevadaWolf/geocachers
        6 http://touch.facebook.com/
     62 slideshow: DownloadPicLens
        1 Item: 7128
        1 slideshow: Slideshow - 35
        ...
        2 slideshow: Slideshow - 6638
        3 slideshow: Slideshow - 6363
        8 http://capturedonearth.com/main.php
       10 Item: 8545
       13 Item: 8549

The indented lines are the referring sites. For example, picture number 8545 was referred to 6 times by http://touch.facebook.com/.


AUTHOR

Wes Hardaker < hardaker AT users DT sourceforge TOD net >


COPYRIGHT AND LICENSE

Copyright 2009-2010, Wes Hardaker. All rights reserved.

httpsum is free software; you can redistribute it and/or modify it under the same terms as Perl itself.