Log File Analysis at the Command Line

A collection of recipes for various log file analysis tasks like getting requests per day, Google bot visits, and more.

Elias Dabbas


July 8, 2024

Getting the number of requests per day

The Plan

Every line in a log file contains information about a single request. Almost always we have the date and time of that request. We will create a pattern to match and extract the day, month, and year, and then count their values, giving us the number of requests our server received per day.

The Solution

requests_per_day() { egrep -o '[0-9][0-9]/[A-Z][a-z][a-z]/[0-9]{4}' "$1" | sort | uniq -c; }

# Usage example:
requests_per_day /path/to/logfile.log

11382 12/Jun/2024
12270 13/Jun/2024
14898 14/Jun/2024
12497 15/Jun/2024
11420 16/Jun/2024
10744 17/Jun/2024
12080 18/Jun/2024
12256 19/Jun/2024
10929 20/Jun/2024
10820 21/Jun/2024
10967 22/Jun/2024
10794 23/Jun/2024
10703 24/Jun/2024
15211 25/Jun/2024
11569 26/Jun/2024

Windows (translated by ChatGPT):

function Get-RequestsPerDay {
    param (
        [string]$filePath
    )

    Select-String -Pattern '[0-9][0-9]/[A-Z][a-z][a-z]/[0-9]{4}' -Path $filePath |
    ForEach-Object { $_.Matches.Value } |
    Sort-Object |
    Group-Object |
    ForEach-Object {
        [PSCustomObject]@{
            Count = $_.Count
            Date  = $_.Name
        }
    }
}

# Usage example:
Get-RequestsPerDay -filePath "C:\path\to\your\logfile.log"

Explanation (*NIX systems)

We match the pattern of two digits, followed by a forward slash, a capital letter, two lower-case letters, another forward slash, and then four digits. The -o flag extracts only the matched part of each line. After that we sort the extracted dates and count the unique values with uniq -c.
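If you want the busiest days listed first, the same pipeline can be extended with a reverse numeric sort. This is a small variation on the recipe, not part of the original function; the name requests_per_day_sorted is ours:

```shell
# Variation: same extraction pipeline, but with the days sorted by
# request count (highest first) via a final reverse numeric sort.
requests_per_day_sorted() {
  grep -Eo '[0-9][0-9]/[A-Z][a-z][a-z]/[0-9]{4}' "$1" | sort | uniq -c | sort -rn
}
```

grep -E is the portable spelling of egrep; both work here.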

See also

  • This recipe works for dates in the format DD/Mon/YYYY:H:M:S. For other date formats this pattern needs to be modified.
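As an example of such a modification, here is a hypothetical variant for logs that use ISO-style dates (e.g. "2024-06-12 10:00:01"); only the pattern changes, the rest of the pipeline stays the same:

```shell
# Hypothetical variant for ISO-style dates (YYYY-MM-DD) instead of DD/Mon/YYYY.
requests_per_day_iso() {
  grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$1" | sort | uniq -c
}
```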

Getting and resolving Google IPs

The Plan

  • Filter all lines that contain “Googlebot”, and extract the IP addresses of those lines.
  • Remove duplicated IPs.
  • Run the host command on each of those IPs and save the result to a new file e.g. google_hosts.txt.
  • Create a function that runs through all these steps.

The Solution

get_google_hosts() {
  google_ips=$(egrep -i 'Googlebot' "$1" | cut -f 1 -d ' ' | sort | uniq)
  for ip in $google_ips; do
    host $ip >> google_hosts.txt
  done
}

 1. Define/name the function.
 2. Take the first field of the lines that contain “Googlebot” (case-insensitive) from the file of choice, then sort and de-duplicate.
 3. Run the host command on each IP, appending the results to a new file, google_hosts.txt.

Sample output google_hosts.txt

domain name pointer crawl-66-249-74-13.googlebot.com.
domain name pointer crawl-66-249-74-14.googlebot.com.
domain name pointer crawl-66-249-74-15.googlebot.com.
domain name pointer crawl-66-249-66-32.googlebot.com.
domain name pointer crawl-66-249-66-33.googlebot.com.
domain name pointer crawl-66-249-66-34.googlebot.com.
domain name pointer crawl-66-249-66-35.googlebot.com.
domain name pointer crawl-66-249-66-36.googlebot.com.
domain name pointer crawl-66-249-66-37.googlebot.com.
domain name pointer crawl-66-249-66-38.googlebot.com.
domain name pointer crawl-66-249-66-39.googlebot.com.


The $1 corresponds to the first function argument provided. Here you need to provide the path to your log file. Using the cut command with a space as the delimiter, we extract the first field of all those filtered lines. This is the field containing the client IP address (which does not contain spaces). It’s important to note that we append to the google_hosts.txt file with >> instead of >, which would overwrite the file on every iteration of the loop.
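Once google_hosts.txt exists, a natural follow-up question is how many distinct Googlebot hostnames were resolved. A minimal sketch, assuming the output format shown above (the function name count_googlebot_hosts is ours, not part of the original recipe):

```shell
# Count the distinct crawl-*.googlebot.com hostnames in a host-output file.
count_googlebot_hosts() {
  grep -Eo 'crawl-[0-9-]+\.googlebot\.com' "$1" | sort -u | wc -l
}

# Usage example:
# count_googlebot_hosts google_hosts.txt
```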

See also

The main pattern used to identify Google bot requests is “Googlebot”. This works in many cases, but not all Google crawlers include this exact string in their user agents; check out the complete listing of Google crawlers and bots for a more comprehensive match.
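For instance, a slightly broader filter could match a few more crawler names from that listing. This is still not exhaustive, just an illustration of extending the pattern (the function name is ours; the crawler names below are taken from Google's published crawler documentation):

```shell
# Match lines from several Google crawlers, not just "Googlebot".
# Extend the alternation with more names from Google's crawler list as needed.
get_google_crawler_lines() {
  grep -Ei 'Googlebot|AdsBot-Google|Google-InspectionTool|Storebot-Google' "$1"
}
```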