Andrew Abok

Log File Analysis Research

Introduction

Formal specifications provide a clear framework for defining software behavior, yet connecting these specifications to actual programs in an automated and practical manner remains a significant challenge, especially in languages without well-defined semantics (Andrews, 1998). This project leverages log file analysis as a practical approach to bridge this gap, enabling the extraction of meaningful insights from program observables to improve software reliability and debugging.

Logs, which capture system run-time information, provide valuable insights into system behavior and issues. Traditional manual methods, like keyword searches and rule-based matching, struggle with the growing volume and complexity of modern systems. Leveraging LLMs and NLP enables efficient and accurate extraction of insights, reducing reliance on manual effort.

Challenges in Log Analysis for Modern Systems

  1. Complex System Behaviors: Modern systems are large-scale and parallel, making it difficult for developers—who often work on sub-components—to fully understand overall system behavior. This incomplete understanding complicates issue identification from logs (He et al., 2016).

  2. Sheer Volume of Logs: Systems generate massive amounts of logs (e.g., 50 GB/hour), making manual analysis impractical. Tools like search and grep are insufficient to filter key information from noise (He et al., 2016).

  3. Ineffective Traditional Methods: Fault-tolerant mechanisms, such as redundant tasks and speculative executions, render keyword-based searches ineffective. This leads to false positives, increasing the effort required for manual inspection (He et al., 2016).

Automated Log Analysis

The process of log analysis for anomaly detection involves four main steps: log collection, log parsing, feature extraction, and anomaly detection.

Overall Framework

Log Reading

The first step in log analysis is to read and parse raw log files into structured data. This involves extracting key components such as date, time, log level, and log message from each log line.

1.1. Key Components

Each log line is parsed into the following components:

  1. Date: The date when the log entry was generated.

  2. Time: The time when the log entry was generated.

  3. Level: The severity level of the log (e.g., INFO, ERROR, WARN).

  4. Log Message: The actual content of the log.

1.2. Log Parsing Process

The parsing process involves the following steps:

Pattern Matching:

  • Log files are parsed using regex patterns to extract the four key components.

  • Default patterns are predefined for common log formats.

  • If the default patterns fail, an LLM agent is used to generate a custom regex pattern based on the first line of the log.

Line-by-Line Processing:

Log files are processed line by line. Multi-line log entries are concatenated into a single log message.

1.3. Pseudocode
1. Initialize LogProcessor with default patterns.
2. For each log file:
   a. Read the first line to determine the log format.
   b. If the default pattern matches:
      - Parse the log line into date, time, level, and message.
   c. Else:
      - Use an LLM agent to generate a custom regex pattern.
      - Parse the log line using the custom pattern.
3. Return structured log data.
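
To make the reading step concrete, here is a minimal runnable Python sketch of the default-pattern path. The regex and the field names (date, time, level, message) are assumptions for illustration, modelled on the sample log format shown in the example of Section 2 ("2024-06-30 00:02:06.899 [INF] ..."); the LLM fallback is left as a comment since its implementation is outside the scope of this pseudocode.

import re

# Assumed default pattern for illustration; real deployments would keep a
# list of such defaults and fall back to an LLM-generated pattern.
DEFAULT_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\[(?P<level>\w+)\]\s+"
    r"(?P<message>.*)$"
)

def parse_log_file(path, pattern=DEFAULT_PATTERN):
    """Read a log file line by line and return structured entries."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            match = pattern.match(line.rstrip("\n"))
            if match:
                entries.append(match.groupdict())
            elif entries:
                # Continuation of a multi-line entry: append to the previous message.
                entries[-1]["message"] += " " + line.strip()
            # else: the default pattern failed on the first line; this is where
            # an LLM agent would be asked to generate a custom regex.
    return entries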

Log Parsing

Logs are unstructured and contain free-form text. The purpose of log parsing is to extract a set of event templates with which raw logs can be structured. More specifically, each log message can be parsed into an event template (constant part) with some specific parameters (variable part).

2.1. Constant vs. Variable Parts
  • Constant Parts: Static text that remains the same across log entries.
  • Variable Parts: Dynamic values that change between log entries.

Let's look at an example using the Innova market stats API log file:

  • Raw Log: "2024-06-30 00:02:06.899 [INF] NSEEquities will run again after 10 minutes"
  • Parsed Template: "* [INF] NSEEquities will run again after * minutes"

2.2. Log Parsing Methods

There are two main approaches to log parsing:

A. Heuristic-Based Parsing
  • How It Works:
    1. Counts word frequencies at each position in the log.
    2. Identifies frequent words as constant parts.
    3. Replaces variable parts with *.

Pseudocode:

def heuristicBasedParsing(logs):
    wordCounts = countWordFrequencies(logs)
    constantParts = identifyConstantParts(wordCounts)
    templates = [replaceVariableParts(log, constantParts) for log in logs]
    return templates
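
The pseudocode above can be turned into a small runnable sketch. The 0.9 frequency threshold below is an assumption (the document does not prescribe one): a word is treated as a constant part if it dominates its position across the logs, otherwise it is replaced with *.

from collections import Counter

def heuristic_based_parsing(logs, threshold=0.9):
    """Positional word-frequency heuristic (a sketch; the threshold is assumed)."""
    tokenised = [log.split() for log in logs]
    max_len = max(len(tokens) for tokens in tokenised)

    # Count how often each word appears at each position.
    position_counts = [Counter() for _ in range(max_len)]
    for tokens in tokenised:
        for i, word in enumerate(tokens):
            position_counts[i][word] += 1

    templates = []
    for tokens in tokenised:
        template = []
        for i, word in enumerate(tokens):
            share = position_counts[i][word] / sum(position_counts[i].values())
            template.append(word if share >= threshold else '*')  # constant vs. variable
        templates.append(' '.join(template))
    return templates
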
B. Clustering-Based Parsing
  • How It Works:
    1. Groups similar logs into clusters using k-means.
    2. Generates templates by identifying constant and variable parts within each cluster.

Pseudocode:

def clusteringBasedParsing(logs, n_clusters):
    logVectors = vectorizeLogs(logs)
    clusters = clusterLogs(logVectors, n_clusters)
    templates = [generateTemplateFromCluster(cluster) for cluster in clusters]
    return templates
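
A runnable sketch of this approach, using TF-IDF vectors and scikit-learn's KMeans, is shown below. The choice of TF-IDF as the vectorizer and the k-means settings are assumptions; the per-cluster template routine mirrors the positional comparison described in Section 2.3.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def generate_template(cluster_logs):
    # Mirrors Section 2.3: identical columns are constant parts,
    # differing columns become '*'.
    words = [log.split() for log in cluster_logs]
    min_length = min(len(line) for line in words)
    template = []
    for i in range(min_length):
        column = [line[i] for line in words]
        template.append(column[0] if len(set(column)) == 1 else '*')
    return ' '.join(template)

def clustering_based_parsing(logs, n_clusters):
    """Sketch of clustering-based parsing: TF-IDF vectors + k-means (settings assumed)."""
    vectors = TfidfVectorizer().fit_transform(logs)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    clusters = {}
    for log, label in zip(logs, labels):
        clusters.setdefault(label, []).append(log)
    return [generate_template(group) for group in clusters.values()]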

2.3. Template Generation

Templates are generated by comparing words at the same position across log lines:

  • If all words at a position are the same, it’s a constant part.
  • If words differ, it’s a variable part (replaced with *).

Pseudocode:

def generateTemplateFromCluster(clusterLogs):
    words = [log.split() for log in clusterLogs]
    minLength = min(len(line) for line in words)
    template = []

    for i in range(minLength):
        columnWords = [line[i] for line in words]
        if len(set(columnWords)) == 1:  # all words at this position match
            template.append(columnWords[0])  # Constant part
        else:
            template.append('*')  # Variable part
    return ' '.join(template)

2.4. Example

Input Logs:

2024-06-30 00:02:06.899 [INF] NSEEquities will run again after 10 minutes
2024-06-30 00:02:14.118 [INF] GSEEquities Importer started!
2024-06-30 00:02:15.138 [INF] LUSEEquities Importer started!
2024-06-30 00:02:20.467 [INF] LUSEEquities [match] 12444: FW: ZCCM-IH NOTICE OF DIVIDEND: 1 attachment(s)
2024-06-30 00:02:21.717 [INF] LUSEEquities [match] 12448: FW: WEEKLY REPORT - 28.06.2024: 1 attachment(s)
2024-06-30 00:02:21.724 [INF] LUSEEquities will run again after 13.29 minutes

Parsed Templates:

1. * [INF] NSEEquities will run again after * minutes
2. * [INF] GSEEquities Importer started!
3. * [INF] LUSEEquities Importer started!
4. * [INF] LUSEEquities [match] *: *: *: * attachment(s)

Feature Extraction

After parsing logs into individual events, they are grouped into sequences using techniques like fixed, sliding, or session windows. Each sequence is then converted into a numerical feature vector, representing the frequency of specific events. These vectors are combined to form a feature matrix, which serves as input for machine learning models. This structured approach enables effective analysis and pattern detection in log data.

The goal of feature extraction is to transform parsed log events into numerical features that can be used as input for anomaly detection models. This process involves grouping logs into sequences and generating an event count matrix.

3.1. Windowing Techniques

Logs are grouped into sequences using windowing techniques, where each sequence represents a finite chunk of log data. This can be achieved in three ways (a minimal sketch of all three follows the list):

  • A. Fixed Window - Divides logs into non-overlapping chunks based on a fixed time interval (e.g., 1 hour or 1 day).

  • B. Sliding Window - Divides logs into overlapping chunks based on a window size and step size (e.g., hourly windows sliding every 5 minutes).

  • C. Session Window - Groups logs based on identifiers (e.g., block_id in HDFS logs) rather than timestamps.
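
Below is a minimal sketch of the three grouping strategies. It assumes each parsed entry carries a datetime "timestamp" field (and, for session windows, an identifier field such as block_id); the field names and default window sizes are illustrative only.

from datetime import timedelta

def fixed_windows(entries, window=timedelta(hours=1)):
    """Non-overlapping windows: bucket entries by fixed time interval."""
    entries = sorted(entries, key=lambda e: e["timestamp"])
    start = entries[0]["timestamp"]
    buckets = {}
    for entry in entries:
        index = int((entry["timestamp"] - start) / window)  # which fixed interval
        buckets.setdefault(index, []).append(entry)
    return [buckets[i] for i in sorted(buckets)]

def sliding_windows(entries, window=timedelta(hours=1), step=timedelta(minutes=5)):
    """Overlapping windows: each window starts `step` after the previous one."""
    entries = sorted(entries, key=lambda e: e["timestamp"])
    start, end = entries[0]["timestamp"], entries[-1]["timestamp"]
    windows = []
    while start <= end:
        stop = start + window
        windows.append([e for e in entries if start <= e["timestamp"] < stop])
        start += step
    return windows

def session_windows(entries, key="block_id"):
    """Session windows: group by an identifier field rather than by time."""
    sessions = {}
    for entry in entries:
        sessions.setdefault(entry[key], []).append(entry)
    return list(sessions.values())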

3.2. Event Count Matrix

After grouping logs into sequences, an event count matrix is generated. Each row in the matrix represents a log sequence, and each column represents a log event. The value at position (i, j) indicates how many times event j occurred in sequence i.

Let:

  • n = number of log sequences,
  • m = number of unique events,
  • X = event count matrix of size n × m.

Pseudocode:

def generateEventCountMatrix(logSequences, eventTemplates):
    eventCountMatrix = []
    for sequence in logSequences:
        eventCount = [sequence.count(event) for event in eventTemplates]
        eventCountMatrix.append(eventCount)
    return eventCountMatrix
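
As a brief worked illustration (the sequences and templates here are hypothetical), two sequences over three event templates would produce a 2 × 3 matrix:

logSequences = [["E1", "E2", "E1"], ["E2", "E3"]]
eventTemplates = ["E1", "E2", "E3"]
generateEventCountMatrix(logSequences, eventTemplates)
# -> [[2, 1, 0],
#     [0, 1, 1]]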

Unsupervised Anomaly Detection Using PCA

Principal Component Analysis (PCA) is used for unsupervised anomaly detection. PCA reduces the dimensionality of the event count matrix and identifies anomalies based on deviations from normal behavior.

4.1. PCA Overview

PCA is a statistical method that projects high-dimensional data onto a lower-dimensional space while preserving the maximum variance. The methodology underpinning the dimensionality reduction is as follows:

  1. Dimensionality Reduction:
    • Project the event count matrix X onto a lower-dimensional space using the first k principal components.
    • The principal components are the eigenvectors of the covariance matrix of X.

  2. Anomaly Detection: Anomalies are identified by measuring the Squared Prediction Error (SPE) in the residual space.
4.2. Mathematical Framework

Step 1: PCA Transformation

  1. Compute the covariance matrix C of X:

     C = (1/n) XᵀX

  2. Perform eigenvalue decomposition on C:

     C = V Λ Vᵀ

     Where:

       • V = matrix of eigenvectors (principal components)
       • Λ = diagonal matrix of eigenvalues

  3. Select the first k eigenvectors V_k = [v_1, v_2, ..., v_k] that capture 95% of the variance.

  4. Project X onto the principal component space:

     Y = X V_k

Step 2: Anomaly Detection

  1. Compute the projection matrix onto the residual space, S_a:

     S_a = I − V_k V_kᵀ

  2. Project X onto the residual space:

     y_a = X S_a

  3. Calculate the Squared Prediction Error (SPE):

     SPE = ||y_a||²

  4. Classify anomalies:
    • If SPE > Q_α, the sequence is an anomaly.
    • Q_α is the threshold calculated using the chi-squared distribution: Q_α = χ²_{1−α}(m − k), where:
      • α is the significance level,
      • m is the original number of features,
      • k is the number of principal components.

4.3. Pseudocode

  1. Initialize the PCA model with a variance threshold.
  2. Fit the PCA model to the event count matrix:
    • Compute the principal components.
    • Project the data onto the residual space.
  3. Calculate the Squared Prediction Error (SPE) for each sequence:
    • Compute SPE = ||y_a||², where y_a is the projection onto the residual space.
  4. Classify anomalies:
    • If SPE > Q_α, mark the sequence as anomalous.
    • Q_α is calculated using the chi-squared distribution.
  5. Return SPE scores and anomaly labels.
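
For illustration, a minimal NumPy/SciPy sketch of the above procedure follows. The 95% variance threshold and α = 0.001 are assumed defaults, mean-centering of X is added here, and the plain chi-squared threshold follows the simplified formulation of this section rather than being the only possible choice.

import numpy as np
from scipy.stats import chi2

def pca_anomaly_detection(X, variance_threshold=0.95, alpha=0.001):
    """Sketch of PCA-based anomaly detection via the Squared Prediction Error."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    X_centered = X - X.mean(axis=0)

    # Covariance matrix C = (1/n) XᵀX and its eigendecomposition C = V Λ Vᵀ.
    C = X_centered.T @ X_centered / n
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the first k components that capture `variance_threshold` of the variance.
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, variance_threshold)) + 1
    k = max(1, min(k, m - 1))                    # keep at least one residual dimension
    V_k = eigvecs[:, :k]

    # Residual-space projection y_a = x (I − V_k V_kᵀ) and SPE = ||y_a||².
    residual = X_centered - X_centered @ V_k @ V_k.T
    spe = np.sum(residual ** 2, axis=1)

    # Threshold Q_α from the chi-squared distribution with m − k degrees of freedom.
    q_alpha = chi2.ppf(1 - alpha, df=m - k)
    return spe, spe > q_alpha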

Error Clustering

After parsing the logs, the next step is to cluster similar error messages to identify common patterns and reduce noise. This is achieved using TF-IDF vectorization and K-Means clustering.

5.1. Clustering Process
  1. TF-IDF Vectorization: Convert error messages into numerical vectors based on word frequency. Rare words are given higher importance, while common words are down-weighted.

  2. K-Means Clustering: Group the resulting vectors into clusters; each cluster represents a group of similar error messages.

  3. Representative Errors: For each cluster, the error message closest to the cluster centroid is selected as the representative error. Representative errors summarize the common patterns in each cluster.

5.2. Pseudocode
1. Initialize LogErrorClusterer with the number of clusters.
2. Extract error messages from parsed logs.
3. Convert error messages into TF-IDF vectors.
4. Apply K-Means clustering to group similar errors.
5. For each cluster:
   a. Calculate the centroid.
   b. Select the error message closest to the centroid as the representative.
6. Return clusters and representative errors.
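
A runnable sketch of this step using scikit-learn is shown below; the number of clusters and the use of Euclidean distance to the centroid for selecting representatives are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_errors(error_messages, n_clusters=5):
    """Sketch: TF-IDF + k-means clustering with a representative error per cluster."""
    vectors = TfidfVectorizer().fit_transform(error_messages)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(vectors)

    dense = vectors.toarray()
    clusters, representatives = {}, {}
    for label in range(n_clusters):
        members = np.where(labels == label)[0]
        clusters[label] = [error_messages[i] for i in members]
        # Representative error = cluster member closest to the centroid.
        distances = np.linalg.norm(dense[members] - kmeans.cluster_centers_[label], axis=1)
        representatives[label] = error_messages[members[np.argmin(distances)]]
    return clusters, representatives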

References

  • Andrews, J. H. (1998, October). Testing using log file analysis: tools, methods, and issues. In Proceedings 13th IEEE International Conference on Automated Software Engineering (Cat. No. 98EX239) (pp. 157-166). IEEE.

  • He, S., Zhu, J., He, P., & Lyu, M. R. (2016, October). Experience report: System log analysis for anomaly detection. In 2016 IEEE 27th international symposium on software reliability engineering (ISSRE) (pp. 207-218). IEEE.