# Log Analysis

Discover patterns, trends, and anomalies in your logs with Gonzo's built-in analysis features. From automatic pattern detection to time-series analysis, these features surface insights that would be slow or impractical to find by reading logs manually.

{% hint style="info" %}
**Access Point:** Most log analysis features are accessed through the **Counts panel** (bottom-right). Press `Enter` on the Counts panel to open the detailed analysis modal.
{% endhint %}

### Analysis Overview

Gonzo's log analysis combines several techniques, each answering a different question about your logs:

| Feature                   | Algorithm                | What It Reveals               | Best For                         |
| ------------------------- | ------------------------ | ----------------------------- | -------------------------------- |
| **Pattern Detection**     | Drain3 clustering        | Recurring log templates       | Finding common issues            |
| **Time-Series Analysis**  | 60-minute rolling window | Trends over time              | Understanding incident timing    |
| **Heatmap Visualization** | Block-character shading  | Activity patterns by severity | Visual pattern recognition       |
| **Service Distribution**  | Real-time aggregation    | Which services log what       | Multi-service debugging          |
| **Anomaly Detection**     | Statistical analysis     | Unusual patterns              | Proactive problem identification |

### The Analysis Dashboard

#### Counts Panel Overview

The Counts panel (bottom-right) is the entry point to Gonzo's analysis features:

```
┌─ COUNTS ─────────────────────────────┐
│ Severity Distribution:               │
│ ERROR  █████████████████████   (45%) │
│ WARN   ██████████████          (30%) │
│ INFO   ███████                 (20%) │
│ DEBUG  ██                       (5%) │
│                                      │
│ Total Entries: 2,847                 │
│ Time Span: 2h 15m                    │
│ Entries/min: 21.1                    │
│ Pattern Count: 23                    │
│                                      │
│ Press Enter for detailed analysis... │
└──────────────────────────────────────┘
```

**Key Metrics Explained:**

* **Severity Distribution** - Percentage breakdown by log level, with visual bars
* **Total Entries** - Count of all processed log entries in the current session
* **Time Span** - Duration from the first to the last log entry
* **Entries/min** - Average logging rate (useful for capacity planning)
* **Pattern Count** - Number of unique patterns detected by the Drain3 algorithm
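These metrics relate through simple arithmetic. As a minimal sketch, assuming Entries/min is the total divided by the time span in minutes (Gonzo's exact computation is not documented on this page), 2,847 entries over 2h 15m (135 minutes) works out to about 21.1 entries per minute. The `LogEntry` record and `summarize` function are hypothetical names for illustration, not part of Gonzo.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    """Hypothetical minimal log record: just a timestamp and a severity."""
    timestamp: datetime
    severity: str


def summarize(entries: list[LogEntry]) -> dict:
    """Compute Counts-panel-style metrics (an illustration, not Gonzo's code)."""
    total = len(entries)
    if total == 0:
        return {"total": 0, "span": timedelta(0), "per_min": None, "severity_pct": {}}
    times = [e.timestamp for e in entries]
    span = max(times) - min(times)
    minutes = span.total_seconds() / 60
    # Assumption: rate = total / span in minutes; undefined when the span is zero.
    per_min = total / minutes if minutes > 0 else None
    counts = Counter(e.severity for e in entries)
    severity_pct = {sev: 100 * n / total for sev, n in counts.items()}
    return {"total": total, "span": span, "per_min": per_min, "severity_pct": severity_pct}


start = datetime(2024, 1, 15, 12, 0)
sample = [
    LogEntry(start, "ERROR"),
    LogEntry(start + timedelta(minutes=1), "INFO"),
    LogEntry(start + timedelta(minutes=2), "INFO"),
    LogEntry(start + timedelta(minutes=2), "WARN"),
]
print(summarize(sample))
print(round(2847 / 135, 1))  # 21.1 (2h 15m = 135 minutes)
```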

#### Detailed Analysis Modal

Press `Enter` on the Counts panel to access the comprehensive analysis modal:

```
┌─ LOG ANALYSIS (Press ESC to close) ──────────────────────────────┐
│                                                                  │
│ Time-Series Heatmap (60-minute rolling window):                  │
│ ┌────────────────────────────────────────────────────────────┐   │
│ │Time: 60  50  40  30  20  10  0 (minutes ago)               │   │
│ │ERROR ████░░██████░░░░████████████████ High intensity       │   │
│ │WARN  ░░██████░░████░░░░██████░░░░░░░░ Medium intensity     │   │
│ │INFO  ░░░░░░░░░░░░░░░░████░░░░░░░░░░░░ Low intensity        │   │
│ │DEBUG ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ Minimal activity     │   │
│ └────────────────────────────────────────────────────────────┘   │
│                                                                  │
│ Top 3 Patterns by Severity:                                      │
│ ┌────────────────────────────────────────────────────────────┐   │
│ │ ERROR:                                                     │   │
│ │ 1. Database connection timeout (247 occurrences)           │   │
│ │ 2. User authentication failed (156 occurrences)            │   │
│ │ 3. API rate limit exceeded (89 occurrences)                │   │
│ │                                                            │   │
│ │ WARN:                                                      │   │
│ │ 1. Slow query detected (324 occurrences)                   │   │
│ │ 2. Memory usage high (198 occurrences)                     │   │
│ │ 3. Cache miss rate elevated (156 occurrences)              │   │
│ └────────────────────────────────────────────────────────────┘   │
│                                                                  │
│ Service Distribution:                                            │
│ web-api: 1,247 entries (44%)  database: 892 entries (31%)        │
│ auth: 654 entries (23%)       cache: 54 entries (2%)             │
│                                                                  │
│ Navigation: ↑/↓ Scroll, Mouse Wheel, ESC to close                │
└──────────────────────────────────────────────────────────────────┘
```

### Time-Series Heatmap Analysis

#### Understanding the Heatmap

The time-series heatmap is one of Gonzo's most useful visual analysis tools: for each severity level, it shows how log activity is distributed across the last 60 minutes.

**Time Axis (Horizontal):**

* Shows the last 60 minutes in 1-minute buckets
* Reading: `60` = 60 minutes ago, `0` = current minute
* Updates in real-time as new logs arrive

**Severity Axis (Vertical):**

* Each row represents a different log severity level
* Separate tracking for ERROR, WARN, INFO, DEBUG, etc.
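* Independent scaling per severity level: each row's shading is relative to that row's own busiest minute, so equally dark cells in two rows can represent very different counts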

**Intensity Indicators:**

```
░ = Low activity (1-25% of max for this severity)
▒ = Medium activity (25-50% of max)
▓ = High activity (50-75% of max)  
█ = Very high activity (75-100% of max)
```
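A minimal sketch of how per-minute counts could map onto these characters with independent per-severity scaling follows. It assumes each row is normalized to its own busiest minute, treats each legend boundary as inclusive (the legend's ranges overlap at 25%, 50%, and 75%), and renders minutes with no entries as a blank; Gonzo's actual rendering may differ, and `render_row` is a hypothetical name.

```python
from __future__ import annotations

# Legend thresholds as fractions of the row maximum (upper bounds, inclusive).
LEVELS = [(0.25, "░"), (0.50, "▒"), (0.75, "▓"), (1.00, "█")]


def intensity_char(count: int, row_max: int) -> str:
    """Map one minute's count to a shade relative to its own row's maximum."""
    if count <= 0 or row_max <= 0:
        return " "  # assumption: empty minutes are rendered as blank
    ratio = count / row_max
    for upper, char in LEVELS:
        if ratio <= upper:
            return char
    return "█"


def render_row(counts_per_minute: list[int]) -> str:
    # Independent scaling: each severity row is normalized to its own max.
    row_max = max(counts_per_minute, default=0)
    return "".join(intensity_char(c, row_max) for c in counts_per_minute)


# Index 0 = oldest minute, last index = current minute (made-up example values).
heatmap = {
    "ERROR": [0, 1, 2, 8, 8, 3],
    "INFO": [120, 110, 40, 20, 90, 100],
}
for severity, counts in heatmap.items():
    print(f"{severity:<6}{render_row(counts)}")
```

With independent scaling, 8 ERROR entries in a minute render as `█` while 20 INFO entries render as `░`, because each count is compared only with its own row's maximum.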

#### Reading Heatmap Patterns

{% tabs %}
{% tab title="Incident Detection" %}
**Identifying When Problems Started:**

```
Time: 60  50  40  30  20  10  0
ERROR ░░░░░░░░████████████░░░░░░
WARN  ░░░░░░████████████████░░░░
```

**Analysis:**

* Problem started around 40 minutes ago
* Error activity stays at high intensity from about 40 minutes ago until roughly 10 minutes ago
* Warnings preceded errors (early warning signs)
* System appears to be recovering now

**Use Case:** Incident timeline reconstruction
{% endtab %}

{% tab title="Performance Patterns" %}
**Recurring Performance Cycles:**

```
Time: 60  50  40  30  20  10  0
ERROR ░░░░░░░░░░░░░░░░░░░░░░░░░░
WARN  ░░░░████░░░░████░░░░████░░
INFO  ████░░░░████░░░░████░░░░░░
```

**Analysis:**

* Regular 20-minute cycles in warnings
* High info activity alternating with warnings
* Suggests scheduled job or batch processing
* No critical errors, normal operational pattern

**Use Case:** Capacity planning and optimization
{% endtab %}

{% tab title="Cascade Failures" %}
**System Failure Propagation:**

```
Time: 60  50  40  30  20  10  0
ERROR ░░░░░░░░░░█░░██████████████
WARN  ░░░░░░░░░░░███████████████░
INFO  ░░░░░░████████████░░░░░░░░
```

**Analysis:**

* Single error triggered cascade
* Warnings spread quickly after initial error
* Info logs dropped off (services became unresponsive)
* Classic cascade failure pattern

**Use Case:** System resilience analysis
{% endtab %}
{% endtabs %}

#### Heatmap Best Practices

**🔍 Investigation Techniques:**

1. **Start wide, zoom in** - Look for obvious patterns first
2. **Compare severity levels** - How do different levels correlate?
3. **Identify inflection points** - When did patterns change?
4. **Look for cycles** - Are there recurring patterns?

**⚡ Quick Analysis:**

```bash
# Quick heatmap analysis workflow:
# 1. Press Enter on the Counts panel
# 2. Scan the heatmap for obvious spikes or patterns
# 3. Note correlation between severity levels
# 4. Identify time ranges for deeper investigation
# 5. Use the time information to filter the main log view
```

### Pattern Detection with Drain3

#### How Drain3 Works

Gonzo uses the Drain3 algorithm, an online (streaming) log template miner based on the Drain method, for automatic pattern detection:

**What Drain3 Does:**

* **Clusters similar log entries** into pattern templates
* **Extracts variable parts** (IDs, timestamps, values) from static text
* **Maintains pattern counts** in real-time
* **Adapts to new patterns** as they appear

**Example Pattern Detection:**

```bash
# Original log entries:
"User 12345 login failed"
"User 67890 login failed"  
"User 54321 login failed"

# Drain3 pattern:
"User <ID> login failed" (3 occurrences)
```
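The core idea can be shown in a few lines. This is a simplified, Drain-style illustration, not Drain3 itself or Gonzo's implementation: it groups lines by token count, merges a line into the most similar existing template when enough literal tokens match, and replaces the positions that differ with a wildcard. The `SimpleDrain` class and the 0.5 similarity threshold are assumptions, and the wildcard is written as `<*>` (Drain3's default placeholder) where the example above shows `<ID>`.

```python
from __future__ import annotations

from dataclasses import dataclass

WILDCARD = "<*>"


@dataclass
class Cluster:
    template: list[str]
    count: int = 1


def similarity(template: list[str], tokens: list[str]) -> float:
    """Fraction of positions where the template holds the same literal token."""
    if not tokens:
        return 1.0
    same = sum(1 for t, tok in zip(template, tokens) if t != WILDCARD and t == tok)
    return same / len(tokens)


class SimpleDrain:
    def __init__(self, sim_threshold: float = 0.5):
        self.sim_threshold = sim_threshold
        self.groups: dict[int, list[Cluster]] = {}  # token count -> clusters

    def add(self, line: str) -> Cluster:
        # Pre-mask purely numeric tokens, a common Drain preprocessing step.
        tokens = [WILDCARD if tok.isdigit() else tok for tok in line.split()]
        clusters = self.groups.setdefault(len(tokens), [])
        best = max(clusters, key=lambda c: similarity(c.template, tokens), default=None)
        if best is not None and similarity(best.template, tokens) >= self.sim_threshold:
            # Positions that disagree become wildcards in the shared template.
            best.template = [t if t == tok else WILDCARD for t, tok in zip(best.template, tokens)]
            best.count += 1
            return best
        cluster = Cluster(tokens)
        clusters.append(cluster)
        return cluster


drain = SimpleDrain()
for line in ["User 12345 login failed", "User 67890 login failed",
             "User 54321 login failed", "Connected to host-a", "Connected to host-b"]:
    drain.add(line)
for clusters in drain.groups.values():
    for c in clusters:
        print(" ".join(c.template), f"({c.count} occurrences)")
```

Drain3 additionally searches a fixed-depth parse tree keyed on leading tokens instead of comparing against every template of the same length, and supports configurable masking rules; the flat per-length list here trades that efficiency for brevity.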

#### Pattern Analysis Features

**Top Patterns by Severity:**

In the analysis modal, patterns are grouped by severity level:

```
ERROR Patterns:
1. Database connection timeout (247 occurrences)
2. User authentication failed (156 occurrences)  
3. API rate limit exceeded (89 occurrences)

WARN Patterns:
1. Slow query detected (324 occurrences)
2. Memory usage high (198 occurrences)
3. Cache miss rate elevated (156 occurrences)
```

**What This Tells You:**

{% tabs %}
{% tab title="Problem Prioritization" %}
**Focus on High-Count Patterns:**

* **Database connection timeout (247)** - Critical infrastructure issue
* **User authentication failed (156)** - Security/user experience impact
* **Slow query detected (324)** - Performance degradation

**Analysis Priority:**

1. Address database connectivity first (highest error count)
2. Investigate authentication system second
3. Optimize slow queries for long-term performance
{% endtab %}

{% tab title="Root Cause Analysis" %}
**Pattern Correlation:**

* **A high volume of slow-query warnings** might be causing database timeouts
* **Memory usage warnings** could be related to query performance
* **Cache misses** might be increasing database load

**Investigation Path:**

1. Check if slow queries correlate with connection timeouts
2. Verify if memory pressure affects query performance
3. Analyze cache efficiency impact on database load
{% endtab %}

{% tab title="Trend Analysis" %}
**Pattern Evolution:**

* Are certain patterns increasing over time?
* Do patterns appear in clusters or continuously?
* Are new patterns emerging that weren't seen before?

**Long-term Monitoring:**

* Track pattern counts over days/weeks
* Identify patterns that are growing vs shrinking
* Spot new issues before they become critical
{% endtab %}
{% endtabs %}
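The Root Cause Analysis path asks, for example, whether slow queries correlate with connection timeouts. If you tally two patterns per minute yourself (the modal shows totals, so these per-minute series are an assumption and the values below are made up), a Pearson correlation coefficient near 1 means the patterns rise and fall together. Correlation alone does not show which one causes the other. `pearson` is a hypothetical helper; Python 3.10+ also provides `statistics.correlation`.

```python
from __future__ import annotations

from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points each")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention used here: a flat series correlates with nothing
    return cov / sqrt(var_x * var_y)


# Per-minute counts, oldest minute first (made-up example values).
slow_queries = [2, 3, 2, 9, 14, 12, 4]
db_timeouts = [0, 0, 1, 5, 9, 8, 1]
print(f"r = {pearson(slow_queries, db_timeouts):.2f}")
```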

#### Working with Pattern Data

**Pattern-Based Filtering:**

Once you identify interesting patterns, use them for focused analysis:

```bash
# From analysis modal, identify pattern:
"Database connection timeout"

# Create filter in main view:
/database.*connection.*timeout

# Or also require "error" earlier on the same line (e.g. in a level field):
/error.*database.*timeout
```
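Pattern templates can also be turned into filters mechanically instead of by hand. The sketch below escapes a template's literal words, turns placeholders into a non-whitespace wildcard, and optionally prefixes `(?i)` for case-insensitive matching. It assumes placeholders look like `<*>` or `<ID>`, and that Gonzo's `/` filter accepts standard syntax such as `\s`, `\S`, and `(?i)` (Python's `re` and Go's RE2 syntax both support them; check whether Gonzo's filter is case-sensitive). `template_to_regex` is a hypothetical helper.

```python
import re

PLACEHOLDERS = {"<*>", "<ID>"}  # assumption: tokens to treat as variable parts


def template_to_regex(template: str, ignore_case: bool = False) -> str:
    """Escape a template's literal words and turn placeholders into \\S+ wildcards."""
    parts = [r"\S+" if tok in PLACEHOLDERS else re.escape(tok) for tok in template.split()]
    regex = r"\s+".join(parts)  # tolerate any run of whitespace between words
    return "(?i)" + regex if ignore_case else regex


print(template_to_regex("User <ID> login failed"))  # User\s+\S+\s+login\s+failed
print(template_to_regex("Database connection timeout", ignore_case=True))
```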

**Pattern Evolution Tracking:**

```bash
# Compare patterns over time:
# 1. Note current top patterns
# 2. Wait 30-60 minutes  
# 3. Check analysis modal again
# 4. See which patterns increased/decreased
# 5. Identify trends and new issues
```
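If you write down the top patterns and their counts at each check, comparing two checks is a dictionary diff. The sketch below assumes you record the counts by hand, as the steps above describe; `compare_snapshots` is a hypothetical helper, and the second snapshot's counts are made up for illustration.

```python
from __future__ import annotations


def compare_snapshots(before: dict[str, int], after: dict[str, int]) -> dict[str, list]:
    """Classify patterns as new, gone, growing, or shrinking between two checks."""
    common = [p for p in after if p in before]
    return {
        "new": sorted(p for p in after if p not in before),
        "gone": sorted(p for p in before if p not in after),
        # Largest increase first, so the fastest-growing pattern is listed on top.
        "growing": sorted(((p, after[p] - before[p]) for p in common if after[p] > before[p]),
                          key=lambda item: -item[1]),
        # Largest decrease first.
        "shrinking": sorted(((p, after[p] - before[p]) for p in common if after[p] < before[p]),
                            key=lambda item: item[1]),
    }


first_check = {"Database connection timeout": 247, "User authentication failed": 156,
               "API rate limit exceeded": 89}
second_check = {"Database connection timeout": 410, "User authentication failed": 150,
                "Disk quota warning": 12}  # made-up values for illustration
for kind, items in compare_snapshots(first_check, second_check).items():
    print(kind, items)
```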

### Service Distribution Analysis

#### Understanding Service Metrics

The service distribution section shows which services are generating logs:

```
Service Distribution:
web-api: 1,247 entries (44%)    database: 892 entries (31%)
auth: 654 entries (23%)         cache: 54 entries (2%)
```

**What This Reveals:**

| Metric               | Meaning                | Investigation Questions             |
| -------------------- | ---------------------- | ----------------------------------- |
| **High Percentage**  | Service is very active | Is this normal? Performance issue?  |
| **Low Percentage**   | Service is quiet       | Is it supposed to be active? Down?  |
| **Sudden Changes**   | Activity shift         | What caused the change?             |
| **Missing Services** | Service not logging    | Is it running? Configuration issue? |
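Each percentage in the example distribution is that service's share of all entries: the four counts sum to 2,847, and 1,247 / 2,847 is about 43.8%, shown as 44%. A minimal sketch of that computation follows, assuming service names have already been extracted from each entry (how Gonzo extracts them is not covered here); `service_distribution` is a hypothetical helper.

```python
from __future__ import annotations

from collections import Counter


def service_distribution(counts: Counter) -> list[tuple[str, int, int]]:
    """Return (service, entries, rounded percent of total), busiest service first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, n, round(100 * n / total)) for name, n in counts.most_common()]


# The example counts; in practice you might build this Counter from the
# service name of every entry, however your logs expose it.
example = Counter({"web-api": 1247, "database": 892, "auth": 654, "cache": 54})
for name, n, pct in service_distribution(example):
    print(f"{name}: {n:,} entries ({pct}%)")
```

Rounded shares do not always add up to exactly 100%; here 44 + 31 + 23 + 2 happens to.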

#### Service-Based Analysis

{% tabs %}
{% tab title="Load Distribution" %}
**Normal Load Patterns:**

```
web-api: 45%    # Frontend traffic: normally the largest share
database: 30%   # Backend queries: moderate share
auth: 20%       # Authentication: low-to-moderate share
cache: 5%       # Cache operations: small share
```

**Red Flags:**

* Database share suddenly rises above 60% (possible performance issue)
* Auth drops to 0% (service may be down)
* A new, unknown service appears with a high percentage
{% endtab %}

{% tab title="Problem Service Identification" %}
**Troublesome Patterns:**

```
# Before incident:
web-api: 40%, database: 35%, auth: 20%, cache: 5%

# During incident:
database: 70%, web-api: 25%, auth: 3%, cache: 2%
```

**Analysis:**

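* The database share spike suggests it is struggling; because these are shares of the total, a rise can mean more database logs, fewer logs from other services, or both
* The web-api share drop suggests its requests are being blocked
* Auth and cache drops suggest they cannot reach the database
* Classic database bottleneck pattern
{% endtab %}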

{% tab title="Service Health Monitoring" %}
**Health Indicators:**

* **Consistent percentages** = Healthy steady state
* **Gradual changes** = Normal load evolution
* **Sudden spikes** = Performance issues or load shifts
* **Disappearing services** = Potential outages

**Monitoring Strategy:**

1. Note baseline percentages for each service
2. Watch for deviations > 20% from baseline
3. Investigate services with sudden percentage changes
4. Correlate with error patterns from same services
{% endtab %}
{% endtabs %}
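The "deviations > 20% from baseline" check in the monitoring strategy can be scripted once you have baseline and current shares. The text does not say whether 20% means a relative change or 20 percentage points; this sketch assumes relative change, so a service going from 35% to 70% of entries counts as a +100% deviation. `share_deviations` is a hypothetical helper, and services with no baseline are flagged as new.

```python
from __future__ import annotations


def share_deviations(baseline: dict[str, float], current: dict[str, float],
                     threshold: float = 0.20) -> dict[str, float]:
    """Return {service: relative change} for shares that moved more than `threshold`."""
    flagged: dict[str, float] = {}
    for service in sorted(baseline.keys() | current.keys()):
        base = baseline.get(service, 0.0)
        now = current.get(service, 0.0)
        if base == 0:
            if now > 0:
                flagged[service] = float("inf")  # new service: no baseline to compare against
            continue
        change = (now - base) / base  # relative change, e.g. 35% -> 70% is +1.0
        if abs(change) > threshold:
            flagged[service] = change
    return flagged


# Shares from the "before" and "during" incident example in the tabs above.
before = {"web-api": 40, "database": 35, "auth": 20, "cache": 5}
during = {"database": 70, "web-api": 25, "auth": 3, "cache": 2}
for service, change in share_deviations(before, during).items():
    print(f"{service}: {change:+.0%}")
```

A service that stops logging entirely shows up as a -100% change, which matches the "disappearing services" indicator.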

### Advanced Analysis Techniques

#### Correlation Analysis

**Cross-Reference Multiple Data Points:**

```bash
# Workflow for comprehensive analysis:
# 1. Heatmap: When did issues occur?
# 2. Patterns: What specific issues happened?
# 3. Services: Which services were affected?
# 4. Main logs: Examine specific instances

# Example correlation:
# Heatmap shows an error spike 30 minutes ago
# Patterns show "database timeout" as the top error
# Services show the database share increased during that time
# Conclusion: likely a database performance issue starting about 30 minutes ago
```
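The first question in this workflow, when issues occurred, can also be answered numerically if you have per-minute error counts. One simple statistical rule, shown here only as an illustration (this page does not document the method behind Gonzo's anomaly detection), flags minutes more than `k` standard deviations above the mean. `find_spikes`, `minutes_ago`, and the default `k=2` are assumptions.

```python
from __future__ import annotations

from statistics import mean, pstdev


def find_spikes(counts: list[int], k: float = 2.0) -> list[int]:
    """Indexes of minutes whose count exceeds mean + k * population stdev."""
    if not counts:
        return []
    threshold = mean(counts) + k * pstdev(counts)
    return [i for i, c in enumerate(counts) if c > threshold]


def minutes_ago(index: int, window: int) -> int:
    """Series is oldest-first, so the last index is the current minute (0 min ago)."""
    return window - 1 - index


# ERROR entries per minute over the last 15 minutes, oldest first (made-up values).
errors = [1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 25, 30, 1, 0, 1]
print([minutes_ago(i, len(errors)) for i in find_spikes(errors)])  # [4, 3]
```

Large spikes inflate the standard deviation they are measured against, so for long or very spiky series a median-based rule is more robust.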

#### Trend Identification

**Long-Term Pattern Analysis:**

```bash
# Daily analysis routine:
# 1. Open the analysis modal first thing
# 2. Note current top patterns and service distribution
# 3. Compare with yesterday's patterns (manual tracking)
# 4. Identify:
#    - New patterns that appeared
#    - Patterns that increased in frequency
#    - Services with changed activity levels
# 5. Investigate significant changes
```

#### Performance Baseline Establishment

**Creating Performance Baselines:**

```bash
# Week 1: Establish baselines
# Track normal patterns:
# - Typical service distribution percentages
# - Common pattern frequencies  
# - Normal heatmap intensity levels
# - Typical entries/minute rates

# Week 2+: Compare against baselines
# Alert on:
# - Service distribution changes > 25%
# - New high-frequency error patterns
# - Unusual heatmap spike patterns
# - Entry rate changes > 50%
```

### Analysis Workflows

#### Incident Investigation Workflow

```bash
# 1. Get timeline overview
# Press Enter on Counts panel
# Check heatmap for incident timing

# 2. Identify problem patterns  
# Look at top error patterns
# Note pattern frequencies

# 3. Determine affected services
# Check service distribution
# Identify services with unusual activity

# 4. Deep dive investigation
# Filter main log view by time (here 14:20 to 14:39 on 2024-01-15):
# /2024-01-15.*14:[2-3][0-9]
# Filter by pattern: /database.*connection.*timeout
# Examine specific log entries

# 5. Root cause analysis
# Correlate timing + patterns + services
# Look for cascade failure indicators
# Check for external factors (deployments, load changes)
```
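Step 4's time filter has to be written from the heatmap's relative times ("30 minutes ago"). The sketch below converts a minutes-ago range into a regex in the same `<date>.*<time>` shape as the example filter, listing each minute explicitly. It assumes your log lines contain a `YYYY-MM-DD` date followed later by an `HH:MM` time in the same timezone as `now`; `window_filter` is a hypothetical helper.

```python
from __future__ import annotations

import re
from collections import defaultdict
from datetime import datetime, timedelta


def window_filter(now: datetime, start_ago: int, end_ago: int) -> str:
    """Regex covering every minute from `start_ago` to `end_ago` minutes before `now`."""
    if start_ago < end_ago or end_ago < 0:
        raise ValueError("expected start_ago >= end_ago >= 0")
    by_date: dict[str, list[str]] = defaultdict(list)
    for ago in range(start_ago, end_ago - 1, -1):  # oldest minute first
        t = now - timedelta(minutes=ago)
        by_date[t.strftime("%Y-%m-%d")].append(t.strftime("%H:%M"))
    # One "<date>.*(?:HH:MM|...)" branch per calendar day in the window.
    branches = [f"{day}.*(?:{'|'.join(times)})" for day, times in by_date.items()]
    return branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"


now = datetime(2024, 1, 15, 14, 45)
pattern = window_filter(now, 25, 22)
print(pattern)  # 2024-01-15.*(?:14:20|14:21|14:22|14:23)
print(bool(re.search(pattern, "2024-01-15T14:21:07Z ERROR database connection timeout")))  # True
```

Like the hand-written filter, this can also match an `MM:SS` pair that happens to equal one of the listed times (for example `13:14:21` contains `14:21`); add a delimiter such as `T` before the time if your timestamp format allows it.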

#### Performance Monitoring Workflow

```bash
# 1. Establish current state
# Review heatmap for activity patterns
# Note baseline service distributions
# Check current pattern frequencies

# 2. Set monitoring expectations
# Note normal intensity levels
# Track typical pattern counts
# Baseline entries/minute rate

# 3. Continuous monitoring
# Check analysis modal every 30-60 minutes
# Look for deviations from baseline
# Track trend directions (improving/degrading)

# 4. Proactive investigation
# Investigate patterns with increasing frequency
# Monitor services with changing activity levels
# Use early warning patterns (WARN level issues)
```

#### Capacity Planning Workflow

```bash
# 1. Long-term trend analysis
# Track entries/minute over weeks
# Monitor service distribution evolution
# Note peak activity periods

# 2. Pattern evolution tracking
# Which patterns are becoming more frequent?
# Are new error patterns emerging?
# How do patterns correlate with load?

# 3. Resource correlation
# Compare log activity with system metrics
# Identify log volume vs performance relationship
# Plan capacity based on log analysis trends
```

### Troubleshooting Analysis Issues

#### Performance Issues

**Analysis Modal Loading Slowly:**

```bash
# Causes and solutions:
# - Large datasets: Reduce log buffer size
# - Complex patterns: Reset data periodically
# - High update frequency: Increase update interval

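# Optimization (example values: choose a --log-buffer below your current
# setting and a longer --update-interval to reduce work per refresh):
gonzo -f logs.log --log-buffer=5000 --update-interval=5s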
```

**Pattern Detection Not Working:**

```bash
# Common issues:
# - Logs too unstructured: Pattern detection works best with consistent formats
# - High variability: Some logs have too many unique elements
# - Insufficient data: Need meaningful sample size

# Solutions:
# - Use structured logs (JSON, logfmt) when possible
# - Filter out highly variable elements before analysis
# - Allow more time for pattern establishment
```

#### Interpretation Issues

**Unclear Heatmap Patterns:**

```bash
# If heatmap seems random:
# - Check if time range is appropriate
# - Verify log timestamps are accurate
# - Consider if activity is actually irregular

# Debugging steps:
# 1. Focus on single severity level
# 2. Use smaller time windows
# 3. Check raw log timestamps
```

**Misleading Service Distribution:**

```bash
# Common causes:
# - Inconsistent service name extraction
# - Missing service identifiers in logs
# - Mixed log formats affecting parsing

# Solutions:
# - Standardize service naming in logs
# - Use structured logging with consistent fields
# - Manually filter by service if needed
```

### Best Practices

#### 🎯 **Effective Analysis Strategies**

1. **Start with overview, drill down** - Use heatmap for big picture, patterns for specifics
2. **Correlate multiple data sources** - Combine timing, patterns, and services
3. **Track trends over time** - Compare current state with historical baselines
4. **Focus on high-impact patterns** - Prioritize by frequency and severity

#### 📊 **Data Quality Optimization**

1. **Use structured logging** - JSON and logfmt provide better analysis
2. **Consistent field naming** - Helps with service distribution accuracy
3. **Meaningful log levels** - Proper ERROR/WARN/INFO usage improves analysis
4. **Include context** - Service names, trace IDs, and relevant metadata

#### ⚡ **Performance Optimization**

1. **Right-size analysis windows** - Balance detail with performance
2. **Reset data periodically** - Prevent memory buildup in long sessions
3. **Filter appropriately** - Reduce dataset size for complex analysis
4. **Monitor resource usage** - Adjust settings based on system capacity

### What's Next?

Now that you understand log analysis, explore these complementary features:

* **AI Integration** - Combine algorithmic analysis with AI insights
* **Format Detection** - Optimize data input for better analysis
* **Configuration** - Tune analysis settings for your needs
* **Integration Examples** - Apply analysis to real-world scenarios

***

**You now know how to use Gonzo's analytical capabilities!** 🚀 The combination of time-series analysis, pattern detection, and service distribution gives you a detailed view of log behavior and system health.

