Logstash on Linux: Complete Guide to the Data Processing Pipeline

August 26, 2025

What is Logstash?

Logstash is a powerful, open-source data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and sends it to your favorite “stash” like Elasticsearch. Part of the Elastic Stack (formerly ELK Stack), Logstash excels at collecting, parsing, and transforming logs and events from various sources into a common format.

On Linux systems, Logstash serves as the central hub for data processing, capable of handling everything from simple log forwarding to complex data enrichment and transformation tasks. It’s designed to handle data from any source, in any format, with over 200 plugins available for different inputs, filters, and outputs.

Installing Logstash on Linux

Installation via Package Manager (Recommended)

The most straightforward way to install Logstash on Linux is through the official Elastic repository:

# Download the Elastic GPG key into a dedicated keyring (apt-key is deprecated on current Debian/Ubuntu releases)
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg

# Add repository to sources list
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list

# Update package list and install
sudo apt update
sudo apt install logstash

For Red Hat-based systems (CentOS, RHEL, Fedora):

# Add Elastic repository
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

# Create repository file
cat << EOF | sudo tee /etc/yum.repos.d/elastic.repo
[elastic-8.x]
name=Elastic repository for 8.x packages
baseurl=https://artifacts.elastic.co/packages/8.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF

# Install Logstash
sudo yum install logstash

Manual Installation

For manual installation, download the appropriate package:

# Download Logstash (replace version as needed)
wget https://artifacts.elastic.co/downloads/logstash/logstash-8.11.0-linux-x86_64.tar.gz

# Extract archive
tar -xzf logstash-8.11.0-linux-x86_64.tar.gz

# Move to desired location
sudo mv logstash-8.11.0 /opt/logstash

# Create symlink for easier access
sudo ln -s /opt/logstash/bin/logstash /usr/local/bin/logstash

Logstash Configuration Fundamentals

Logstash configurations follow a simple three-section structure: input, filter, and output. Each section defines how data flows through the pipeline.

Basic Configuration Structure

input {
  # Define data sources
}

filter {
  # Transform and enrich data
}

output {
  # Send processed data to destinations
}

Configuration File Location

Logstash configuration files are typically stored in:

  • /etc/logstash/conf.d/ – Main configuration directory
  • /etc/logstash/logstash.yml – Main settings file
  • /etc/logstash/pipelines.yml – Pipeline definitions
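On package installations, the default pipelines.yml typically defines a single main pipeline that loads every .conf file from the conf.d directory, which is why dropping configuration files there is enough to get started. A sketch of that default (exact contents may vary slightly between versions):

- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"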

Essential Logstash Commands

Starting and Managing Logstash

# Start Logstash service
sudo systemctl start logstash

# Enable auto-start on boot
sudo systemctl enable logstash

# Check service status
sudo systemctl status logstash

# Stop Logstash
sudo systemctl stop logstash

# Restart Logstash
sudo systemctl restart logstash

Running Logstash with Custom Configuration

# Run with specific configuration file
/usr/share/logstash/bin/logstash -f /path/to/config.conf

# Test configuration syntax
/usr/share/logstash/bin/logstash -f /path/to/config.conf --config.test_and_exit

# Run in debug mode
/usr/share/logstash/bin/logstash -f /path/to/config.conf --log.level debug

Input Plugins and Configuration

File Input Plugin

The file input plugin is one of the most commonly used inputs for reading log files:

input {
  file {
    path => "/var/log/apache2/access.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => "plain"
  }
}

Expected Output: Logstash reads the file from the beginning and then continuously processes new entries as they are written. Because sincedb_path is set to /dev/null, the read position is not remembered across restarts, which is convenient for testing but usually undesirable in production.

Beats Input Plugin

For receiving data from Beats (Filebeat, Metricbeat, etc.):

input {
  beats {
    port => 5044
    host => "0.0.0.0"
  }
}

Syslog Input Plugin

To receive syslog messages over the network (binding to port 514 requires root privileges, so unprivileged deployments often listen on a higher port such as 5514 instead):

input {
  syslog {
    port => 514
    type => "syslog"
  }
}

TCP Input Plugin

For receiving data over TCP connections:

input {
  tcp {
    port => 9999
    codec => json_lines
  }
}
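A quick way to exercise this input is to pipe a JSON line into the port with netcat (assuming nc is installed and Logstash is listening locally):

# Send a single test event to the TCP input
echo '{"user":"test","action":"login"}' | nc localhost 9999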

Filter Plugins for Data Processing

Grok Filter

Grok is the primary filter for parsing unstructured log data into structured data:

filter {
  grok {
    match => { "message" => "%{COMMONAPACHELOG}" }
  }
}
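For example, a standard Apache common-log line like the one below is broken into structured fields; with legacy (non-ECS) naming these include clientip, verb, request, response, and bytes:

# Sample line matched by %{COMMONAPACHELOG}
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326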

Custom grok patterns for specific log formats. Because this pattern writes its last capture back into the existing message field, overwrite is used so the original value is replaced rather than turned into a multi-valued field:

filter {
  grok {
    match => { 
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" 
    }
    overwrite => [ "message" ]
  }
}

Date Filter

Parse timestamps and set the @timestamp field:

filter {
  date {
    match => [ "timestamp", "yyyy-MM-dd HH:mm:ss" ]
    target => "@timestamp"
  }
}
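If the source timestamps carry no timezone information, the timezone option tells the filter how to interpret them. A minimal sketch in which logdate is a hypothetical field name:

filter {
  date {
    match => [ "logdate", "yyyy-MM-dd HH:mm:ss" ]
    timezone => "UTC"
    target => "@timestamp"
  }
}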

Mutate Filter

Modify fields, rename, remove, or add fields:

filter {
  mutate {
    rename => { "old_field" => "new_field" }
    remove_field => [ "unwanted_field" ]
    add_field => { "environment" => "production" }
    convert => { "response_time" => "integer" }
  }
}

Conditional Processing

Apply filters conditionally based on field values:

filter {
  if [type] == "apache" {
    grok {
      match => { "message" => "%{COMMONAPACHELOG}" }
    }
  } else if [type] == "nginx" {
    grok {
      match => { "message" => "%{NGINXACCESS}" }
    }
  }
}
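Conditionals are also the usual way to discard noise before it reaches any output; the drop filter removes the event entirely (loglevel is a hypothetical field here):

filter {
  if [loglevel] == "DEBUG" {
    drop { }
  }
}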

Output Plugins and Destinations

Elasticsearch Output

Send processed data to Elasticsearch (mapping types were removed in Elasticsearch 8, so the legacy document_type option is omitted):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}

File Output

Write processed data to files:

output {
  file {
    path => "/var/log/logstash/processed-%{+YYYY-MM-dd}.log"
    codec => line { format => "%{timestamp} %{level} %{message}" }
  }
}

Stdout Output (for Testing)

Display output in the console for debugging:

output {
  stdout {
    codec => rubydebug
  }
}
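For quick experiments you do not even need a configuration file; the -e flag accepts the pipeline definition directly on the command line:

# Read lines from stdin and print the parsed events to the console
/usr/share/logstash/bin/logstash -e 'input { stdin { } } output { stdout { codec => rubydebug } }'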

Real-World Configuration Examples

Apache Log Processing Pipeline

Complete configuration for processing Apache access logs:

input {
  file {
    path => "/var/log/apache2/access.log"
    start_position => "beginning"
    sincedb_path => "/var/lib/logstash/sincedb_apache"
    type => "apache"
  }
}

filter {
  if [type] == "apache" {
    grok {
      match => { "message" => "%{COMMONAPACHELOG}" }
    }
    
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    
    if [response] {
      mutate {
        convert => { "response" => "integer" }
      }
    }
    
    if [bytes] {
      mutate {
        convert => { "bytes" => "integer" }
      }
    }
    
    mutate {
      remove_field => [ "timestamp", "message" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "apache-logs-%{+YYYY.MM.dd}"
  }
  
  stdout {
    codec => rubydebug
  }
}
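A common follow-up for access logs is geographic enrichment; the geoip filter looks up the client address against its bundled GeoLite2 database. A sketch that assumes the legacy clientip field produced by %{COMMONAPACHELOG}:

filter {
  if [type] == "apache" {
    geoip {
      source => "clientip"
      target => "geoip"
    }
  }
}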

Multi-Input Pipeline

Configuration handling multiple log sources:

input {
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx"
    tags => ["nginx", "access"]
  }
  
  file {
    path => "/var/log/nginx/error.log"
    type => "nginx"
    tags => ["nginx", "error"]
  }
  
  file {
    path => "/var/log/syslog"
    type => "syslog"
    tags => ["system"]
  }
}

filter {
  if "nginx" in [tags] and "access" in [tags] {
    grok {
      match => { "message" => "%{NGINXACCESS}" }
    }
  } else if "nginx" in [tags] and "error" in [tags] {
    grok {
      match => { "message" => "%{NGINXERROR}" }
    }
  } else if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{PROG:program}: %{GREEDYDATA:message}" }
      overwrite => [ "message" ]
    }
    
    # Syslog-style timestamps only apply to this branch, so parse them here
    # rather than for every event type (nginx timestamps use a different format)
    date {
      match => [ "timestamp", "MMM dd HH:mm:ss", "MMM  d HH:mm:ss" ]
    }
  }
}

output {
  if [type] == "nginx" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "nginx-%{+YYYY.MM.dd}"
    }
  } else if [type] == "syslog" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "syslog-%{+YYYY.MM.dd}"
    }
  }
}

Performance Tuning and Optimization

Pipeline Configuration

Optimize Logstash performance by tuning pipeline settings in /etc/logstash/logstash.yml:

# Pipeline workers (defaults to the number of CPU cores)
pipeline.workers: 4

# Number of events each worker collects per batch
pipeline.batch.size: 1000

# Maximum time in milliseconds to wait while filling a batch
pipeline.batch.delay: 50

# Enable persistent queues
queue.type: persisted

# Maximum on-disk size of the persistent queue
queue.max_bytes: 1gb

JVM Settings

Configure JVM heap size in /etc/logstash/jvm.options:

# Set matching minimum and maximum heap sizes; as a rule of thumb, use no more
# than about half of available RAM so the OS and off-heap memory have headroom
-Xms2g
-Xmx2g

# Garbage collection settings
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
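After restarting Logstash, current heap usage can be checked through the monitoring API to confirm the new settings took effect:

# Inspect JVM heap usage via the node stats API
curl -XGET "localhost:9600/_node/stats/jvm?pretty"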

Monitoring and Troubleshooting

Monitoring APIs

Logstash exposes a monitoring API, by default on localhost port 9600:

# Check node info
curl -XGET "localhost:9600/_node?pretty"

# Check pipeline stats
curl -XGET "localhost:9600/_node/stats/pipelines?pretty"

# Check hot threads
curl -XGET "localhost:9600/_node/hot_threads?pretty"

Log File Monitoring

Monitor Logstash logs for troubleshooting:

# View Logstash logs
sudo tail -f /var/log/logstash/logstash-plain.log

# Check for errors
sudo grep -i error /var/log/logstash/logstash-plain.log

# Monitor with journalctl
sudo journalctl -u logstash -f

Common Troubleshooting Commands

# Test configuration syntax
sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash -t

# Check configuration files
sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit

# Run in debug mode
sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash --log.level debug

Security Considerations

File Permissions

Ensure proper file permissions for security:

# Set ownership for Logstash files
sudo chown -R logstash:logstash /etc/logstash/
sudo chown -R logstash:logstash /var/log/logstash/
sudo chown -R logstash:logstash /var/lib/logstash/

# Set proper permissions
sudo chmod 640 /etc/logstash/conf.d/*.conf
sudo chmod 600 /etc/logstash/logstash.yml

Network Security

Configure secure communication with Elasticsearch:

output {
  elasticsearch {
    hosts => ["https://localhost:9200"]
    user => "logstash_writer"
    password => "secure_password"
    ssl => true
    ssl_certificate_verification => true
    cacert => "/path/to/ca.crt"
  }
}
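Rather than hard-coding credentials in pipeline files, secrets can be stored in the Logstash keystore and referenced as ${KEY} in the configuration. A sketch in which ES_PWD is an arbitrary key name:

# Create the keystore and add the Elasticsearch password under the key ES_PWD
sudo /usr/share/logstash/bin/logstash-keystore --path.settings /etc/logstash create
sudo /usr/share/logstash/bin/logstash-keystore --path.settings /etc/logstash add ES_PWD

# Keep the keystore readable by the logstash user
sudo chown logstash:logstash /etc/logstash/logstash.keystore

# Then reference the secret in the output block:
#   password => "${ES_PWD}"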

Integration with Elastic Stack

Filebeat to Logstash

Configure Filebeat to send data to Logstash:

# In filebeat.yml on the host shipping the logs
output.logstash:
  hosts: ["localhost:5044"]

# Matching Logstash pipeline (e.g. /etc/logstash/conf.d/beats.conf)
input {
  beats {
    port => 5044
  }
}
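Once both sides are in place, Filebeat's built-in connectivity check is a quick way to confirm it can reach the Logstash listener:

# Verify that Filebeat can reach the configured Logstash output
sudo filebeat test output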

Logstash to Kibana

Data processed by Logstash and stored in Elasticsearch becomes available in Kibana for visualization and analysis once a data view (index pattern) matching the indices has been created.

Advanced Features

Multiple Pipelines

Configure multiple pipelines in /etc/logstash/pipelines.yml:

- pipeline.id: apache
  path.config: "/etc/logstash/conf.d/apache.conf"
  pipeline.workers: 2
  
- pipeline.id: nginx  
  path.config: "/etc/logstash/conf.d/nginx.conf"
  pipeline.workers: 2
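Pipeline definitions and the files they reference can also be reloaded without restarting the service by enabling automatic reload in logstash.yml:

# Pick up configuration changes automatically
config.reload.automatic: true
config.reload.interval: 3s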

Dead Letter Queue

Handle events that the Elasticsearch output rejects (for example, because of mapping conflicts) with the dead letter queue:

# Enable in logstash.yml
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 1gb

# Separate pipeline that re-reads events from the dead letter queue
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dead_letter_queue"
    pipeline_id => "main"
  }
}

Best Practices

  • Configuration Management: Use version control for configuration files
  • Resource Monitoring: Monitor CPU, memory, and disk usage regularly
  • Field Naming: Use consistent field naming conventions
  • Error Handling: Implement proper error handling and logging
  • Testing: Test configurations in development environments first
  • Documentation: Document custom grok patterns and configurations
  • Security: Regularly update Logstash and secure network communications

Logstash on Linux provides a robust foundation for building scalable data processing pipelines. By following these examples and best practices, you can create efficient log processing systems that handle large volumes of data while maintaining performance and reliability. Regular monitoring and optimization ensure your Logstash deployment continues to meet your data processing requirements as they evolve.