How Many Logs Are Enough? A Backend Engineer's Guide to Logging

As a backend engineer, one of the most common questions I encounter is: "How much logging is too much?" It's a delicate balance between having enough information to debug issues and avoiding log pollution that makes debugging harder. Let me share my thoughts on finding the sweet spot for logging in production systems.

The Logging Dilemma

Logging is like seasoning in cooking - too little and you can't taste anything, too much and it ruins the dish. In software engineering, insufficient logging leaves you blind when issues occur, while excessive logging can:

  • Overwhelm your monitoring systems
  • Increase storage costs
  • Create performance bottlenecks
  • Make it harder to find relevant information
  • Potentially expose sensitive data

The Four Pillars of Effective Logging

1. Context Is Key

Why it matters: Context transforms a useless log entry into actionable intelligence. Without context, you're left playing detective with incomplete clues. When an issue occurs at 3 AM, you need to understand not just what happened, but why it happened and what circumstances led to it.

The business impact: Context-rich logs reduce mean time to resolution (MTTR) dramatically. Instead of spending hours correlating different log sources and guessing at root causes, you can immediately understand the user journey, system state, and environmental factors that contributed to the issue.

Every log entry should answer the question: "What was the system trying to do when this happened?"

# Bad - Tells you nothing useful
logger.info("User login failed")
 
# Good - Gives you everything needed to investigate
logger.info("User login failed", extra={
    "user_id": user.id,
    "ip_address": request.remote_addr,
    "user_agent": request.headers.get('User-Agent'),
    "login_attempt_count": get_login_attempts(user.id),
    "failure_reason": "invalid_password",
    "account_status": user.status,
    "last_successful_login": user.last_login_at
})

Pro tip: Include enough context so that someone unfamiliar with your system can understand what was happening. Your future self will thank you.

2. Strategic Log Levels

Why it matters: Log levels are your filtering mechanism and alert system rolled into one. They determine what gets attention, what gets stored long-term, and what gets ignored. Misusing log levels is like crying wolf - eventually, people stop paying attention to your alerts.

The operational impact: Proper log levels enable effective monitoring and alerting. ERROR logs should wake someone up at night, while DEBUG logs should help during development. When log levels are used consistently, you can set up reliable alerting thresholds and log retention policies.

Use log levels purposefully:

  • ERROR: Something broke and needs immediate attention (triggers alerts, wakes up on-call engineers)
  • WARN: Something unexpected happened but the system recovered (investigate during business hours)
  • INFO: Important business events that stakeholders care about (user actions, state changes, metrics)
  • DEBUG: Detailed information for troubleshooting (disabled in production, enabled for specific investigations)

# ERROR - System is broken
logger.error("Payment processing failed", extra={
    "payment_id": payment.id,
    "error_code": "GATEWAY_TIMEOUT",
    "amount": payment.amount
})
 
# WARN - Unexpected but handled
logger.warning("Fallback to secondary payment gateway", extra={
    "primary_gateway": "stripe",
    "fallback_gateway": "paypal",
    "reason": "rate_limit_exceeded"
})
 
# INFO - Business event
logger.info("User subscription upgraded", extra={
    "user_id": user.id,
    "from_plan": "basic",
    "to_plan": "premium"
})

3. Structure Your Logs

Why it matters: Structured logs are machine-readable, which means they're searchable, aggregatable, and analyzable at scale. Free-form text logs are like having a library with no catalog system - you might find what you're looking for eventually, but it's going to take a while.

The scalability factor: As your system grows, you'll need to analyze patterns across millions of log entries. Structured logs enable you to:

  • Query specific fields efficiently
  • Create dashboards and metrics
  • Set up automated anomaly detection
  • Correlate events across services
  • Export data to analytics platforms

Use structured logging (JSON) instead of free-form text:

# Instead of this - Hard to parse and search
logger.info(f"User {user_id} purchased {product_name} for ${amount} using {payment_method}")
 
# Do this - Machine readable and queryable
logger.info("Purchase completed", extra={
    "event": "purchase_completed",
    "user_id": user_id,
    "product_id": product_id,
    "product_name": product_name,
    "amount": amount,
    "currency": "USD",
    "payment_method": payment_method,
    "timestamp": datetime.utcnow().isoformat(),
    "session_id": session.id
})

Benefits of structured logging:

  • Easy to filter: event:purchase_completed AND amount:>100
  • Simple aggregation: Sum all purchase amounts by user
  • Pattern detection: Identify unusual payment method usage
  • Cross-service correlation: Link purchases to user behavior
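
One practical note: Python's standard logging module won't turn those extra fields into JSON by itself - you need a formatter that serializes them. Below is a minimal stdlib-only sketch (libraries like python-json-logger or structlog solve the same problem with less code); the JsonFormatter class and STANDARD_ATTRS helper are illustrative names, not standard-library APIs:

import json
import logging
 
logger = logging.getLogger(__name__)
 
# Attributes every LogRecord has by default; anything else came from extra=
STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message"}
 
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Copy the custom fields passed via extra={...} into the payload
        payload.update({
            k: v for k, v in record.__dict__.items()
            if k not in STANDARD_ATTRS
        })
        return json.dumps(payload, default=str)
 
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

With a formatter like this in place, the purchase example above comes out as a single JSON line that your log platform can index field by field.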

4. Performance Considerations

Why it matters: Logging should never be the bottleneck in your system. Poorly implemented logging can degrade performance, increase latency, and ironically, make your system less reliable. The goal is observability without overhead.

The hidden costs: Every log statement has a cost:

  • CPU time for formatting and serialization
  • Memory allocation for log objects
  • I/O operations for writing to disk/network
  • Network bandwidth for centralized logging
  • Storage costs for log retention

Be mindful of logging performance:

# Use asynchronous logging for high-throughput systems
import logging
import logging.handlers
import queue
 
# The QueueHandler makes log calls cheap: they only enqueue the record
log_queue = queue.Queue()
queue_handler = logging.handlers.QueueHandler(log_queue)
logger.addHandler(queue_handler)
 
# A QueueListener drains the queue on a background thread and hands
# records to the real (slow) handlers, e.g. file or network handlers
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()
 
# Implement sampling for very frequent events
import random
 
def should_log_debug(sample_rate=0.01):
    return random.random() < sample_rate
 
# Only log 1% of debug events in high-traffic scenarios
if should_log_debug():
    logger.debug("Cache hit", extra={"key": cache_key})
 
# Avoid expensive operations in log statements
# Bad - Database call in every log
logger.info(f"User {get_user_name(user_id)} logged in")
 
# Good - Pass the data you already have
logger.info("User logged in", extra={
    "user_id": user_id,
    "username": user.username  # Already loaded
})

Performance best practices:

  • Use lazy evaluation for expensive log formatting (sketched below)
  • Implement circuit breakers for logging systems
  • Monitor logging system health separately
  • Set up log rotation and retention policies
  • Consider log sampling for high-frequency events
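
To make the first point concrete, here is a minimal sketch of lazy evaluation; build_debug_report() is a hypothetical stand-in for any expensive call:

import logging
 
logger = logging.getLogger(__name__)
 
def build_debug_report():
    # Hypothetical stand-in for expensive work (walking a cache, big joins)
    return "expensive summary"
 
# Bad - the report is built on every call, even when DEBUG is disabled
logger.debug("Cache state: " + build_debug_report())
 
# Good - the guard skips the expensive call entirely when DEBUG is off;
# %-style arguments also defer the string formatting until emission
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Cache state: %s", build_debug_report())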

What to Log: The Essential Events

Business Logic Events

  • User authentication and authorization
  • Financial transactions
  • Data modifications
  • Feature usage metrics
  • Error conditions and exceptions

System Health Indicators

  • Service startup/shutdown
  • Database connection issues
  • External API calls and their responses (example below)
  • Resource usage patterns
  • Performance metrics
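
For external API calls, a lightweight wrapper that records the outcome and latency of each call pays for itself quickly. A sketch using the requests library; the inventory-service URL, timeout, and field names are illustrative:

import time
import requests
 
def fetch_inventory(item_id):
    url = f"https://inventory.internal/items/{item_id}"  # illustrative URL
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=2)
        logger.info("External API call completed", extra={
            "service": "inventory",
            "url": url,
            "status_code": response.status_code,
            "duration_ms": round((time.monotonic() - start) * 1000, 1)
        })
        return response.json()
    except requests.RequestException as exc:
        logger.error("External API call failed", extra={
            "service": "inventory",
            "url": url,
            "error": str(exc),
            "duration_ms": round((time.monotonic() - start) * 1000, 1)
        })
        raise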

Security Events

  • Failed authentication attempts
  • Permission denied events
  • Suspicious activity patterns
  • Data access logs

What NOT to Log

Sensitive Information

  • Passwords, tokens, or API keys
  • Personally identifiable information (PII)
  • Credit card numbers or financial data
  • Session tokens
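
Mistakes happen, so it helps to have a guardrail as well as a rule. One option is a logging filter that masks known-sensitive field names before records are written; a minimal sketch, where SENSITIVE_KEYS and the masking rule are assumptions to adapt to your own data:

import logging
 
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "card_number"}
 
class RedactionFilter(logging.Filter):
    def filter(self, record):
        # Fields passed via extra={...} become attributes on the record
        for key in list(record.__dict__):
            if key.lower() in SENSITIVE_KEYS:
                record.__dict__[key] = "[REDACTED]"
        return True  # never drop the record, only mask its fields
 
logger.addFilter(RedactionFilter())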

High-Frequency, Low-Value Events

  • Every database query in a high-traffic system
  • Successful health check pings
  • Routine background job executions

Practical Implementation Tips

1. Use Correlation IDs

Track requests across services and events:

import uuid
from contextvars import ContextVar
 
correlation_id: ContextVar[str] = ContextVar('correlation_id')
 
def log_with_correlation(message, **kwargs):
    logger.info(message, extra={
        'correlation_id': correlation_id.get(None),
        **kwargs
    })
 
# For Kafka events - use the same correlation ID for all logs related to processing the same message
def process_kafka_message(message):
    correlation_id.set(message.headers.get('correlation_id', str(uuid.uuid4())))
    logger.info("Processing Kafka message", extra={
        'event_type': 'kafka_message_received',
        'topic': message.topic,
        'partition': message.partition,
        'offset': message.offset
    })
    
    # All subsequent logs in this processing flow will use the same correlation ID
    process_business_logic(message.value)
 
# For API requests - extract or generate correlation ID at the entry point
def api_middleware(request):
    request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
    correlation_id.set(request_id)
    logger.info("API request received", extra={
        'method': request.method,
        'path': request.path,
        'user_id': getattr(request.user, 'id', None)
    })

Key principle: Use the same correlation ID for all logs related to:

  • Processing a single Kafka message
  • Handling one API request
  • Executing one script or batch job
  • Any single logical operation that spans multiple components
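
The missing piece in the snippets above is propagation: downstream services can only join the trail if the correlation ID travels with the request. A sketch using the requests library and the same X-Request-ID header; call_downstream and the header name are assumptions:

import uuid
import requests
 
def call_downstream(url, payload):
    # Reuse the current correlation ID (or mint one) for the outbound call
    headers = {"X-Request-ID": correlation_id.get(str(uuid.uuid4()))}
    response = requests.post(url, json=payload, headers=headers, timeout=5)
    log_with_correlation("Downstream call completed",
                         url=url, status_code=response.status_code)
    return response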

2. Implement Log Sampling

For high-frequency events:

import random
 
def should_log_sample(sample_rate=0.01):
    return random.random() < sample_rate
 
if should_log_sample():
    logger.debug("High frequency event occurred")

3. Use Feature Flags for Log Levels

Allow runtime adjustment without deployment:

def get_log_level():
    return config.get('LOG_LEVEL', 'INFO')
 
logger.setLevel(get_log_level())
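
Note that the snippet above only applies the level once. For the adjustment to actually happen at runtime, something has to re-read the flag; a minimal sketch using a background thread, where config.get is the same assumed config/feature-flag client as above:

import threading
import time
 
def refresh_log_level(interval_seconds=30):
    while True:
        # Re-read the flag periodically so level changes apply without a restart
        logger.setLevel(config.get('LOG_LEVEL', 'INFO'))
        time.sleep(interval_seconds)
 
threading.Thread(target=refresh_log_level, daemon=True).start()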

Conclusion

There's no universal answer to "how many logs are enough" - it depends on your system's requirements, traffic patterns, and business needs. Start with logging critical business events and errors, then gradually add more detailed logging based on actual debugging needs.

Remember: logs are a tool, not an end goal. They should make your life easier, not harder. Focus on logging events that help you understand what your system is doing and why it might be failing.

The best logging strategy is one that evolves with your system. Start simple, measure the impact, and adjust based on real-world usage patterns. Your future self (and your on-call teammates) will thank you for the thoughtful approach to logging.