Splitting Microservices: A Real-World Case Study

Microservices architecture has become the gold standard for building scalable systems, but knowing when and how to split services is crucial for success. This post explores the strategic decisions behind microservice decomposition through a real-world case study of splitting an Alert service from an Analytics platform.
The Microservices Question: Why Split?
Before diving into the technical details, let's address the fundamental question: Why do we need microservices in the first place?
The Benefits of Microservice Architecture
1. Improved Fault Isolation When a large monolithic service fails, it can bring down your entire system. Microservices contain failures, preventing cascading outages that can devastate user experience.
2. Facilitated Continuous Delivery Smaller services enable more frequent deployments with reduced risk. Teams can ship features independently without coordinating massive releases.
3. Technology Flexibility Large monolithic services create technology lock-in. Microservices allow teams to choose the best tools for specific problems and experiment with new frameworks without system-wide impact.
4. Clear Ownership and Responsibility Well-defined service boundaries aligned with business domains enable teams to take full ownership of specific system areas, improving accountability and expertise.
When Should You Split a Microservice?
The decision to split isn't just about size - it's about identifying the right boundaries. Here are key indicators:
Domain Boundary Clarity
When you can cleanly separate a business domain with minimal cross-cutting concerns, that domain is often a good candidate for splitting. In our case study, Alert functionality had distinct business logic separate from Analytics reporting.
Team Ownership Patterns
If different teams are working on different parts of the same service and experiencing coordination overhead, splitting might reduce friction.
Deployment Independence
When parts of your system need to be deployed at different cadences or have different risk profiles, separation becomes valuable.
Performance Characteristics
Different workloads may have vastly different performance requirements. Alerts need real-time processing, while analytics can tolerate batch processing delays.
Case Study: Splitting Alert Service from Analytics Platform
Let's examine a real-world example of splitting an Alert service from a larger Analytics platform, demonstrating the complexity and considerations involved.
The Problem Setup
The Alert and Analytics features were initially co-located in the same repository, creating several challenges:
- Maintenance overhead: Different teams working on the same codebase
- Deployment coupling: Analytics changes affecting Alert stability
- Technology constraints: Framework decisions affecting both domains
- Unclear ownership: Shared responsibility leading to gaps in expertise
System Architecture Overview
The Alert module consisted of three main components:
- API Server: REST endpoints for creating, updating, and managing alerts
- Background Processing: Periodic evaluation and notification system
- Event Consumer: Message queue consumer for configuration management
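To make the Alert boundary concrete, here is a minimal sketch of what an alert definition handled by these components might look like. The fields are hypothetical and purely illustrative, not the actual schema from the case study.

from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical fields, for illustration only, not the actual schema from the case study
    id: str
    metric_name: str           # metric the background processor evaluates
    condition: str             # e.g. "above" or "below"
    threshold: float
    notification_channel: str  # e.g. an email address or Slack webhook
    enabled: bool = True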
The Migration Strategy: Zero-Downtime Decomposition
The key to successful microservice splitting is maintaining system availability throughout the process. Here's the strategic approach we used:
Deployment Philosophy
Tight Coordination Window: The first three deployments (Phases 1-3) must occur on the same day to minimize code duplication and reduce the maintenance burden of running parallel systems.
Incremental Migration: Each deployment phase builds upon the previous one, allowing for validation and rollback at each step.
Phase 1: API Replication
Objective: Create parallel API endpoints without disrupting existing functionality.
Steps:
- Infrastructure Setup: Create new environment configurations and secrets management
- Code Replication: Deploy identical Alert APIs to new service domain
- Monitoring Setup: Establish health checks and alerting for the new service
- Traffic Switch: Update frontend to consume new API endpoints
Key Insight: Start with the user-facing layer first. This lets you validate the new service with real traffic while retaining the ability to roll back quickly.
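Before switching traffic, it helps to verify that the replicated API returns the same results as the original. Below is a minimal smoke-test sketch, assuming both services expose a compatible /alerts/{id} endpoint; the URLs and path are placeholders, not the real services from the case study.

import requests

# Placeholder URLs: substitute the real old and new service hosts
OLD_BASE_URL = "https://analytics.example.com/api"
NEW_BASE_URL = "https://alerts.example.com/api"


def compare_alert_endpoints(alert_id: str) -> bool:
    """Fetch the same alert from both services and compare the payloads."""
    old_resp = requests.get(f"{OLD_BASE_URL}/alerts/{alert_id}", timeout=5)
    new_resp = requests.get(f"{NEW_BASE_URL}/alerts/{alert_id}", timeout=5)
    if old_resp.status_code != new_resp.status_code:
        print(f"Status mismatch: {old_resp.status_code} vs {new_resp.status_code}")
        return False
    if old_resp.json() != new_resp.json():
        print(f"Payload mismatch for alert {alert_id}")
        return False
    return True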
Phase 2: Event Processing Migration
Objective: Move message queue consumer functionality to the new service.
Critical Challenge: Message queue consumer groups maintain offset state. Switching consumers requires careful offset management to prevent message loss or duplication.
Solution: Create a new consumer group that starts from the last unconsumed offset of the old consumer.
Switching with a script:
from confluent_kafka import Consumer as KafkaConsumer, TopicPartition


def setup_kafka_consumer_with_offset(
    start_offset: int, broker_url: str, topic_name: str, consumer_group: str,
):
    kafka_client = KafkaConsumer({
        "bootstrap.servers": broker_url,
        "group.id": consumer_group,
        "enable.auto.commit": False,
    })

    # Discover every partition of the topic so the new group covers all of them
    partition_list = kafka_client.list_topics(topic_name).topics[topic_name].partitions.keys()
    print(f"Topic {topic_name} has {len(partition_list)} partitions: {list(partition_list)}")

    # Assigning with an explicit offset already positions the consumer at start_offset,
    # so no separate seek() call is needed (seek right after assign can fail with
    # "Erroneous state" before the assignment is active).
    assigned_partitions = [TopicPartition(topic_name, pid, start_offset) for pid in partition_list]
    kafka_client.assign(assigned_partitions)
    print(f"Assigned partitions to consumer {consumer_group} at offset {start_offset}")

    # Poll once to verify the position; a fresh connection may need a longer timeout
    message = kafka_client.poll(1.0)
    if message is None:
        print(f"No message from topic {topic_name} at offset {start_offset}")
    elif message.error():
        print(f"Consumer error: {message.error()}")
    else:
        # Synchronous commit so the new group's offsets are stored before closing
        kafka_client.commit(asynchronous=False)
        print(f"Committed message from partition {message.partition()} at offset {message.offset()}")
    print(f"Successfully created consumer group {consumer_group}")

    kafka_client.close()
    print(f"Closed consumer {consumer_group}")
Switching with a tool: You can also do this from a Kafka web UI (e.g. Redpanda Console), provided it supports managing consumer group offsets.
Phase 3: Background Processing Migration
Objective: Migrate background processing system for alert evaluation and notifications.
Complexity: Background processing systems require careful coordination between job schedulers and worker processes.
Approach: Replicate the processing logic while maintaining the existing scheduler infrastructure, then switch processing workspaces atomically.
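As a rough illustration of that switch, a single configuration flag can decide which service's workers receive the scheduled alert evaluations, so the cutover is one flag flip. The flag name, queue objects, and dispatch function below are hypothetical, not the case study's actual scheduler code.

import os


def dispatch_alert_evaluation(alert_id: str, old_queue, new_queue):
    """Route a scheduled evaluation job to the old or new worker pool.

    ALERT_PROCESSING_TARGET is a hypothetical flag flipped at cutover time;
    old_queue and new_queue stand in for whatever job-queue client is in use.
    """
    target = os.environ.get("ALERT_PROCESSING_TARGET", "old")
    if target == "new":
        new_queue.enqueue("evaluate_alert", alert_id)
    else:
        old_queue.enqueue("evaluate_alert", alert_id)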
Phase 4: Data Migration
The Most Critical Phase: Database migration should be the last step, and it requires extreme care to maintain data consistency and system availability.
We have two main strategies for database migration:
Strategy 1: Backup and Restore (Minimal Downtime)
Steps:
- Database Preparation: Create new database with identical schema
- Service Preparation: Configure service to use new database credentials
- Traffic Halting: Temporarily disable write operations (return 503 for non-GET requests and disable the event consumer and background jobs)
- Data Export/Import: Use database backup and restore tools for consistent data transfer
- Service Switch: Enable new database and restore full functionality
Downtime Minimization: By halting only write operations and maintaining read access, user experience impact is minimized during the brief migration window.
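One way to implement the write-halting step is a small request guard that rejects non-GET requests while a maintenance flag is set, so reads keep working. This is a sketch assuming a Flask application; the global flag is illustrative and would normally come from configuration or a feature-flag service.

from flask import Flask, request

app = Flask(__name__)

# Toggled on just before the backup/restore window (illustrative global flag)
MIGRATION_IN_PROGRESS = False


@app.before_request
def block_writes_during_migration():
    # Reads stay available; writes are rejected with 503 during the migration window
    if MIGRATION_IN_PROGRESS and request.method != "GET":
        return ("Alert service is under maintenance, writes are temporarily disabled", 503)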
Strategy 2: Dual-Write with Replica (Zero Downtime)
Objective: Achieve true zero-downtime migration by duplicating writes to both databases during the transition period.
Prerequisites: Requires DevOps team collaboration for database replication setup.
Steps:
- Replica Setup: DevOps team creates a replica of the source database in the target environment
- Dual-Write Implementation: Modify application code to write to both original and replica databases
- Data Consistency Verification: Implement monitoring to ensure write operations succeed on both databases
- Read Traffic Switch: Gradually shift read operations to the new database while maintaining dual writes
- Write Traffic Switch: Once read operations are stable, switch write operations to target only the new database
- Cleanup: Remove dual-write logic and decommission the original database
Implementation Example:
import logging

logger = logging.getLogger(__name__)


class DualWriteRepository:
    def __init__(self, primary_db, replica_db, migration_mode=True):
        self.primary_db = primary_db
        self.replica_db = replica_db
        self.migration_mode = migration_mode

    def create_alert(self, alert_data):
        # Always write to primary first
        result = self.primary_db.create_alert(alert_data)
        if self.migration_mode:
            try:
                # Duplicate write to replica
                self.replica_db.create_alert(alert_data)
            except Exception as e:
                # Log error but don't fail the operation
                logger.error(f"Replica write failed: {e}")
                # Could implement retry logic or dead letter queue
        return result

    def update_alert(self, alert_id, update_data):
        result = self.primary_db.update_alert(alert_id, update_data)
        if self.migration_mode:
            try:
                self.replica_db.update_alert(alert_id, update_data)
            except Exception as e:
                logger.error(f"Replica update failed: {e}")
        return result
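The Read Traffic Switch step can be rolled out gradually while dual writes continue. Here is a minimal sketch, assuming a percentage-based rollout value decides which database serves each read; the class is hypothetical and simply complements the repository above.

import random


class GradualReadRouter:
    def __init__(self, primary_db, replica_db, new_db_read_percentage: float = 0.0):
        self.primary_db = primary_db
        self.replica_db = replica_db
        # Fraction of read traffic (0.0-1.0) served by the new database
        self.new_db_read_percentage = new_db_read_percentage

    def get_alert(self, alert_id):
        if random.random() < self.new_db_read_percentage:
            return self.replica_db.get_alert(alert_id)
        return self.primary_db.get_alert(alert_id)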
Advantages of Dual-Write Approach:
- True Zero Downtime: No service interruption during migration
- Gradual Migration: Can test and validate new database with real traffic
- Easy Rollback: Can switch back to original database instantly if issues arise
- Data Validation: Opportunity to verify data consistency before full cutover
Challenges and Considerations:
- Code Complexity: Requires implementing dual-write logic throughout the application
- Error Handling: Must handle cases where writes succeed on one database but fail on the other (a reconciliation sketch follows this list)
- Data Consistency: Need monitoring to ensure both databases stay in sync
- Performance Impact: Dual writes can increase latency and resource usage
- DevOps Coordination: Requires close collaboration with infrastructure team for replica setup
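For the error-handling challenge above, one common pattern is to record failed replica writes for later replay rather than retrying inline, which is what the "dead letter queue" comment in the repository hints at. Below is a minimal sketch using a local SQLite table; the class and schema are hypothetical.

import json
import sqlite3
import time


class ReplicaWriteReconciler:
    """Records failed replica writes so they can be replayed later (illustrative)."""

    def __init__(self, db_path: str = "failed_replica_writes.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS failed_writes "
            "(id INTEGER PRIMARY KEY, operation TEXT, payload TEXT, failed_at REAL)"
        )

    def record_failure(self, operation: str, payload: dict):
        # Called from the dual-write repository's except block instead of only logging
        self.conn.execute(
            "INSERT INTO failed_writes (operation, payload, failed_at) VALUES (?, ?, ?)",
            (operation, json.dumps(payload), time.time()),
        )
        self.conn.commit()

    def pending(self):
        # Failed writes waiting to be replayed against the replica
        return self.conn.execute(
            "SELECT id, operation, payload FROM failed_writes ORDER BY failed_at"
        ).fetchall()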
Monitoring During Dual-Write Phase:
class MigrationMonitor:
    def __init__(self, metrics_client):
        self.metrics = metrics_client

    def track_dual_write_success(self, operation_type):
        self.metrics.increment(f'migration.dual_write.success.{operation_type}')

    def track_dual_write_failure(self, operation_type, database):
        self.metrics.increment(f'migration.dual_write.failure.{operation_type}.{database}')

    def track_data_consistency_check(self, consistent: bool):
        status = 'consistent' if consistent else 'inconsistent'
        self.metrics.increment(f'migration.data_consistency.{status}')
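To feed track_data_consistency_check with a real signal, a periodic job can sample records from both databases and compare them. A minimal sketch follows, reusing the MigrationMonitor above; the list_alert_ids and get_alert helpers are hypothetical repository methods, not part of the original code.

import random


def run_consistency_check(primary_db, replica_db, monitor: MigrationMonitor, sample_size: int = 100):
    """Compare a random sample of alerts between the two databases."""
    alert_ids = primary_db.list_alert_ids()  # hypothetical helper returning all alert ids
    sample = random.sample(alert_ids, min(sample_size, len(alert_ids)))
    consistent = True
    for alert_id in sample:
        if primary_db.get_alert(alert_id) != replica_db.get_alert(alert_id):
            consistent = False
            monitor.track_dual_write_failure("consistency", "replica")
    monitor.track_data_consistency_check(consistent)
    return consistent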
Recommendation: Use Strategy 2 (Dual-Write) for production systems where downtime is not acceptable. Use Strategy 1 (Backup/Restore) for systems that can tolerate brief maintenance windows.
Phase 5: Cleanup
Objective: Remove deprecated code and infrastructure.
Importance: Cleanup is often overlooked but crucial for:
- Reducing technical debt
- Eliminating confusion about data sources
- Preventing accidental dependencies on old systems
Key Takeaways for Microservice Splitting
1. Plan for Data Consistency
Data migration is often the most complex part of service splitting. Plan for:
- Schema compatibility during transition periods
- Data consistency verification methods
- Rollback procedures for data operations
2. Maintain Service Contracts
APIs should remain stable during migration. Changes to service contracts can cascade throughout your system.
3. Monitor Everything
Comprehensive monitoring during migration helps identify issues before they impact users:
- Application metrics (latency, error rates)
- Infrastructure metrics (CPU, memory, network)
- Business metrics (feature usage, conversion rates)
4. Have a Rollback Plan
Every phase should have a clear rollback procedure. Document:
- Decision criteria for rollback
- Step-by-step rollback procedures
- Data recovery methods
5. Coordinate Team Communication
Microservice splitting affects multiple teams. Establish:
- Clear communication channels
- Deployment coordination protocols
- Escalation procedures for issues
Conclusion
Splitting microservices is a complex undertaking that requires careful planning, coordination, and execution. The benefits - improved fault isolation, deployment flexibility, and team autonomy - make it worthwhile when done correctly.
Remember that microservice architecture is not a destination but a journey. Start with clear business boundaries, plan for data consistency, maintain service contracts, and always have a rollback plan.
The key to successful microservice decomposition lies not in the technical implementation details, but in understanding your business domains and designing systems that reflect those boundaries. When done right, microservices enable teams to move faster and build more resilient systems.
Final thought: Don't split microservices because it's trendy - split them because it solves real problems in your organization. The complexity cost is real, but when justified by business needs, the benefits far outweigh the challenges.