This is part 4 of the “EDA for the Rest of Us” series. Previously, we covered Event Design Patterns (how to structure events) and Communication Patterns (how events flow between services). Now we’re tackling the consumption side - where those beautifully designed events meet the harsh reality of production systems.
Remember that perfectly designed event you published? The one with just the right amount of data, using the ideal communication pattern? Well, it’s about to hit reality: consumers that can’t keep up, services drowning in irrelevant events, and that one team whose Lambda function processes everything three times “just to be safe.”
Welcome to consumption patterns - where theory meets the cold, hard truth of production systems.
Table of Contents
- The Consumption Challenge
- The Reality of Distributed Consumption
- Pattern 1: Event Filtering
- Pattern 2: Competing Consumers
- Pattern 3: Consumer Groups
- Pattern 4: Backpressure Handling
- Pattern 5: Poison Message Handling
- Choosing the Right Pattern
- The Path Forward
- References
The Consumption Challenge
Here’s what becomes clear in the trenches: producing events is the easy part. The real complexity lives on the consumption side, where you face questions like:
- How do you scale processing when traffic increases 50x during peak events?
- How do you filter events efficiently when 90% are irrelevant to your service?
- How do you handle backpressure when your consumers can’t keep up?
- How do you ensure events are processed exactly once (spoiler: you probably can’t)?
Most teams discover these challenges through experience - when event backlogs grow into millions of messages or when processing costs exceed projections.
But here’s the thing: these patterns aren’t mutually exclusive. Real production systems layer multiple consumption patterns together. You’ll filter events to reduce noise, use competing consumers to scale processing, implement consumer groups for different teams, and add backpressure handling to protect everything from meltdown. Think of these patterns as tools in a toolkit - you’ll often use several at once to build a robust consumption strategy.
The Reality of Distributed Consumption
Before you choose any pattern, let’s address two guarantees in distributed systems that break most people’s assumptions: duplicate delivery and out-of-order processing. These aren’t edge cases - they’re the default behavior.
Duplicate Delivery: The “At-Least-Once” Reality
AWS services provide “at-least-once” delivery. That innocent-sounding phrase means: every event will be delivered one or more times. Not “might be” - WILL BE. Here’s why duplicates are guaranteed:
- Network timeouts: SQS delivers message, Lambda processes it, but acknowledgment fails. SQS delivers it again.
- Visibility timeout expires: Lambda takes too long, message becomes visible, another Lambda grabs it.
- Consumer failures: Lambda crashes after processing but before acknowledging. Message returns to queue.
- Retry mechanisms: DLQ redrive, manual replays, error recovery all create duplicates.
Real scenario: Your “charge-payment” event runs twice. Without protection, you’ve just double-charged a customer. Your “send-welcome-email” event runs three times. Customer gets three emails and thinks you’re spam.
The Ordering Challenge
Events will arrive out of order. Not sometimes. Always. Here’s why:
- Parallel processing: Competing consumers process events simultaneously
- Retry delays: Failed events get processed after newer ones
- Network latency: Events take different paths through the infrastructure
- Multiple producers: Different services publish related events at different times
Your consumer might see: “order-shipped” → “payment-failed” → “order-created” → “order-cancelled” → “payment-succeeded”
Living with Chaos
You can’t eliminate these challenges, but you can design for them:
For Duplicate Delivery (Idempotency):
- Use natural business keys (order ID, not auto-increment)
- Make operations idempotent by design (SET status = ‘shipped’, not status++)
- Track processed events (simple cache or database table)
- Use conditional writes (“update only if version = X”)
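To make the last two bullets concrete, here's a minimal sketch of the track-and-conditionally-write idea, assuming DynamoDB as the deduplication store (the table name, key, and helper function are illustrative, not a prescription):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def process_once(event_id: str, handler) -> bool:
    """Run handler at most once per event_id using a DynamoDB conditional write."""
    try:
        dynamodb.put_item(
            TableName="processed-events",  # hypothetical table keyed on event_id
            Item={"event_id": {"S": event_id}},
            # The write only succeeds if this event_id has never been recorded.
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate delivery: someone already claimed this event
        raise
    handler(event_id)
    return True
```

Note the gap: if the handler crashes after the conditional write succeeds, the event is marked as processed but never actually handled - exactly the kind of edge case the warning below is about.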
⚠️ Warning: Idempotency is Hard
The suggestions above are starting points, not complete solutions. Distributed idempotency is one of the hardest problems in event-driven systems. That “simple cache” will fail when TTLs expire. That “database table” will race when multiple Lambdas check simultaneously. True idempotency often requires:
- Event sourcing patterns with immutable logs
- Distributed locking (with all its problems)
- Saga orchestration with compensation logic
- Or accepting that some duplicates will slip through
Don’t underestimate this complexity. Many teams spend months getting idempotency right, only to discover edge cases in production at the worst possible time.
For Ordering:
- Design events to be self-contained (include current state, not just deltas)
- Use event sourcing patterns where order matters critically
- Implement saga patterns for multi-step processes
- Accept eventual consistency where possible
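As a quick illustration of the first bullet (the event shapes and field names are made up), compare a delta-only event with a self-contained one:

```python
# Delta-only event: a consumer must have seen every earlier adjustment,
# in order, to know the current stock level.
delta_event = {
    "type": "inventory-adjusted",
    "sku": "ABC-123",
    "adjustment": -3,
}

# Self-contained event: carries the resulting state plus a version, so a late
# or out-of-order consumer can simply keep the highest version it has seen.
self_contained_event = {
    "type": "inventory-adjusted",
    "sku": "ABC-123",
    "adjustment": -3,
    "quantityOnHand": 42,
    "version": 17,
    "occurredAt": "2024-03-01T12:00:00Z",
}
```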
These challenges affect every pattern you’ll implement:
- Event Filtering: Can hide duplicate events or deliver them to different consumers
- Competing Consumers: Guarantees out-of-order processing AND duplicates
- Consumer Groups: Each group gets their own duplicates at different times
- Backpressure: Makes both problems worse during high load
- Poison Message Handling: Must handle duplicates of poison messages and out-of-order DLQ processing
The key insight? Don’t fight it - design for it. Every consumer must handle duplicates and out-of-order events. This isn’t a bug in your architecture; it’s a fundamental property of distributed systems.
Now let’s look at the patterns themselves, keeping these constraints in mind.
Pattern 1: Event Filtering
Event Filtering selectively routes or processes events based on specific criteria, content, or metadata. Like a sophisticated mail sorting system that only delivers relevant mail to each recipient, filtering prevents consumers from being overwhelmed with irrelevant events. Filtering can occur at the broker level (server-side) before delivery to reduce network traffic and consumer processing load, or at the consumer level (client-side) for more complex filtering logic. Modern event brokers support sophisticated filtering including content-based filtering (examining event payload), attribute-based filtering (examining metadata), and pattern matching using wildcards or regular expressions.
graph LR
P[Event Producer] --> B[Event Broker]
B --> F1[Filter: Order > $1000]
B --> F2[Filter: Region = EU]
B --> F3[Filter: Priority = High]
F1 --> C1[High-Value Order Service]
F2 --> C2[EU Compliance Service]
F3 --> C3[Priority Handler]
style F1 fill:#f9f,stroke:#333,stroke-width:2px
style F2 fill:#f9f,stroke:#333,stroke-width:2px
style F3 fill:#f9f,stroke:#333,stroke-width:2px
When Event Filtering Shines
- High event volumes with low relevance rates - When services only care about 5-10% of events
- Multi-tenant systems - Each tenant’s services only process their own events
- Compliance boundaries - EU services only process EU data, healthcare services only see HIPAA-compliant events
- Cost optimization - Reduce Lambda invocations by 90%+ by filtering at the infrastructure level
Trade-offs
- Cost vs Complexity: Filtering can reduce costs by 90%+ by only processing relevant events, but infrastructure-level filters are limited to simple rules - complex business logic still requires application-level filtering
- Performance vs Visibility: Services run faster processing only relevant events with cleaner code, but debugging why events didn’t reach consumers becomes significantly harder
- Security vs Flexibility: Filtering ensures services never see data they shouldn’t access, but changing filter criteria requires infrastructure updates rather than simple code changes
- Efficiency vs Risk: Irrelevant events are dropped early saving resources, but incorrect filter configuration can silently drop important events with no easy recovery path
AWS Implementation
- Primary: Amazon EventBridge for content-based routing and filtering with sophisticated pattern matching
- Alternative: Amazon SNS with message filtering policies for simpler attribute-based filtering
- High-throughput: Amazon Kinesis Analytics for real-time stream filtering and processing
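As a rough sketch of what server-side filtering looks like with EventBridge, here's a rule whose event pattern only matches high-value EU orders - the bus name, rule name, and event fields are assumptions you'd replace with your own schema:

```python
import json
import boto3

events = boto3.client("events")

# Only high-value EU orders reach the target - everything else is dropped
# at the broker, before any Lambda is invoked.
event_pattern = {
    "source": ["ecommerce.orders"],
    "detail-type": ["OrderPlaced"],
    "detail": {
        "region": ["EU"],
        "total": [{"numeric": [">", 1000]}],
    },
}

events.put_rule(
    Name="high-value-eu-orders",
    EventBusName="orders-bus",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)
```

SNS offers a similar (though not identical) filter-policy syntax if all you need is simpler attribute-based matching.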
Pattern 2: Competing Consumers
Competing Consumers enables multiple consumer instances to process messages from the same message queue or topic partition, with each message handed to a single consumer. This creates competition among consumers for messages - like a team of workers pulling tasks from a shared queue: each task goes to only one worker, and adding workers increases overall throughput. The messaging system assigns each message to one consumer at a time, preventing parallel double-processing while enabling horizontal scaling (though, as covered above, at-least-once delivery still produces occasional duplicates). The pattern handles consumer failures gracefully through visibility timeouts or acknowledgment mechanisms that make messages from failed consumers available to other consumers.
graph LR
P[Event Producer] --> Q[Message Queue]
Q --> C1[Consumer 1]
Q --> C2[Consumer 2]
Q --> C3[Consumer 3]
Q --> C4[Consumer N...]
C1 --> D[Database]
C2 --> D
C3 --> D
C4 --> D
style Q fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#faa,stroke:#333,stroke-width:2px
When Competing Consumers Shine
- Variable load patterns - Black Friday vs. normal Tuesday
- CPU-intensive processing - Image processing, ML inference, report generation
- Downstream API rate limits - Spreading load across time to respect limits
- Resilience requirements - One consumer failure doesn’t stop processing
Trade-offs
- Scalability vs Ordering: Automatic load distribution across consumers enables easy horizontal scaling, but parallel processing completely breaks message ordering guarantees
- Simplicity vs Control: Queue-based distribution requires no coordination overhead and provides built-in fault tolerance, but offers limited control over which consumer processes which message
- Resilience vs Duplication: Failed messages automatically return to the queue for reprocessing, but at-least-once delivery means every consumer must handle duplicate messages
- Consumer vs Downstream Bottlenecks: Adding consumers easily absorbs increased load, but often just moves the bottleneck to databases or downstream APIs
AWS Implementation
- Primary: Amazon SQS with Lambda Event Source Mapping for serverless auto-scaling consumers
- Container-based: Amazon ECS/Fargate with Application Auto Scaling based on queue metrics
- Simple: Amazon SQS with EC2 Auto Scaling Groups for traditional architectures
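For the primary option, a consumer's Lambda handler might look like the sketch below - it assumes the event source mapping has ReportBatchItemFailures enabled, and process_order is a placeholder for your business logic:

```python
import json

def handler(event, context):
    """One of N competing consumers, triggered by an SQS event source mapping."""
    failures = []
    for record in event["Records"]:
        try:
            order = json.loads(record["body"])
            process_order(order)  # placeholder for real business logic
        except Exception:
            # Report only the failed message; it returns to the queue for another
            # consumer and, after maxReceiveCount attempts, lands in the DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

def process_order(order):
    ...  # must be idempotent: the same message can be delivered more than once
```

Because any of the competing consumers can receive a redelivered message, process_order has to stay idempotent (see the duplicate-delivery section above).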
Pattern 3: Consumer Groups
Consumer groups enable multiple independent consumers to read from the same event stream, each maintaining their own position and processing rate. Like multiple people reading the same newspaper, each consumer group can process at its own pace without affecting others.
graph TD
subgraph "Event Stream"
E1[Event 1] --> E2[Event 2]
E2 --> E3[Event 3]
E3 --> E4[Event 4]
E4 --> E5[Event N...]
end
subgraph "Analytics Group"
E1 --> A1[Position: Event 2]
A1 --> AS[Analytics Service]
end
subgraph "Billing Group"
E1 --> B1[Position: Event 4]
B1 --> BS[Billing Service]
end
subgraph "Audit Group"
E1 --> AU1[Position: Event 1]
AU1 --> AUS[Audit Service]
end
style E1 fill:#f9f,stroke:#333,stroke-width:2px
style E2 fill:#f9f,stroke:#333,stroke-width:2px
style E3 fill:#f9f,stroke:#333,stroke-width:2px
When Consumer Groups Shine
- Multiple processing speeds - Real-time fraud detection vs. hourly analytics
- Different retention needs - Audit needs 7 years, monitoring needs 7 days
- Independent scaling - Each group scales based on its workload
- Replay scenarios - Reprocess historical events without affecting other consumers
Trade-offs
- Independence vs Complexity: Each consumer group maintains its own position and processing speed, but managing multiple groups adds significant operational overhead
- Flexibility vs Cost: Longer retention enables replay from any point in history, but storing events for all consumer groups increases storage costs substantially
- Isolation vs Throughput: Workload isolation prevents slow consumers from impacting fast ones, but total throughput is limited by the stream’s partition count
- Replay vs Coordination: Each group can independently reprocess historical events, but there’s no built-in support for coordinated processing across groups
AWS Implementation
- Primary: Amazon Kinesis Data Streams with enhanced fan-out for dedicated throughput per consumer
- Complex patterns: Amazon MSK (Managed Kafka) for unlimited consumer groups and exactly-once semantics
- Simplified: Amazon EventBridge Pipes to create filtered consumer groups from Kinesis or DynamoDB streams
💰 Cost Warning: Kinesis consumer groups can cost 10-20x the base shard price with enhanced fan-out. Five consumer groups processing 1TB/month can add $500+/month on top of shard costs. Always calculate enhanced fan-out pricing before choosing Kinesis for multiple consumers.
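To make "each group keeps its own position" concrete, here's a polling-based sketch against the shared-throughput Kinesis API (not enhanced fan-out); the checkpoint dict stands in for a real store such as DynamoDB or a KCL lease table:

```python
import boto3

kinesis = boto3.client("kinesis")

def poll_shard_for_group(stream_name: str, shard_id: str, checkpoint: dict, handle):
    """Read one shard on behalf of a single consumer group.

    `checkpoint` holds this group's last-seen sequence number per shard;
    in production it would live in DynamoDB or a KCL lease table.
    """
    if shard_id in checkpoint:
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=checkpoint[shard_id],
        )["ShardIterator"]
    else:
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
        )["ShardIterator"]

    for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
        handle(record["Data"])
        checkpoint[shard_id] = record["SequenceNumber"]  # advance this group's position only
```

Running this loop once per group, each with its own checkpoint, is the whole trick: the analytics group can lag at event 2 while billing is already at event 4.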
Pattern 4: Backpressure Handling
Backpressure handling protects consumers from being overwhelmed by controlling the flow of events. It acts as a pressure release valve, ensuring that temporary spikes don’t cascade into system-wide failures.
graph LR
subgraph "High Load"
P1[Producer 1] --> B[Event Broker/Queue]
P2[Producer 2] --> B
P3[Producer N...] --> B
end
B --> RC[Rate Controller]
subgraph "Protected Consumers"
RC --> C1[Consumer 1]
RC --> C2[Consumer 2]
end
subgraph "Backpressure Signals"
C1 --> M1[Queue Depth]
C2 --> M2[Error Rate]
M1 --> RC
M2 --> RC
end
RC --> T[Throttle/Scale]
style B fill:#f9f,stroke:#333,stroke-width:2px
style RC fill:#faa,stroke:#333,stroke-width:2px
When It Shines
- Flash sales and viral events - 100x normal traffic spikes
- Batch job processing - Nightly dumps of millions of records
- API rate limits - Third-party services with strict quotas
- Database protection - Preventing connection pool exhaustion
Trade-offs
- Stability vs Latency: Protects system from cascade failures and maintains quality under load, but events queue up during spikes causing processing delays
- Protection vs Resources: Prevents overwhelming downstream services and controls costs, but buffers and queues consume memory and storage
- Automation vs Monitoring: Self-heals when load decreases without intervention, but requires complex multi-level monitoring and alerting
- Graceful vs Timely: Enables graceful degradation during overload, but long queues risk exceeding retention periods and message timeouts
AWS Implementation
- Primary: Amazon SQS as a shock absorber with 14-day retention and unlimited queue depth
- Rate limiting: Lambda reserved concurrency to protect downstream services
- Auto-scaling: Application Auto Scaling with queue depth metrics for dynamic capacity
🚧 Note: This is a simplified starting point focused on consumer-side protection. True backpressure includes signaling producers to slow down, but many teams don’t control their producers (third-party webhooks, other teams’ services). Real implementations often require database connection pooling, complex API rate limits per customer tier, and distributed rate limiting across regions. If you control both producer and consumer, add producer throttling based on queue depth or consumer lag metrics. Start here, but expect to evolve based on your actual bottlenecks.
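If you do control the producer, a crude sketch of throttling on queue depth looks like this (the queue URL and threshold are placeholders, and the sleep-based backoff is deliberately simplistic):

```python
import time
import boto3

sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"  # placeholder
MAX_BACKLOG = 10_000  # tune to what your consumers can actually drain

def queue_depth(queue_url: str) -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def publish_with_backpressure(message_bodies):
    """Pause publishing while the backlog is deeper than consumers can handle."""
    for body in message_bodies:
        while queue_depth(QUEUE_URL) > MAX_BACKLOG:
            time.sleep(5)  # crude backoff; add jitter and alerting in real systems
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
```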
Pattern 5: Poison Message Handling
Poison message handling isolates and manages messages that consistently fail processing, preventing them from blocking your entire event pipeline. Like a quarantine system in a hospital, it identifies “sick” messages and moves them aside so healthy messages can continue flowing while you diagnose the problem.
graph LR
Q[Message Queue] --> C[Consumer]
C --> P{Process<br/>Success?}
P -->|Yes| ACK[Acknowledge]
P -->|No| RC{Retry<br/>Count?}
RC -->| less than Max | RQ[Return to Queue]
RC -->| greater than Max | DLQ[Dead Letter Queue]
RQ --> Q
DLQ --> A{Analyze}
A -->|Schema Issue| FIX1[Schema Fix]
A -->|Data Issue| FIX2[Data Cleanup]
A -->|Bug| FIX3[Code Fix]
FIX1 --> RED[Redrive]
FIX2 --> RED
FIX3 --> RED
RED --> Q
style DLQ fill:#ff6b6b,stroke:#333,stroke-width:2px
When It Shines
- Schema evolution breaks - New event version breaks old consumers
- Data quality issues - Malformed JSON, missing required fields, encoding problems
- Business rule violations - Orders for discontinued products, payments exceeding limits
- Integration failures - Third-party API changes, unexpected response formats
- Bug manifestation - That edge case you didn’t test appearing in production
Trade-offs
- Availability vs Correctness: System continues processing good messages while isolating bad ones, but you might process out-of-order or skip related messages
- Visibility vs Complexity: Clear segregation of problem messages aids debugging, but requires separate monitoring, alerting, and tooling infrastructure
- Recovery vs Risk: Enables batch fixing and reprocessing of failed messages, but redriving old messages can cause unexpected side effects
- Automation vs Safety: Can automatically categorize and route different failure types, but automated recovery of poison messages risks cascading failures
AWS Implementation
- Primary: Amazon SQS with Dead Letter Queues - set maxReceiveCount to 3-5
- Simple retry: Lambda with exponential backoff for transient failures
- Visibility: X-Ray tracing to identify poison message patterns
- Manual review: S3 bucket for messages requiring human investigation
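Wiring up the primary option with boto3 might look like the sketch below (queue URLs are placeholders); in practice most teams set the redrive policy in their IaC templates rather than at runtime:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Placeholders - substitute your own queue URLs.
dlq_arn = sqs.get_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/orders-dlq",
    AttributeNames=["QueueArn"],
)["Attributes"]["QueueArn"]

sqs.set_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/orders",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "5",  # after 5 failed receives, SQS moves the message to the DLQ
        })
    },
)
```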
⚡ DLQ Limitations
Your DLQ is not a parking lot - it’s an emergency room. Messages don’t get better by sitting there. Without active monitoring:
- Messages expire once the retention period lapses - 14 days at most (silently!)
- Related messages process out of order
- Customers complain before you notice
Treat every DLQ message as a mini-incident. Set alarms, have runbooks, and review daily.
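For the "set alarms" part, a minimal CloudWatch alarm that fires as soon as anything lands in the DLQ could look like this (queue name and SNS topic are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fires the moment anything is sitting in the DLQ. Names and topic are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:dlq-alerts"],
)
```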
Choosing the Right Pattern
The key to successful event consumption isn't implementing every pattern - it's knowing how to combine patterns to solve real challenges. These patterns layer together like building blocks: start with SQS + Lambda and poison message handling (every system needs a DLQ from day one), then add patterns as you hit specific challenges:
flowchart LR
Start[Start:<br/>SQS+Lambda] --> Q1{Irrelevant<br/>events >50%?}
Q1 -->|Yes| F1[Event<br/>Filtering]
Q1 -->|No| Q2
F1 --> Q2{Lambda<br/>throttling?}
Q2 -->|Yes| C1[Competing<br/>Consumers]
Q2 -->|No| Q3
C1 --> Q3{Multiple<br/>teams?}
Q3 -->|Yes| G1[Consumer<br/>Groups]
Q3 -->|No| Next[→ Continue<br/>Row 2]
G1 --> Next
style Start fill:#90EE90
style F1 fill:#FFB6C1
style C1 fill:#FFB6C1
style G1 fill:#FFB6C1
style Next fill:#e0e0e0
flowchart LR
From[From<br/>Row 1 →] --> Q4{Downstream<br/>failing?}
Q4 -->|Yes| B1[Back-<br/>pressure]
Q4 -->|No| Q5
B1 --> Q5{Errors<br/>>1%?}
Q5 -->|Yes| P1[Poison<br/>Handling]
Q5 -->|No| OK[Done ✓]
P1 --> OK
style From fill:#e0e0e0
style B1 fill:#FFB6C1
style P1 fill:#FFB6C1
style OK fill:#90EE90
Pattern | Primary Use Case | Key Indicator | AWS Service | Complexity |
---|---|---|---|---|
Event Filtering | Reduce irrelevant processing | <10% events are relevant | EventBridge, SNS | Low |
Competing Consumers | Scale processing horizontally | Single consumer at capacity | SQS + Lambda | Medium |
Consumer Groups | Multiple independent readers | Different teams/processing speeds | Kinesis, MSK | High |
Backpressure Handling | Protect from overload | 10x+ load spikes | SQS + Auto Scaling | Medium |
Poison Message Handling | Isolate failing messages | Messages fail repeatedly | SQS + DLQ | Low |
Cost Considerations
Pattern | Saves | Adds | Monthly Impact* |
---|---|---|---|
Event Filtering | Lambda invocations ($0.20/1M), Lambda compute time | EventBridge rules ($1/1M events), SNS message filtering (free) | 10M events: save ~$150/mo |
Competing Consumers | Sequential processing time, timeout failures | SQS requests ($0.40/1M), more Lambda concurrency | 10M messages: add ~$4/mo |
Consumer Groups | Separate infrastructure per team, duplicate processing | Kinesis shards ($15/shard/mo), MSK ($70+/broker/mo) | 3 shards: add $45/mo; 3 brokers: add $210/mo |
Backpressure Handling | Downstream API overage charges, database connection costs | Extended message retention, monitoring/scaling overhead | Minimal if using SQS |
Poison Message Handling | System-wide outages, manual incident response | DLQ storage, monitoring infrastructure | Negligible storage, huge operational savings |
*Rough estimates for 10M events/month workload
💡 Total Cost of Ownership
These infrastructure costs are often a rounding error compared to operational complexity:
- Filter misconfiguration: One dropped customer order can cost more than a year of EventBridge rules
- Debugging time: “Why didn’t my event arrive?” across 50 services = days of engineering time
- Filter sprawl: We’ve seen teams with 2,000+ EventBridge rules that nobody fully understands
- Incident response: That 3 AM page when idempotency fails and customers get charged twice
The real cost isn’t in AWS bills - it’s in engineering time, customer trust, and system complexity. Design for operability, not just infrastructure costs.
Common Pattern Combinations
High-Volume SaaS Platform
Started with: Lambda + SNS (worked until 10K events/hour)
- Hit enterprise import surge, Lambda concurrency exhausted
- Added: Competing Consumers (SQS for load smoothing)
- Hit 95% irrelevant tenant events burning compute
- Added: Event Filtering (EventBridge by tenant ID)
- Hit database connection limits during bulk operations
- Added: Backpressure Handling (reserved concurrency)
Final: Handles 1M+ events/hour across 10K tenants
Multi-Team Data Platform
Started with: Single Kinesis stream, 3 teams
- Hit replay conflicts between teams
- Added: Consumer Groups (independent positions)
- Hit cost explosion as 50+ teams read everything
- Added: Event Filtering (EventBridge Pipes per group)
- Hit peak hour processing bottlenecks
- Added: Competing Consumers (per team scaling)
Final: Each team processes only relevant events at their own pace
E-Commerce Order Processing
Started with: SQS queues per service
- Black Friday: fulfillment crashed, analytics sat idle
- Added: Consumer Groups (independent processing)
- Next year: EU centers processing US orders
- Added: Event Filtering (regional separation)
- Flash sale overwhelmed payment provider
- Added: Backpressure Handling (circuit breakers)
Final: 100x traffic spikes handled smoothly
Financial Transaction Processing
Started with: Basic Kinesis streaming
- Audit: “Where’s data isolation?” → Added: Event Filtering
- Audit: “Separate compliance trails?” → Added: Consumer Groups
- Audit: “Processing bottleneck?” → Added: Competing Consumers
- Audit: “Market volatility handling?” → Added: Backpressure Handling
Final: All four patterns working together, millions of transactions, 30 countries
The Path Forward
Consumption patterns are where your event-driven architecture proves its worth - or falls apart. Remember: these patterns work together, not in isolation. Start simple and layer patterns as needed:
- Add filtering first - It’s the easiest win with immediate cost benefits
- Use managed scaling - Let SQS + Lambda handle concurrency for you
- Plan for failure - Every consumer needs a DLQ and retry strategy
- Monitor everything - You can’t optimize what you don’t measure
- Combine patterns thoughtfully - Most production systems use 2-3 patterns together
Your payment processor might start with just SQS competing consumers. As you grow, you’ll add filtering to reduce noise, then backpressure handling for flash sales. That’s not scope creep - it’s natural evolution.
Next up in the series: Operational Patterns - because building it is only half the battle. We’ll cover event replay, distributed tracing, and why your event schemas keep breaking production at the worst possible times.
Until then, may your queues be shallow and your consumers forever scaling.
What consumption challenges are you facing? Which pattern combinations have saved your production systems? Let me know in the comments below.
References
AWS Documentation
- Amazon EventBridge Event Filtering - Content-based filtering patterns and rules
- Amazon SQS Best Practices - Queue configuration and scaling guidance
- Lambda Event Source Mappings - Configuring Lambda consumers for SQS, Kinesis, and DynamoDB
- Kinesis Data Streams Consumers - Enhanced fan-out and consumer groups
- Amazon MSK Consumer Groups - Kafka consumer group management
Architectural Patterns
- Enterprise Integration Patterns - Foundational messaging patterns
- Competing Consumers Pattern - Microsoft’s pattern documentation
- Event-Driven Architecture Pattern - AWS’s EDA overview and best practices
- Idempotent Consumer Pattern - Handling duplicate messages
Distributed Systems Fundamentals
- The Log: What every software engineer should know - Jay Kreps on distributed logs
- Designing Data-Intensive Applications - Martin Kleppmann’s book (Chapter 11: Stream Processing)
- Building Event-Driven Microservices - Adam Bellemare’s comprehensive guide
AWS Architecture Guides
- AWS Lambda Operator Guide - Production best practices and patterns
- AWS Well-Architected Framework - Reliability Pillar - Workload architecture best practices
- Amazon SQS and AWS Lambda - Integration patterns and configuration
Cost Optimization
- EventBridge Pricing - Understanding rule and event costs
- Kinesis Pricing Calculator - Shard hour calculations
- AWS Lambda Pricing - Request and compute pricing model
Real-World Case Studies
- Netflix’s Event-Driven Architecture - Scaling event sourcing
- Uber’s Event-Driven Architecture - Handling billions of events