This is part 4 of the “EDA for the Rest of Us” series. Previously, we covered Event Design Patterns (how to structure events) and Communication Patterns (how events flow between services). Now we’re tackling the consumption side - where those beautifully designed events meet the harsh reality of production systems.

Remember that perfectly designed event you published? The one with just the right amount of data, using the ideal communication pattern? Well, it’s about to hit reality: consumers that can’t keep up, services drowning in irrelevant events, and that one team whose Lambda function processes everything three times “just to be safe.”

Welcome to consumption patterns - where theory meets the cold, hard truth of production systems.

The Consumption Challenge

Here’s what becomes clear in the trenches: producing events is the easy part. The real complexity lives on the consumption side, where you face questions like:

  • How do you scale processing when traffic increases 50x during peak events?
  • How do you filter events efficiently when 90% are irrelevant to your service?
  • How do you handle backpressure when your consumers can’t keep up?
  • How do you ensure events are processed exactly once (spoiler: you probably can’t)?

Most teams discover these challenges through experience - when event backlogs grow into millions of messages or when processing costs exceed projections.

But here’s the thing: these patterns aren’t mutually exclusive. Real production systems layer multiple consumption patterns together. You’ll filter events to reduce noise, use competing consumers to scale processing, implement consumer groups for different teams, and add backpressure handling to protect everything from meltdown. Think of these patterns as tools in a toolkit - you’ll often use several at once to build a consumption strategy that actually survives production.

The Reality of Distributed Consumption

Before diving into patterns, we need to talk about the two things that break everyone’s assumptions: duplicate delivery and out-of-order processing. These aren’t edge cases - they’re the default behavior.

Duplicate Delivery: The “At-Least-Once” Reality

Most AWS messaging services provide “at-least-once” delivery. That innocent-sounding phrase means every event will be delivered one or more times - and at any real scale, “or more” isn’t a theoretical footnote, it’s a daily occurrence. Here’s why duplicates are inevitable:

  • Network timeouts: SQS delivers a message, Lambda processes it, but the acknowledgment (delete) fails. SQS delivers it again.
  • Visibility timeout expires: Lambda takes too long, the message becomes visible again, and another Lambda grabs it.
  • Consumer failures: Lambda crashes after processing but before acknowledging. The message returns to the queue.
  • Retry mechanisms: DLQ redrives, manual replays, and error recovery all create duplicates.

This isn’t hypothetical - it happened to us. A “charge-payment” event runs twice and, without protection, you’ve just double-charged a customer. A “send-welcome-email” event runs three times, the customer gets three emails, and now they think you’re spam.

The Ordering Challenge

Events will arrive out of order. Not sometimes. Always. Here’s why:

  • Parallel processing: Competing consumers process events simultaneously
  • Retry delays: Failed events get processed after newer ones
  • Network latency: Events take different paths through the infrastructure
  • Multiple producers: Different services publish related events at different times

Living with Chaos

Here’s the kicker - this isn’t some edge case that happens once a month. This is Tuesday. Your consumer might see: “order-shipped” → “payment-failed” → “order-created” → “order-cancelled” → “payment-succeeded”

Take a moment to think about that. The order shipped before it was created. The payment failed and succeeded. How’s your state machine feeling?

You can’t eliminate these challenges, but you can design for them:

For Duplicate Delivery (Idempotency):

  • Use natural business keys (order ID, not auto-increment)
  • Make operations idempotent by design (SET status = ‘shipped’, not status++)
  • Track processed events (simple cache or database table)
  • Use conditional writes (“update only if version = X”) - see the sketch below
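
To make the conditional-write idea concrete, here’s a minimal sketch, assuming a hypothetical DynamoDB table named processed-events keyed on event_id and a process() stub standing in for your real business logic - a starting point, not a complete solution:

```python
# Minimal idempotency sketch (assumed table "processed-events", key "event_id").
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def process(event: dict) -> None:
    """Placeholder for your real business logic."""
    print("processing", event["id"])

def handle(event: dict) -> None:
    event_id = event["id"]  # use a natural business key, e.g. "order-123:shipped"
    try:
        # Conditional write: succeeds only the first time this event_id is seen.
        dynamodb.put_item(
            TableName="processed-events",
            Item={"event_id": {"S": event_id}},
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery - skip without side effects
        raise
    process(event)
```

Note the trap baked into even this simple version: if process() fails after the marker is written, the event is effectively lost unless you add compensation - exactly the kind of edge case the warning below is about.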

⚠️ Warning: Idempotency is Hard

The suggestions above are starting points, not complete solutions. Distributed idempotency is one of the hardest problems in event-driven systems. That “simple cache” will fail when TTLs expire. That “database table” will race when multiple Lambdas check simultaneously. True idempotency often requires:

  • Event sourcing patterns with immutable logs
  • Distributed locking (with all its problems)
  • Saga orchestration with compensation logic
  • Or accepting that some duplicates will slip through

Don’t underestimate this complexity. Many teams spend months getting idempotency right, only to discover edge cases in production at the worst possible time.

For Ordering:

  • Design events to be self-contained (include current state, not just deltas) - see the sketch after this list
  • Use event sourcing patterns where order matters critically
  • Implement saga patterns for multi-step processes
  • Accept eventual consistency where possible
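
For illustration, here’s what “self-contained events plus conditional writes” can look like - a sketch assuming each event carries the full current status and a monotonically increasing version, written to a hypothetical orders table in DynamoDB:

```python
# Out-of-order-safe write: apply an event only if it's newer than what's stored.
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def apply_order_event(event: dict) -> None:
    try:
        dynamodb.update_item(
            TableName="orders",  # hypothetical table
            Key={"order_id": {"S": event["order_id"]}},
            UpdateExpression="SET #s = :status, #v = :version",
            # Only apply if we've never seen this order, or the event is newer.
            ConditionExpression="attribute_not_exists(#v) OR #v < :version",
            ExpressionAttributeNames={"#s": "status", "#v": "version"},
            ExpressionAttributeValues={
                ":status": {"S": event["status"]},         # full current state, not a delta
                ":version": {"N": str(event["version"])},  # producer-assigned, increasing
            },
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # a newer event already arrived - the late one is safely ignored
        raise
```

A nice side effect: redelivering the same event also fails the version check, so this buys you a degree of idempotency as well.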

These challenges affect every pattern you’ll implement:

  • Event Filtering: Can hide duplicate events or deliver them to different consumers
  • Competing Consumers: Guarantees out-of-order processing AND duplicates
  • Consumer Groups: Each group gets their own duplicates at different times
  • Backpressure: Makes both problems worse during high load
  • Poison Message Handling: Must handle duplicates of poison messages and out-of-order DLQ processing

Here’s what took me three production incidents to learn: Don’t fight it - design for it. Every consumer must handle duplicates and out-of-order events. This isn’t a bug in your architecture; it’s a fundamental property of distributed systems.

Alright, so we’ve accepted that our events will arrive drunk and disorderly. What patterns can actually help us deal with this mess?

Pattern 1: Event Filtering

Event Filtering selectively routes or processes events based on specific criteria, content, or metadata. Like a sophisticated mail sorting system that only delivers relevant mail to each recipient, filtering prevents consumers from being overwhelmed with irrelevant events. Filtering can occur at the broker level (server-side) before delivery to reduce network traffic and consumer processing load, or at the consumer level (client-side) for more complex filtering logic. Modern event brokers support sophisticated filtering including content-based filtering (examining event payload), attribute-based filtering (examining metadata), and pattern matching using wildcards or regular expressions.

graph LR
    P[Event Producer] --> B[Event Broker]
    B --> F1[Filter: Order > $1000]
    B --> F2[Filter: Region = EU]
    B --> F3[Filter: Priority = High]
    F1 --> C1[High-Value Order Service]
    F2 --> C2[EU Compliance Service]
    F3 --> C3[Priority Handler]

    style F1 fill:#f9f,stroke:#333,stroke-width:2px
    style F2 fill:#f9f,stroke:#333,stroke-width:2px
    style F3 fill:#f9f,stroke:#333,stroke-width:2px

When Event Filtering Shines

  • High event volumes with low relevance rates - When services only care about 5-10% of events
  • Multi-tenant systems - Each tenant’s services only process their own events
  • Compliance boundaries - EU services only process EU data, healthcare services only see HIPAA-compliant events
  • Cost optimization - Reduce Lambda invocations by 90%+ by filtering at the infrastructure level

Trade-offs

  • Filtering saves money - often 90%+ on Lambda costs. But infrastructure filters only match on what’s in the event; they can’t look up customer records or evaluate derived business rules. You’ll still need code for the fancy stuff.
  • Your code gets cleaner when it only sees relevant events. The downside? Good luck debugging why an event never showed up. “Was it filtered? Dropped? Never sent?” Welcome to distributed systems debugging.
  • Security win: Services never see data they shouldn’t access. Just remember that changing filter rules means touching infrastructure, not just pushing code.
  • Here’s what nobody mentions: One wrong filter rule can silently drop critical events. No errors, no alerts, just… nothing. And there’s no undo button.

AWS Implementation

  • Primary: Amazon EventBridge for content-based routing and filtering with sophisticated pattern matching
  • Alternative: Amazon SNS with message filtering policies for simpler attribute-based filtering
  • High-throughput: Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) for real-time stream filtering and processing
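
To make the EventBridge option concrete, here’s a hedged sketch of a server-side filter rule - the bus name, event source, and payload shape are assumptions, not a prescription:

```python
# Sketch: route only high-value EU orders (targets get attached separately).
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="eu-high-value-orders",           # hypothetical rule name
    EventBusName="orders",                 # hypothetical custom bus
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],  # assumed producer source
        "detail-type": ["OrderPlaced"],
        "detail": {
            "region": ["EU"],
            "total": [{"numeric": [">", 1000]}],  # EventBridge numeric matching
        },
    }),
)
```

Targets are then attached with put_targets; everything the pattern rejects never reaches (or bills) your consumer.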

Pattern 2: Competing Consumers

Competing Consumers enables multiple consumer instances to process messages from the same message queue or topic partition, with each message handed to a single consumer at a time. This creates competition among consumers for messages - like a team of workers pulling tasks from a shared queue: each task goes to one worker, and adding workers increases overall processing speed. The messaging system distributes each message to only one active consumer, which avoids every consumer processing everything - but it doesn’t repeal the at-least-once reality: a message can still be redelivered if its consumer fails or times out. The pattern handles those failures gracefully through visibility timeouts or acknowledgment mechanisms that make messages from failed consumers available to other consumers.

graph LR
    P[Event Producer] --> Q[Message Queue]
    Q --> C1[Consumer 1]
    Q --> C2[Consumer 2]
    Q --> C3[Consumer 3]
    Q --> C4[Consumer N...]

    C1 --> D[Database]
    C2 --> D
    C3 --> D
    C4 --> D

    style Q fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#faa,stroke:#333,stroke-width:2px

When Competing Consumers Shine

  • Variable load patterns - Black Friday vs. normal Tuesday
  • CPU-intensive processing - Image processing, ML inference, report generation
  • Downstream API rate limits - Spreading load across time to respect limits
  • Resilience requirements - One consumer failure doesn’t stop processing

Trade-offs

  • Scaling becomes trivial - just add more consumers. Queue handles the distribution. But kiss message ordering goodbye - your events are now in whatever order they finish processing.
  • The queue manages everything - no coordination headaches, automatic failover. You just lose control over which consumer gets what. Sometimes Consumer 3 processes all the easy messages while Consumer 1 struggles with the hard ones.
  • Failed messages magically reappear for another consumer to try. Except now you’re guaranteed duplicates. Every. Single. Consumer. Must handle them.
  • You’ve solved the Lambda bottleneck! …and immediately hit your database connection limit. Congrats, you’ve just moved the problem downstream.

AWS Implementation

  • Primary: Amazon SQS with Lambda Event Source Mapping for serverless auto-scaling consumers
  • Container-based: Amazon ECS/Fargate with Application Auto Scaling based on queue metrics
  • Simple: Amazon SQS with EC2 Auto Scaling Groups for traditional architectures
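
Here’s a sketch of the primary option - an SQS-to-Lambda event source mapping with a concurrency cap so “competing” doesn’t turn into “stampeding”. The queue ARN, function name, and numbers are placeholders:

```python
# Sketch: wire a queue to a Lambda consumer fleet with bounded concurrency.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:orders-queue",  # placeholder
    FunctionName="process-order",                                      # placeholder
    BatchSize=10,                              # messages per invocation
    ScalingConfig={"MaximumConcurrency": 50},  # cap the number of competing consumers
)
```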

Pattern 3: Consumer Groups

Consumer groups enable multiple independent consumers to read from the same event stream, each maintaining their own position and processing rate. Like multiple people reading the same newspaper, each consumer group can process at its own pace without affecting others.

graph TD
    subgraph "Event Stream"
        E1[Event 1] --> E2[Event 2]
        E2 --> E3[Event 3]
        E3 --> E4[Event 4]
        E4 --> E5[Event N...]
    end

    subgraph "Analytics Group"
        E1 --> A1[Position: Event 2]
        A1 --> AS[Analytics Service]
    end

    subgraph "Billing Group"
        E1 --> B1[Position: Event 4]
        B1 --> BS[Billing Service]
    end

    subgraph "Audit Group"
        E1 --> AU1[Position: Event 1]
        AU1 --> AUS[Audit Service]
    end

    style E1 fill:#f9f,stroke:#333,stroke-width:2px
    style E2 fill:#f9f,stroke:#333,stroke-width:2px
    style E3 fill:#f9f,stroke:#333,stroke-width:2px

When Consumer Groups Shine

  • Multiple processing speeds - Real-time fraud detection vs. hourly analytics
  • Different retention needs - Audit needs 7 years, monitoring needs 7 days
  • Independent scaling - Each group scales based on its workload
  • Replay scenarios - Reprocess historical events without affecting other consumers

Trade-offs

  • Each group gets independence - their own position, their own speed. But now you’re managing N groups with N different positions, N different lag metrics, and N different ways things can go wrong.
  • Keep events forever! Well, as long as you’re willing to pay for it. Storage costs add up fast when every consumer group needs their own retention period.
  • No slow consumer can block fast ones - great for isolation. But each group’s throughput is still capped by the stream’s shard or partition count. More groups don’t mean more speed.
  • Replay whenever you want - fantastic for fixing bugs in production. Just remember there’s no coordination. Your audit service might be processing last week while billing is on today.

AWS Implementation

  • Primary: Amazon Kinesis Data Streams with enhanced fan-out for dedicated throughput per consumer
  • Complex patterns: Amazon MSK (Managed Kafka) for unlimited consumer groups and exactly-once semantics
  • Simplified: Amazon EventBridge Pipes to create filtered consumer groups from Kinesis or DynamoDB streams

💰 Cost Warning: Kinesis consumer groups can cost 10-20x the base shard price with enhanced fan-out. Five consumer groups processing 1TB/month can add $500+/month on top of shard costs. Always calculate enhanced fan-out pricing before choosing Kinesis for multiple consumers.
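
For the Kinesis route, each independent group registers its own enhanced fan-out consumer with its own read position. A sketch (the stream ARN and names are placeholders; in practice you’d read the shards via the KCL or SubscribeToShard rather than by hand):

```python
# Sketch: one registered consumer per independent group (billing, analytics, audit...).
import boto3

kinesis = boto3.client("kinesis")

for group in ["billing-service", "analytics-service", "audit-service"]:
    consumer = kinesis.register_stream_consumer(
        StreamARN="arn:aws:kinesis:eu-west-1:123456789012:stream/orders",  # placeholder
        ConsumerName=group,
    )
    print(group, consumer["Consumer"]["ConsumerARN"])
```

Each registration is exactly what the cost warning above is pricing: enhanced fan-out is billed per consumer-shard-hour plus data retrieved.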

Pattern 4: Backpressure Handling

Backpressure handling protects consumers from being overwhelmed by controlling the flow of events. It acts as a pressure release valve, ensuring that temporary spikes don’t cascade into system-wide failures.

graph LR
    subgraph "High Load"
        P1[Producer 1] --> B[Event Broker/Queue]
        P2[Producer 2] --> B
        P3[Producer N...] --> B
    end

    B --> RC[Rate Controller]

    subgraph "Protected Consumers"
        RC --> C1[Consumer 1]
        RC --> C2[Consumer 2]
    end

    subgraph "Backpressure Signals"
        C1 --> M1[Queue Depth]
        C2 --> M2[Error Rate]
        M1 --> RC
        M2 --> RC
    end

    RC --> T[Throttle/Scale]

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style RC fill:#faa,stroke:#333,stroke-width:2px

When It Shines

  • Flash sales and viral events - 100x normal traffic spikes
  • Batch job processing - Nightly dumps of millions of records
  • API rate limits - Third-party services with strict quotas
  • Database protection - Preventing connection pool exhaustion

Trade-offs

  • Your system stays up during chaos - cascade failures become a thing of the past. But those events are piling up somewhere, and processing delays can stretch from seconds to hours.
  • Downstream services breathe easy - no more connection pool exhaustion or rate limit breaches. The cost? You need queues and buffers everywhere, eating memory and storage.
  • Self-healing is beautiful - when load drops, everything catches up automatically. Until you realize you need monitoring at every level to know what’s actually happening.
  • Graceful degradation beats hard failures - but those long queues might hit retention limits. Nothing like losing events because they sat in a queue for 15 days.

AWS Implementation

  • Primary: Amazon SQS as a shock absorber with up to 14-day retention and effectively unlimited queue depth
  • Rate limiting: Lambda reserved concurrency to protect downstream services
  • Auto-scaling: Application Auto Scaling with queue depth metrics for dynamic capacity
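
A minimal sketch of the rate-limiting piece - capping a function’s reserved concurrency so SQS absorbs the spike instead of your database. The function name and limit are placeholders you’d tune to downstream capacity:

```python
# Sketch: hard ceiling on concurrent consumers; excess messages wait in the queue.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_concurrency(
    FunctionName="process-order",     # placeholder
    ReservedConcurrentExecutions=25,  # e.g. match your DB connection pool size
)
```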

🚧 Note: This is a simplified starting point focused on consumer-side protection. True backpressure includes signaling producers to slow down, but many teams don’t control their producers (third-party webhooks, other teams’ services). Real implementations often require database connection pooling, complex API rate limits per customer tier, and distributed rate limiting across regions. If you control both producer and consumer, add producer throttling based on queue depth or consumer lag metrics. Start here, but expect to evolve based on your actual bottlenecks.
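
If you do control the producer, the simplest form of the producer throttling mentioned above is to check queue depth before publishing - a crude sketch, with the queue URL and threshold as assumptions:

```python
# Sketch: producer slows down when the consumer backlog grows too large.
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders-queue"  # placeholder

def backlog() -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def publish_with_backpressure(body: str, max_backlog: int = 10_000) -> None:
    while backlog() > max_backlog:  # wait for consumers to catch up
        time.sleep(5)
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
```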

Pattern 5: Poison Message Handling

Poison message handling isolates and manages messages that consistently fail processing, preventing them from blocking your entire event pipeline. Like a quarantine system in a hospital, it identifies “sick” messages and moves them aside so healthy messages can continue flowing while you diagnose the problem.

graph LR
    Q[Message Queue] --> C[Consumer]
    C --> P{Process<br/>Success?}

    P -->|Yes| ACK[Acknowledge]
    P -->|No| RC{Retry<br/>Count?}

    RC -->| less than Max | RQ[Return to Queue]
    RC -->| greater than Max | DLQ[Dead Letter Queue]

    RQ --> Q

    DLQ --> A{Analyze}
    A -->|Schema Issue| FIX1[Schema Fix]
    A -->|Data Issue| FIX2[Data Cleanup]
    A -->|Bug| FIX3[Code Fix]

    FIX1 --> RED[Redrive]
    FIX2 --> RED
    FIX3 --> RED
    RED --> Q

    style DLQ fill:#ff6b6b,stroke:#333,stroke-width:2px

When It Shines

  • Schema evolution breaks - New event version breaks old consumers
  • Data quality issues - Malformed JSON, missing required fields, encoding problems
  • Business rule violations - Orders for discontinued products, payments exceeding limits
  • Integration failures - Third-party API changes, unexpected response formats
  • Bug manifestation - That edge case you didn’t test appearing in production

Trade-offs

  • Good messages keep flowing - one bad apple doesn’t spoil the bunch. But related messages might process out of order or skip their broken sibling entirely.
  • Problem messages get spotlight treatment - easy to find, debug, and fix. Now you need a whole separate monitoring stack for your DLQ. More dashboards, more alerts.
  • Batch fix and replay - fix the bug, redrive the queue, problem solved! Unless those messages are now 3 days old and the world has moved on.
  • You can build smart routing - schema errors here, business violations there. Great until your automatic recovery starts cascading failures faster than humans can respond.

AWS Implementation

  • Primary: Amazon SQS with Dead Letter Queues - set maxReceiveCount to 3-5
  • Simple retry: Lambda with exponential backoff for transient failures
  • Visibility: X-Ray tracing to identify poison message patterns
  • Manual review: S3 bucket for messages requiring human investigation
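
Here’s a sketch of the primary setup - attaching a dead-letter queue to a source queue via a redrive policy. The queue URL and ARN are placeholders:

```python
# Sketch: after 5 failed receives, SQS moves the message to the DLQ automatically.
import json
import boto3

sqs = boto3.client("sqs")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/orders-queue",  # placeholder
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:eu-west-1:123456789012:orders-dlq",  # placeholder
            "maxReceiveCount": "5",
        })
    },
)
```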

⚡ DLQ Limitations

Your DLQ is not a parking lot - it’s an emergency room. Messages don’t get better by sitting there. Without active monitoring:

  • Messages expire once retention lapses - 14 days at most (silently!)
  • Related messages process out of order
  • Customers complain before you notice

Treat every DLQ message as a mini-incident. Set alarms, have runbooks, and review daily.
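
The “set alarms” part can be as small as one CloudWatch alarm that fires whenever the DLQ has any visible messages - a sketch, with the queue name and SNS topic as placeholders:

```python
# Sketch: any message sitting in the DLQ pages someone.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-not-empty",  # placeholder
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],  # placeholder
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall"],  # placeholder topic
)
```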

Choosing the Right Pattern

The key to successful event consumption isn’t implementing every pattern - it’s knowing how to combine them to solve real challenges. These patterns layer together like building blocks: start with SQS + Lambda and poison message handling (every system needs a DLQ from day one), then add patterns as you hit specific challenges:

flowchart LR
    Start[Start:<br/>SQS+Lambda] --> Q1{Irrelevant<br/>events >50%?}
    Q1 -->|Yes| F1[Event<br/>Filtering]
    Q1 -->|No| Q2
    F1 --> Q2{Lambda<br/>throttling?}
    Q2 -->|Yes| C1[Competing<br/>Consumers]
    Q2 -->|No| Q3
    C1 --> Q3{Multiple<br/>teams?}
    Q3 -->|Yes| G1[Consumer<br/>Groups]
    Q3 -->|No| Next[→ Continue<br/>Row 2]
    G1 --> Next

    style Start fill:#90EE90
    style F1 fill:#FFB6C1
    style C1 fill:#FFB6C1
    style G1 fill:#FFB6C1
    style Next fill:#e0e0e0

flowchart LR
    From[From<br/>Row 1 →] --> Q4{Downstream<br/>failing?}
    Q4 -->|Yes| B1[Back-<br/>pressure]
    Q4 -->|No| Q5
    B1 --> Q5{Errors<br/>>1%?}
    Q5 -->|Yes| P1[Poison<br/>Handling]
    Q5 -->|No| OK[Done ✓]
    P1 --> OK

    style From fill:#e0e0e0
    style B1 fill:#FFB6C1
    style P1 fill:#FFB6C1
    style OK fill:#90EE90

| Pattern | Primary Use Case | Key Indicator | AWS Service | Complexity |
|---|---|---|---|---|
| Event Filtering | Reduce irrelevant processing | <10% of events are relevant | EventBridge, SNS | Low |
| Competing Consumers | Scale processing horizontally | Single consumer at capacity | SQS + Lambda | Medium |
| Consumer Groups | Multiple independent readers | Different teams/processing speeds | Kinesis, MSK | High |
| Backpressure Handling | Protect from overload | 10x+ load spikes | SQS + Auto Scaling | Medium |
| Poison Message Handling | Isolate failing messages | Messages fail repeatedly | SQS + DLQ | Low |

Cost Considerations

| Pattern | Saves | Adds | Monthly Impact* |
|---|---|---|---|
| Event Filtering | Lambda invocations ($0.20/1M); Lambda compute time | EventBridge rules ($1/1M events); SNS message filtering (free) | 10M events: save ~$150/mo |
| Competing Consumers | Sequential processing time; timeout failures | SQS requests ($0.40/1M); more Lambda concurrency | 10M messages: add ~$4/mo |
| Consumer Groups | Separate infrastructure per team; duplicate processing | Kinesis shards ($15/shard/mo); MSK ($70+/broker/mo) | 3 shards: add $45/mo; 3 brokers: add $210/mo |
| Backpressure Handling | Downstream API overage charges; database connection costs | Extended message retention; monitoring/scaling overhead | Minimal if using SQS |
| Poison Message Handling | System-wide outages; manual incident response | DLQ storage; monitoring infrastructure | Negligible storage; huge operational savings |

*Rough estimates for 10M events/month workload

💡 Total Cost of Ownership

These infrastructure costs are often a rounding error compared to operational complexity:

  • Filter misconfiguration: One dropped customer order can cost more than a year of EventBridge rules
  • Debugging time: “Why didn’t my event arrive?” across 50 services = days of engineering time
  • Filter sprawl: We’ve seen teams with 2,000+ EventBridge rules that nobody fully understands
  • Incident response: That 3 AM page when idempotency fails and customers get charged twice

The real cost isn’t in AWS bills - it’s in engineering time, customer trust, and system complexity. Design for operability, not just infrastructure costs.

Common Pattern Combinations

High-Volume SaaS Platform

Started with: Lambda + SNS (worked until 10K events/hour)

  • Hit enterprise import surge, Lambda concurrency exhausted
  • Added: Competing Consumers (SQS for load smoothing)
  • Hit 95% irrelevant tenant events burning compute
  • Added: Event Filtering (EventBridge by tenant ID)
  • Hit database connection limits during bulk operations
  • Added: Backpressure Handling (reserved concurrency)

Final: Handles 1M+ events/hour across 10K tenants

Multi-Team Data Platform

Started with: Single Kinesis stream, 3 teams

  • Hit replay conflicts between teams
  • Added: Consumer Groups (independent positions)
  • Hit cost explosion as 50+ teams read everything
  • Added: Event Filtering (EventBridge Pipes per group)
  • Hit peak hour processing bottlenecks
  • Added: Competing Consumers (per team scaling)

Final: Each team processes only relevant events at their own pace

E-Commerce Order Processing

Started with: SQS queues per service

  • Black Friday: fulfillment crashed, analytics sat idle
  • Added: Consumer Groups (independent processing)
  • Next year: EU centers processing US orders
  • Added: Event Filtering (regional separation)
  • Flash sale overwhelmed payment provider
  • Added: Backpressure Handling (circuit breakers)

Final: 100x traffic spikes handled smoothly

Financial Transaction Processing

Started with: Basic Kinesis streaming

  • Audit: “Where’s data isolation?” → Added: Event Filtering
  • Audit: “Separate compliance trails?” → Added: Consumer Groups
  • Audit: “Processing bottleneck?” → Added: Competing Consumers
  • Audit: “Market volatility handling?” → Added: Backpressure Handling

Final: All four patterns working together, millions of transactions, 30 countries

The Path Forward

Consumption patterns are where your event-driven architecture proves its worth - or falls apart. Remember: these patterns work together, not in isolation. Start simple and layer patterns as needed:

  1. Add filtering first - It’s the easiest win with immediate cost benefits
  2. Use managed scaling - Let SQS + Lambda handle concurrency for you
  3. Plan for failure - Every consumer needs a DLQ and retry strategy
  4. Monitor everything - You can’t optimize what you don’t measure
  5. Combine patterns thoughtfully - Most production systems use 2-3 patterns together

Your payment processor might start with just SQS competing consumers. As you grow, you’ll add filtering to reduce noise, then backpressure handling for flash sales. That’s not scope creep - it’s natural evolution.

Next up in the series: Operational Patterns - because building it is only half the battle. We’ll cover event replay, distributed tracing, and why your event schemas keep breaking production at the worst possible times.

Until then, may your queues be shallow and your consumers forever scaling.


What consumption challenges are you facing? Which pattern combinations have saved your production systems? Let me know in the comments below.

References

AWS Documentation

  1. Amazon EventBridge Event Filtering - Content-based filtering patterns and rules
  2. Amazon SQS Best Practices - Queue configuration and scaling guidance
  3. Lambda Event Source Mappings - Configuring Lambda consumers for SQS, Kinesis, and DynamoDB
  4. Kinesis Data Streams Consumers - Enhanced fan-out and consumer groups
  5. Amazon MSK Consumer Groups - Kafka consumer group management

Architectural Patterns

  1. Enterprise Integration Patterns - Foundational messaging patterns
  2. Competing Consumers Pattern - Microsoft’s pattern documentation
  3. Event-Driven Architecture Pattern - AWS’s EDA overview and best practices
  4. Idempotent Consumer Pattern - Handling duplicate messages

Distributed Systems Fundamentals

  1. The Log: What every software engineer should know - Jay Kreps on distributed logs
  2. Designing Data-Intensive Applications - Martin Kleppmann’s book (Chapter 11: Stream Processing)
  3. Building Event-Driven Microservices - Adam Bellemare’s comprehensive guide

AWS Architecture Guides

  1. AWS Lambda Operator Guide - Production best practices and patterns
  2. AWS Well-Architected Framework - Reliability Pillar - Workload architecture best practices
  3. Amazon SQS and AWS Lambda - Integration patterns and configuration

Cost Optimization

  1. EventBridge Pricing - Understanding rule and event costs
  2. Kinesis Pricing Calculator - Shard hour calculations
  3. AWS Lambda Pricing - Request and compute pricing model

Real-World Case Studies

  1. Netflix’s Event-Driven Architecture - Scaling event sourcing
  2. Uber’s Event-Driven Architecture - Handling billions of events