Deduplication of an at-least-once subscription

The docs on publishing and subscribing (Publishing and Subscribing with Actions :: Kalix Documentation) state: ‘Messages are guaranteed to be delivered at least once. This means that receivers must be able to handle duplicate messages.’

If we want to combine a subscription (e.g. Kafka) with an Event-Sourced entity (Implementing Event Sourced Entities in Java :: Kalix Documentation), we require exactly-once semantics.

How do you recommend doing deduplication on the receiver side?

Hi @leonardobonacci,
Proper deduplication often depends on your actual domain use case and can be implemented in many different flavors. A generic solution (not sure if it is valid for your use case) would be to have a unique id in each Kafka message. In your ES entity state, you need to keep some sort of deduplication state, like a list of the most recently processed ids (say, the last 100). Before you process any message, first check whether its id is present in that state; if so, simply ignore the message and consume the next one. Of course, with long-lived ES entities, watch out for the size of the deduplication list, which shouldn't exceed the memory limits.
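
A minimal sketch of that bounded dedup list, with hypothetical names (`OrderState`, `alreadyProcessed`, a cap of 100) that are illustrative rather than actual Kalix API:

```scala
// Hypothetical sketch: deduplication state kept inside the entity state
// as a bounded list of recently processed message ids.
final case class OrderState(processedIds: Vector[String]) {
  private val MaxIds = 100 // cap so long-lived entities stay within memory limits

  def alreadyProcessed(messageId: String): Boolean =
    processedIds.contains(messageId)

  // Record a newly processed id, evicting the oldest beyond the cap.
  def recordProcessed(messageId: String): OrderState =
    copy(processedIds = (processedIds :+ messageId).takeRight(MaxIds))
}
```

Before handling a message you would check `alreadyProcessed` and skip duplicates; after successfully handling it, persist the id via `recordProcessed`.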

Another strategy would be to use a sequence number for deduplication, more on that here. This approach consumes less memory but is trickier to get right. Be careful, because the Kafka sequence number is not a good choice for deduplication in most cases.
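
For comparison, a hedged sketch of the sequence-number variant, assuming the producer stamps each message with a strictly increasing per-entity sequence number (all names hypothetical):

```scala
// Hypothetical sequence-number deduplication: only a single Long is stored
// instead of a list of ids. Assumes the producer assigns a strictly
// increasing sequence number per entity (not a Kafka offset).
final case class SeqDedupState(lastSeqNr: Long) {
  def isDuplicate(seqNr: Long): Boolean = seqNr <= lastSeqNr
  def advance(seqNr: Long): SeqDedupState = copy(lastSeqNr = seqNr)
}
```

Gaps and reordering are what make this variant tricky to get right.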

Thanks for answering, @aludwiko.

That first solution, a unique id in each Kafka message, was exactly my concern in the context of event sourcing. The different ES operations (insertOrder, confirmOrder, shipOrder, deleteOrder, or whatever) are all Kafka-keyed on the same unique entity key in order to preserve topic-order.

It would have to be another key, then. True, something like a compound Kafka topic-partition-sequence should work.

And just to be 100% sure: with ‘In your ES entity state, you need to keep some sort of deduplication state’, do you mean a separate Value Entity (state) with ids?

> The different ES operations (insertOrder, confirmOrder, shipOrder, deleteOrder, or whatever) are all Kafka-keyed on the same unique entity key in order to preserve topic-order.

Yes, the deduplication key/id is something completely different from the entity key. Each message should have its own id for deduplication; it is very often the case that messages carry a message id anyway.

> True, something like a compound Kafka topic-partition-sequence should work.

Not entirely, especially if you want to use the sequence number from Kafka. Just put a UUID in each message and this should be fine for a start.
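
To illustrate (the names here are made up): the Kafka record key stays the entity key, so all operations for one order remain ordered in one partition, while the UUID rides inside the payload purely for deduplication:

```scala
import java.util.UUID

// Hypothetical message envelope: entity key for ordering,
// per-message UUID for deduplication.
final case class OrderMessage(
    messageId: String, // unique per message, used only for dedup
    operation: String  // e.g. "insertOrder", "confirmOrder", ...
)

val msg = OrderMessage(UUID.randomUUID().toString, "confirmOrder")
```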

> separate Value Entity (state)

No, this must be part of your domain aggregate, which is the consistency boundary. Otherwise you won't be able to achieve effectively-once delivery. Something like:

```scala
MyEventSourcingOrder(field1: Type1, field2: Type2, ..., processedIds: List[UUID])
```
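
A sketch of how that could hang together inside the aggregate (hypothetical names, not actual Kalix API): the processed id is recorded through an event, so it is part of the persisted state and survives passivation and replay:

```scala
import java.util.UUID

// Hypothetical sketch: deduplication inside the aggregate's consistency boundary.
final case class OrderDeduplicated(messageId: UUID /* plus domain data */)

final case class MyEventSourcingOrder(processedIds: List[UUID]) {
  // Command handler: ignore duplicates, otherwise emit an event.
  def handle(messageId: UUID): Option[OrderDeduplicated] =
    if (processedIds.contains(messageId)) None // duplicate, skip it
    else Some(OrderDeduplicated(messageId))

  // Event handler: the id becomes part of the persisted state.
  def applyEvent(e: OrderDeduplicated): MyEventSourcingOrder =
    copy(processedIds = (e.messageId :: processedIds).take(100))
}
```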