Deduplication of an at-least-once subscription

The docs on publishing and subscribing (Publishing and Subscribing with Actions :: Kalix Documentation) state: ‘Messages are guaranteed to be delivered at least once. This means that receivers must be able to handle duplicate messages.’

If we want to combine a subscription (e.g. Kafka) with an Event-Sourced entity (Implementing Event Sourced Entities in Java :: Kalix Documentation), we require exactly-once semantics.

How do you recommend doing deduplication on the receiver side?

Hi @leonardobonacci,
Proper deduplication often depends on your actual domain use case and can be implemented in many different flavors. A generic solution (not sure if it is valid for your use case) would be to have a unique id in each Kafka message. In your ES entity state, you need to keep some sort of deduplication state, like a list of the most recently processed ids (say, the last 100). Before you process any message, first check whether its id is present in that state; if so, simply ignore the message and consume the next one. Of course, with long-lived ES entities, watch out for the size of the deduplication list, which shouldn't exceed the memory limits.
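
A minimal sketch of that bounded dedup list, with hypothetical names (`OrderState`, `alreadyProcessed`, a cap of 100) that are illustrative rather than actual Kalix API:

```scala
// Hypothetical sketch: deduplication state kept inside the entity state
// as a bounded list of recently processed message ids.
final case class OrderState(processedIds: Vector[String]) {
  private val MaxIds = 100 // cap so long-lived entities stay within memory limits

  def alreadyProcessed(messageId: String): Boolean =
    processedIds.contains(messageId)

  // Record a newly processed id, evicting the oldest beyond the cap.
  def recordProcessed(messageId: String): OrderState =
    copy(processedIds = (processedIds :+ messageId).takeRight(MaxIds))
}
```

Before handling a message you would check `alreadyProcessed` and skip duplicates; after successfully handling it, persist the id via `recordProcessed`.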

Another strategy would be to use a sequence number for deduplication, more on that here. This approach consumes less memory but is trickier to get right. Be careful, because the Kafka sequence number is not a good choice for deduplication in most cases.
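
For comparison, a hedged sketch of the sequence-number variant, assuming the producer stamps each message with a strictly increasing per-entity sequence number (all names hypothetical):

```scala
// Hypothetical sequence-number deduplication: only a single Long is stored
// instead of a list of ids. Assumes the producer assigns a strictly
// increasing sequence number per entity (not a Kafka offset).
final case class SeqDedupState(lastSeqNr: Long) {
  def isDuplicate(seqNr: Long): Boolean = seqNr <= lastSeqNr
  def advance(seqNr: Long): SeqDedupState = copy(lastSeqNr = seqNr)
}
```

Gaps and reordering are what make this variant tricky to get right.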

Thanks for answering, @aludwiko.

That first solution, a unique id in each Kafka message, was exactly my concern in the context of event sourcing. The different ES operations (insertOrder, confirmOrder, shipOrder, deleteOrder, or whatever) are all Kafka-keyed on the same unique entity key in order to preserve topic-order.

It would have to be another key, then. True, something like a compound Kafka topic-partition-sequence should work.

And just to be 100% sure: with ‘In your ES entity state, you need to keep some sort of deduplication state’, do you mean a separate Value Entity (state) with ids?

> The different ES operations (insertOrder, confirmOrder, shipOrder, deleteOrder, or whatever) are all Kafka-keyed on the same unique entity key in order to preserve topic-order.

Yes, the deduplication key/id is something completely different from the entity key. Each message should have its own id for deduplication; it is very often the case that messages carry a message id anyway.

> True, something like a compound Kafka topic-partition-sequence should work.

Not entirely, especially if you want to use the sequence number from Kafka. Just put a UUID in each message and this should be fine for a start.
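
To illustrate (the names here are made up): the Kafka record key stays the entity key, so all operations for one order remain ordered in one partition, while the UUID rides inside the payload purely for deduplication:

```scala
import java.util.UUID

// Hypothetical message envelope: entity key for ordering,
// per-message UUID for deduplication.
final case class OrderMessage(
    messageId: String, // unique per message, used only for dedup
    operation: String  // e.g. "insertOrder", "confirmOrder", ...
)

val msg = OrderMessage(UUID.randomUUID().toString, "confirmOrder")
```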

> separate Value Entity (state)

No, this must be part of your domain aggregate, which is the consistency boundary. Otherwise you won't be able to achieve effectively-once delivery. Something like:

```scala
MyEventSourcingOrder(field1: Type1, field2: Type2, ..., processedIds: List[UUID])
```
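
A sketch of how that could hang together inside the aggregate (hypothetical names, not actual Kalix API): the processed id is recorded through an event, so it is part of the persisted state and survives passivation and replay:

```scala
import java.util.UUID

// Hypothetical sketch: deduplication inside the aggregate's consistency boundary.
final case class OrderDeduplicated(messageId: UUID /* plus domain data */)

final case class MyEventSourcingOrder(processedIds: List[UUID]) {
  // Command handler: ignore duplicates, otherwise emit an event.
  def handle(messageId: UUID): Option[OrderDeduplicated] =
    if (processedIds.contains(messageId)) None // duplicate, skip it
    else Some(OrderDeduplicated(messageId))

  // Event handler: the id becomes part of the persisted state.
  def applyEvent(e: OrderDeduplicated): MyEventSourcingOrder =
    copy(processedIds = (e.messageId :: processedIds).take(100))
}
```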