For the latest stable version, please use StreamX Guides 1.0.1! |
Stateful computation in StreamX
Many streaming applications require processing incoming data based on previously processed data. The primary challenge lies in managing the continuous influx of messages, which must be considered during the processing of others. In distributed systems like StreamX Mesh, messages can be processed by different instances or functions within a single instance. StreamX’s event-driven nature requires a mechanism to perform actions based on historical data. Each StreamX service reacts to messages from an incoming channel and, if necessary, stores them for further processing. When a message is consumed, StreamX ensures that it is available in all the storage required by the service. The consuming method has access to the state, which reflects any Store's
channel.
Example application that requires stateful computation might be a sitemap generator. Different CMS system produces pages that are persisted in sitemap generator’s store. The scheduled job generates sitemap.xml
based on the stored pages and sends it to web server.
Persistent log Store
Streaming applications with multiple instances need access to a single source of truth when working with historical data. While traditional databases might seem like the obvious choice, StreamX requires more than just data persistence. In addition to storing the data, it ensures that messages are processed in the correct order and that persisted messages are available until processing starts on the current message. The persistent log fulfills these requirements. When a message arrives, it is automatically written to the end of the persistent log and registered for processing. Processing only begins after the message has been read from the log by the service. This ensures that all previous messages have been persisted and are ready for use. Each replica processes only the messages it has registered. This design allows StreamX to handle a vast number of messages in parallel.
The Store
Store is an interface that provides three core functionalities in a StreamX service: state, message ordering, and message synchronization. The Store
is backed by a local store that keeps a copy of the data from the persistent log. The local copy can be stored either in memory or on disk, and each service instance maintains its own local store for performance reasons.
State of the channel
Any message sent to the incoming channel can be placed directly into the Store
or into a modified version if a message converter is used. It remains there until a newer message arrives. This data represents the channel’s history and can be accessed whenever the business logic requires it.
Message ordering
When a Store
is declared in a service, it ensures that any outdated message sent to the channel is skipped. A message is considered outdated if a message with the same key
and a more recent eventTime
has already been processed. If the eventTime
is equal, the message is still processed.
In the above diagram messages with the same key are marked with the same color. Time is 'T1, T2, T3'. For each message key only the latest ones are accepted for further processing.
Message synchronization
The Store
enables message synchronization. This mechanism ensures that when a message is processed, all prior messages sent by any replica are already present in the Store
. Each incoming message is read by a single service instance and written to the persistent log. The message remains in the service queue until read from the log. Other instances send their messages to the log simultaneously. The persistent log ensures that messages are written in order and that, by the time a message is processed, all preceding messages are available in the local store of each service instance. By default, services wait for messages sent to the Store
channel, but you can choose how channels are synchronized. See the available synchronization modes.