A checkpoint identifies a point in time when your data is in a verifiable, production‐ready state. For example, a checkpoint can be thought of as a backup of a set of data. That backup can be used in recovery scenarios to return data that has been corrupted, deleted, or delayed to a usable and unbroken state. In Azure Stream Analytics, checkpointing is performed automatically by the platform. Every few minutes, a checkpoint is taken on each node that is processing your streaming data. If a platform upgrade happens or the node experiences a failure, the checkpoint is used to recover and restart the job on a new node. Two concepts you learned about in Chapter 4 are relevant here: point‐in‐time restore (PITR) and recovery point objective (RPO). PITR depends on a data snapshot taken at a given point in time, which is useful for restoring from data loss or data corruption. RPO defines the maximum amount of data loss, measured in time, that is acceptable after a failure; in other words, how far back the most recent recoverable point is allowed to be. Both concepts apply to the checkpointing capabilities offered with Azure Stream Analytics, which work behind the scenes without any action required from you.
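The recovery behavior described above can be pictured with a small sketch. This is only an illustration of the general checkpoint-and-restart idea, not how Azure Stream Analytics implements it internally; the class name `CheckpointedCounter`, the `checkpoint_every` parameter, and the sample events are all hypothetical.

```python
import copy


class CheckpointedCounter:
    """Toy streaming job (hypothetical): counts events per key and
    saves a checkpoint of its state every few processed events."""

    def __init__(self, checkpoint_every=3):
        self.state = {}              # running count per event key
        self.offset = 0              # number of events processed so far
        self.checkpoint_every = checkpoint_every
        self._checkpoint = (0, {})   # (offset, state) of the last checkpoint

    def process(self, events):
        """Process the events that come after the current offset."""
        for event in events[self.offset:]:
            self.state[event] = self.state.get(event, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Snapshot a consistent (offset, state) pair.
                self._checkpoint = (self.offset, copy.deepcopy(self.state))

    def recover(self):
        """Simulate a restart on a new node: reload the last checkpoint."""
        self.offset, saved_state = self._checkpoint
        self.state = copy.deepcopy(saved_state)


events = ["a", "b", "a", "c", "a", "b", "c"]

job = CheckpointedCounter(checkpoint_every=3)
job.process(events[:5])   # the node "fails" after five events
job.recover()             # roll back to the checkpoint taken at offset 3
job.process(events)       # replay from offset 3 to the end of the stream

# A job that never failed, for comparison.
reference = CheckpointedCounter(checkpoint_every=3)
reference.process(events)
```

After the recovery and replay, `job.state` matches `reference.state`: the work done between the last checkpoint and the failure is discarded and then redone, so no event is counted twice.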
A data stream, by definition, is continuously flowing; it never stops. Monitoring, replaying, or repairing data within a stream that never stops requires some supporting capabilities. One such capability is a watermark. A watermark is an indicator that marks the point at which an event message has been ingressed by the stream processor. Figure 7.40 illustrates a data stream; each vertical line within the stream represents an event message.
FIGURE 7.40 A data stream with event messages and a watermark
Each event message in the data stream has an event time and an arrival time, each of which, as you can see, is a very precise datetime stamp. The event time is the timestamp that represents when the event message was generated by the data‐producing device, and it is part of the event message payload. The arrival time is the timestamp that represents when the event message reached the ingestion endpoint, for example, an event hub. Each event message in the data stream is linked to a watermark that advances in increments set by the windowing configuration for the given data stream. For example, the following window is defined as 5 seconds. The watermark will be the same for all event messages ingressed into the stream pipeline within that time window, as long as the event time and arrival time fall within the same 5‐second period.
GROUP BY IngestionTime, TumblingWindow(second, 5)
The watermark is referenced for monitoring the performance of the data stream within that time window. The data stream can be replayed between two timestamps, or repaired when it is determined that something unexpected happened during a given time frame. Keep in mind that this feature is managed by the platform and is abstracted away from you to the point where it is not easily observable. To better understand the watermarking concept, consider the data stream details shown in Table 7.7.
TABLE 7.7 Data stream illustration
Sequence | Arrival time | Event time | Watermark |
0 | 10:00:00 | 10:00:01 | 10:00:00 |
1 | 10:00:02 | 10:00:02 | 10:00:00 |
2 | 10:00:04 | 10:00:04 | 10:00:00 |
3 | 10:00:06 | 10:00:06 | 10:00:05 |
4 | 10:00:09 | 10:00:09 | 10:00:05 |
5 | 10:00:11 | 10:00:10 | 10:00:10 |
6 | 10:00:11 | 10:00:05 | 10:00:10 |
7 | 10:00:13 | 10:00:13 | 10:00:10 |
8 | 10:00:15 | 10:00:15 | 10:00:15 |
9 | 10:00:17 | 10:00:16 | 10:00:15 |
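The watermark column in Table 7.7 can be reproduced with a short sketch. It assumes a simplified model in which the watermark is the arrival time rounded down to the start of its 5‐second window. That model matches the table, but it is not the platform's actual watermark algorithm, and the helper names `parse` and `window_start` are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW_SECONDS = 5  # same size as TumblingWindow(second, 5)


def parse(clock_time: str) -> datetime:
    """Parse an HH:MM:SS string such as those in Table 7.7."""
    return datetime.strptime(clock_time, "%H:%M:%S")


def window_start(ts: datetime, size_seconds: int = WINDOW_SECONDS) -> datetime:
    """Round a timestamp down to the start of its tumbling window."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    return midnight + timedelta(seconds=elapsed - elapsed % size_seconds)


# Arrival times for sequences 0 through 9, copied from Table 7.7.
arrivals = [
    "10:00:00", "10:00:02", "10:00:04", "10:00:06", "10:00:09",
    "10:00:11", "10:00:11", "10:00:13", "10:00:15", "10:00:17",
]

watermarks = [window_start(parse(t)).strftime("%H:%M:%S") for t in arrivals]

for sequence, (arrival, watermark) in enumerate(zip(arrivals, watermarks)):
    print(sequence, arrival, watermark)
```

Note that sequence 6 carries an event time of 10:00:05 but arrives at 10:00:11, so under this model it receives the 10:00:10 watermark along with the other messages that arrived in that window.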