An upsert in the stream processing context is the same as in other data analytics scenarios, such as batch or pipeline transformation processing. The action performed when executing an upsert is either an update or an insert. If the data currently being processed already exists in the datastore, that data is updated; if it does not exist, it is inserted. The questions are when in the stream processing pipeline data can be upserted and where the data is stored so that the upsert can be performed. The answers, as in most cases, depend on the requirements of the stream processing solution. Consider the lambda architecture hot path, which feeds data to a speed layer and a serving layer (refer to Figure 3.13). Data flowing along the speed layer that is consumed in real time has a very short window in which it can be upserted. Consider Figure 7.25, where an input table persists data in memory. You can use the processing that happens between that input table and the results table to perform the logic required to execute an upsert. Because the data is persisted into a datastore like Azure Cosmos DB, ADLS, or an Azure Synapse Analytics SQL pool along the serving layer, an upsert can be performed as the data arrives, before its consumption by downstream clients. The following sections discuss additional details about upserts specific to Azure Stream Analytics and Azure Databricks.
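The update-or-insert decision itself can be sketched in a few lines of Python. Here the datastore is a plain dictionary keyed on a record identifier, a local stand-in for whichever serving-layer store is in use; the function and field names are illustrative, not part of any Azure SDK:

```python
def upsert(datastore: dict, record: dict) -> str:
    """Insert the record if its key is new; otherwise update
    the existing entry with the incoming values."""
    key = record["id"]
    if key in datastore:
        datastore[key].update(record)  # update path: key already present
        return "updated"
    datastore[key] = dict(record)      # insert path: key seen for the first time
    return "inserted"

store = {}
first = upsert(store, {"id": 1, "temperature": 21.5})   # "inserted"
second = upsert(store, {"id": 1, "temperature": 22.0})  # "updated"
```

After both calls the store holds a single record for `id` 1, carrying the most recent temperature, which is the behavior an upsert guarantees regardless of how many times the same key arrives on the stream.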

Azure Stream Analytics

The best approach to performing upserts with Azure Stream Analytics is to write the streamed data into a persisted datastore, perform the upsert there, and then make the result available for consumption. This approach aligns with the serving layer component of the lambda architecture. Numerous products can persist the streamed data so that an upsert can be performed; a few of them are covered here.
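In the persisted datastore, the upsert is typically expressed as a single conditional statement rather than application logic. As a local stand-in for a serving-layer store such as a Synapse SQL pool, the pattern can be demonstrated with SQLite's `INSERT ... ON CONFLICT DO UPDATE` clause; the table and column names here are invented for illustration:

```python
import sqlite3

# In-memory database standing in for the persisted serving-layer datastore.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (device_id TEXT PRIMARY KEY, temperature REAL)"
)

def upsert_reading(device_id: str, temperature: float) -> None:
    # One statement covers both paths: insert a new device row,
    # or overwrite the temperature when the device already exists.
    conn.execute(
        """INSERT INTO readings (device_id, temperature)
           VALUES (?, ?)
           ON CONFLICT(device_id) DO UPDATE
           SET temperature = excluded.temperature""",
        (device_id, temperature),
    )

upsert_reading("d1", 21.5)  # first arrival: row inserted
upsert_reading("d1", 22.0)  # same key: row updated, still one row
```

The equivalent in a dedicated SQL pool or Cosmos DB would use that product's own upsert mechanism (for example, a `MERGE` statement or an upsert API call), but the key-match-then-update-or-insert logic is the same.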
