Dedup Algorithms - Using a Flow Control to Remove Duplicates for a Document Cache

This example shows how to dedup (remove duplicates from) documents that need to go into a document cache. Duplicate entries in a cache cause an error when you perform a document cache lookup within a set property shape or within a map.

This process is easy to understand and quick to build. Its main downside is processing time: because each document is processed individually, the time it takes to dedup increases substantially as the document count increases.

The process starts with a flat file that contains duplicate lines. For example purposes, the file is split so that multiple documents are produced. A flow control shape then sends one document at a time toward the cache. For each document, a decision shape checks whether the cache already holds a document with a matching index value. The decision shape is configured as follows: the first value performs a document cache lookup, the comparison is set to "equal to", and the second value is set to static with the field left empty.

If the lookup returns no data, the document is not yet in the cache, so it is added to the cache. If the lookup returns data, the document is a duplicate, so it is sent to the stop shape and disregarded.
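The check-then-add logic described above can be sketched outside of Boomi as a plain Python loop. This is only an illustration of the idea, not Boomi code: the function names `split_flat_file` and `dedup_into_cache`, the comma delimiter, and the choice of the first field as the cache index are all assumptions made for the example. A dictionary stands in for the document cache, and an empty string stands in for the empty static value used in the decision shape.

```python
from typing import Dict, List, Tuple


def split_flat_file(flat_file: str) -> List[str]:
    """Split a flat file into one document per line, skipping blank lines."""
    return [line for line in flat_file.splitlines() if line.strip()]


def dedup_into_cache(
    documents: List[str],
    key_index: int = 0,      # hypothetical: which field acts as the cache index
    delimiter: str = ",",    # hypothetical: field delimiter of the flat file
) -> Tuple[Dict[str, str], List[str]]:
    """Process documents one at a time, caching only the first of each key."""
    cache: Dict[str, str] = {}   # stands in for the document cache
    discarded: List[str] = []    # documents that would reach the stop shape

    for doc in documents:        # one document at a time, like the flow control
        key = doc.split(delimiter)[key_index]
        lookup = cache.get(key, "")  # "document cache lookup" on the index value
        if lookup == "":
            # Lookup returned nothing: not cached yet, so add it.
            cache[key] = doc
        else:
            # Lookup returned data: duplicate, so disregard it.
            discarded.append(doc)

    return cache, discarded
```

For instance, calling `dedup_into_cache(split_flat_file("1001,Alice\n1002,Bob\n1001,Alice\n"))` keeps one entry each for the keys `1001` and `1002` and places the second `1001,Alice` line in the discarded list. As in the process described above, every document requires its own lookup, so the total work grows with the number of documents.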

Dedup Process Overview

Article originally posted at Boomi Community.