
Dedup Algorithms - Using a Data Shape (Combine and Split) to Remove Duplicates in a Document Cache

This example shows how to dedup (remove duplicates from) documents that need to go into a document cache. The process does not truly remove the duplicates; it restructures the data so that the document cache no longer sees them as duplicate entries. Duplicate entries in a cache cause an error when you perform a document cache lookup within a Set Properties shape or within a map.

The main benefit of this process is that it is easy to understand and quick to build. Its only real downside is the time required to convert all of the documents to a flat file format (the map shape). Otherwise it is an efficient way to dedup for a document cache. Limiting the amount of data within the cache can also reduce the load on memory and the amount of data that needs to be written to or read from disk.

The process works by taking multiple documents (not flat files) and converting them to flat files with a map. If you are already working with flat files, only the data process shape is needed. After the map, the data process shape combines the flat files into one document, then splits that document on the same values that make up the index of the cache. As a result, all of the duplicates for a given key end up within one flat file document. When data is pulled from the cache, only the first line of that document is used. This is what ultimately prevents duplicates: it relies on how Boomi treats flat files. If the cache index contains multiple keys, the data process shape needs to split on each element that is a key within the index.
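The combine-and-split logic can be modeled outside of Boomi to see why it works. The following is a minimal sketch in plain Python, not Boomi configuration: the function names (`combine`, `split_by_keys`, `build_cache`), the comma-delimited layout with a header row, and the sample data are all assumptions made for illustration. It only mimics the behavior described above, where rows sharing the same key values land in one document and a lookup uses the first line of that document.

```python
import csv
import io


def combine(documents):
    """Combine flat-file documents that share one header into a single row list."""
    header = None
    rows = []
    for doc in documents:
        reader = csv.reader(io.StringIO(doc))
        doc_header = next(reader)
        if header is None:
            header = doc_header
        # Skip blank lines; keep every data row, duplicates included.
        rows.extend(row for row in reader if row)
    return header, rows


def split_by_keys(header, rows, key_columns):
    """Split the combined rows so each distinct key value gets one document.

    key_columns (hypothetical parameter) lists every column that is a key in
    the cache index; with several keys, the grouping uses all of them.
    """
    key_idx = [header.index(col) for col in key_columns]
    groups = {}
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        groups.setdefault(key, []).append(row)
    return groups


def build_cache(groups):
    """Model the lookup behavior: only the first line of each document is used."""
    return {key: doc_rows[0] for key, doc_rows in groups.items()}


if __name__ == "__main__":
    # Three single-row flat files; two of them share the key id=1.
    docs = ["id,name\n1,Ann\n", "id,name\n2,Bob\n", "id,name\n1,Ann B\n"]
    header, rows = combine(docs)
    groups = split_by_keys(header, rows, ["id"])
    cache = build_cache(groups)
    print(len(groups))        # 2 documents instead of 3 cache entries
    print(cache[("1",)])      # ['1', 'Ann']
```

Note that the duplicate row for `id=1` is still present inside its group; it is simply never returned by a lookup. This mirrors the point that the method hides duplicates from the cache rather than deleting them.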

[Figure: dedup process overview]

Article originally posted at Boomi Community.

Published Jun 5, 2021
