
Dedup Algorithms - Using a Map and Flat File Profile to Remove Duplicates

This process demonstrates how to dedup data with a map and a flat file profile that is set to enforce uniqueness. Within the process, the first branch removes duplicates from a full set of data based on a single index (or field/element). The second branch takes this concept a little further: it removes duplicate index values, concatenates them into a comma-delimited string, and removes the trailing comma, so the data at the end of branch 2 can be used in a query that accepts a comma-delimited list of values.

The first method (branch 1) starts by converting XML into a flat file. In this map, three documents go in and three documents come out. The flat file profile has enforce uniqueness set on the Id element. This setting applies only within a single document; it does not compare values across documents. Therefore, the next step is a data process shape that combines the documents into one document. The next map then causes the enforce uniqueness setting to take effect, and all duplicates are removed from the data set.
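Outside of Boomi, the branch 1 logic can be expressed in a few lines of code. The sketch below is a minimal Python approximation, assuming each "document" is a list of records keyed by an Id field; the record fields and sample values are illustrative, not taken from the article's process.

```python
# Minimal sketch of branch 1: combine several "documents" (lists of records)
# and keep only the first record seen per Id, mirroring a flat file profile
# with enforce uniqueness set on the Id element. Fields are assumptions.

def dedup_records(documents):
    seen = set()
    unique = []
    for doc in documents:              # combine the documents into one set
        for record in doc:
            record_id = record["Id"]
            if record_id not in seen:  # enforce uniqueness on the Id element
                seen.add(record_id)
                unique.append(record)
    return unique

docs = [
    [{"Id": "001", "Name": "Acme"}, {"Id": "002", "Name": "Globex"}],
    [{"Id": "002", "Name": "Globex"}, {"Id": "003", "Name": "Initech"}],
    [{"Id": "001", "Name": "Acme"}],
]
print(dedup_records(docs))
# [{'Id': '001', ...}, {'Id': '002', ...}, {'Id': '003', ...}]
```

Note that, just as in the Boomi process, the combine step has to happen before the uniqueness check: deduping each document separately would leave duplicates that span documents.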

The second method (branch 2) uses a similar concept. It converts the XML data to multiple flat files, but these flat files contain only the Id element rather than the full set of data. The data is then combined so that the enforce uniqueness setting is applied in the second map. In the data process shape after the second map, the line returns are replaced with commas and the trailing comma is removed. This method can be extremely helpful when performing queries against Salesforce, NetSuite, databases, or other endpoints that allow multiple values to be concatenated together.
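The branch 2 logic reduces to deduplicating the Id values and joining them with commas. The following Python sketch assumes the same illustrative record shape as above; joining the values directly avoids the separate trailing-comma cleanup that the Boomi data process shape performs.

```python
# Sketch of branch 2: strip records down to their Id values, remove
# duplicates, then join with commas. This is the equivalent of replacing
# line returns with commas and dropping the trailing comma.

def build_id_list(documents):
    seen = set()
    ids = []
    for doc in documents:
        for record in doc:
            record_id = record["Id"]
            if record_id not in seen:
                seen.add(record_id)
                ids.append(record_id)
    return ",".join(ids)  # join() leaves no trailing comma to trim

docs = [
    [{"Id": "001"}, {"Id": "002"}],
    [{"Id": "002"}, {"Id": "003"}],
]
print(build_id_list(docs))  # "001,002,003"
# A string like this can then be fed into an IN-style query filter,
# e.g. (hypothetical) WHERE Id IN ('001','002','003')
```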

Dedup Process Overview

Article originally posted at Boomi Community.

Published Feb 23, 2021
