SSIS Remove Duplicate Rows Using Fuzzy Grouping

Source data may have the duplicate rows which needs to be removed as part of data cleansing task.
Fuzzy Grouping is one of the transformation in Data Flow Transformations which can be used to group the similar rows in the source stream line. This transformation performs data cleaning tasks by identifying rows of data that are likely to be duplicate.
Lets generate some of duplicate records as shown in below screen:-
Source Query
Create one DataFlow task inside which Drag OLE DB Source component and write the query shown above which will become as source data which has duplicate records.
 OLEDBSource
Drag the Fuzzy Group component and open the editor.
Go to the Columns tab, and check the checkbox agist the column which you want to analyze of similarity.
FuzzyGroupParticipatingColumn
after that go to the advanced tab. Here we can define Similarity threshold.
SimilarityThreshold
Important values in aboe screenshot are _key_in and _key_out in our example as we are removing duplicate records.
 Lets for temporary use drag the Union All component and add dataviewer control to look for the data after passing through FuzzyGroup Transformation.
In below screen, we can see the result.
FuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutputFuzzyGroupOutput
So we can see in above screen , Key_in and Key_out has same value for unique rows. Hence we can take only those records which have Key_in=Key_out.
Remove the Union all and add conditional split transformation to get the unique rows as shown in below screen:-
ConditionalSplit
after that to check that drag Unionall component and add data viewer and run the task.
After running the data flow task it will show below screen:-
FinalResult
Hence in above result we can see Unique rows.

Comments

Popular posts from this blog

SharePoint 2007 - Simple Task Dashboard

MERGE transformation in SSIS