Friday, May 29, 2015

SSIS Remove Duplicate Rows Using Fuzzy Grouping (SSIS Transformations)

SSIS Remove Duplicate Rows Using Fuzzy Grouping (SSIS Transformations)


Source data may have the duplicate rows which needs to be removed as part of data cleansing task.
Fuzzy Grouping is one of the transformation in Data Flow Transformations which can be used to group the similar rows in the source stream line. This transformation performs data cleaning tasks by identifying rows of data that are likely to be duplicate.
Lets generate some of duplicate records as shown in below screen:-
Source Query
Create one DataFlow task inside which Drag OLE DB Source component and write the query shown above which will become as source data which has duplicate records.
 OLEDBSource
Drag the Fuzzy Group component and open the editor.
Go to the Columns tab, and check the checkbox agist the column which you want to analyze of similarity.
FuzzyGroupParticipatingColumn
after that go to the advanced tab. Here we can define Similarity threshold.
SimilarityThreshold
Important values in aboe screenshot are _key_in and _key_out in our example as we are removing duplicate records.
 Lets for temporary use drag the Union All component and add dataviewer control to look for the data after passing through FuzzyGroup Transformation.
In below screen, we can see the result.
FuzzyGroupOutput
So we can see in above screen , Key_in and Key_out has same value for unique rows. Hence we can take only those records which have Key_in=Key_out.
Remove the Union all and add conditional split transformation to get the unique rows as shown in below screen:-
ConditionalSplit
after that to check that drag Unionall component and add data viewer and run the task.
After running the data flow task it will show below screen:-
FinalResult
Hence in above result we can see Unique rows.

Fuzzy Grouping Transformation Sample

 Fuzzy Grouping Transformation

Fuzzy Grouping Transformation

Use of fuzzy grouping
I wanted to learn, how to use Fuzzy grouping transformation. But I was not getting any scenario, where this transformation could be used. Then while going through Google search, I got some scenario in which this transformation can be used. The scenarios are as follows:
·         Scenario1: Say there are various clothes shops in a market. Some sales man sells cloths in shops. One wants to get the salesman, who has sold maximum clothes. For this the records of sale for each salesman in shop1 and shop2 is as follows:
Shop1:                                                                  shop2:               

        But there are some names, which misspelled .For getting the total sale of each salesman, we will need to use fuzzy grouping, which will group the names which are similar and will assign a common name to the similar names, so after that we can aggregate the sales of each salesman by that common name.
For example; jayant and jayanath are similar, in which one of name misspelled. So the total sale of jayant (jayanth ) is 107.
·         Another scenario can be, when country name is misspelled or if company names misspelled etc.
Implementation of Fuzzy Grouping Transformation
Here I am going to implement the first scenario1, in which we will be doing fuzzy grouping of shop1 and shop2 by Name, and will calculate the total sales made by each salesman. So the result should look like
For implementation, I have followed below steps:
Step1: Add a DFT in your package
Step 2: For source, I have used one excel connection for shop2, and one Oledb connection for Shop1.
I did this, so that we can make this scenario a bit difficult, and we could learn some more.
Before using OleDB connection, let’s create the table and insert the data in SSMS, as below

For using Excel connection insert data of shop2 into one excel sheet of an excelfile

Now we will add Oledb source and Excel source in the DFT, and will configure both

       

Step3:  Union all records of both the shops
As we want to union the records of both the shops, so that we can apply fuzzy grouping on all the records of both shops.
For this, we need to use ‘Union All’ Transformation. But for doing Union, both the source should have same data type for their columns. For making the data type of the mapping columns similar, I am using Derived Column Transformation for both the source.
Now we will be typecasting column Sales of both the shops into one data type(in Derived column transformation) for this double click on derived column transformation, and then add one derived column say ‘sales1’ and set expression , as shown below:
Where (DT_R8)sales means, we are type casting sales column of each to DT_R8 data type, which is nothing but double precision float data type. This is one for both source’s column sales.
Now, we can apply on both the shops, Union All editor looks like below:

Step 4: Add Fuzzy grouping transformation and double click on it and go ‘Coulmn’ tab
                               
In the columns tab,
check the checkbox on left of the column, for applying fuzzy matching on that column, here  I have check the checkbox of name, that means the fuzzy matching will be applied on name column, as you can see match type as ‘Fuzzy’. 
Check the checkbox on right of the column (ie; Pass through), for using those columns in your result. As here I have checked column sales, because in the result we want to get total sale of each sales man , so this column will also be used.
In the Advanced tab
We can set the similarity threshold, token delimiters, etc.
·         Similarity threshold as 0.80: that means only matches which are at least 80% similar to the searched will only be displayed.
·         Token Delimiter: This is default set, as shown below. The delimiters below will break each sentence into multiple words if they found the delimiters, and then the matching will be done.
·         _key_in, a column that uniquely identifies each row.
·         _key_out, a column that identifies a group of duplicate rows. The _key_out column has the value of the _key_in column in the canonical data row. Rows with the same value in _key_out are part of the same group. The _key_outvalue for a group corresponds to the value of _key_in in the canonical data row.
·         _score, a value between 0 and 1 that indicates the similarity of the input row to the canonical row.



Step 5: Add Aggregate transformation, Excel Destination
Aggregate transformation is used for getting the aggregate sum of sale per person.
Before that add a ‘Data Viewer’ on the connector of Fuzzy grouping and Aggregate transformation. So that we can see some result from fuzzy grouping.
Configure the Aggregate transformation
Apply operation sum on Sales column, and Group by on Name_Clean. That means, sales will be sum by Name _Clean. Where Name_Clean is the column generated by fuzzy transformation, along with name, s ales etc., which contains value of name which are part of the same group. Rows with the same value in _key_out are part of the same group.
Step 6: Execute DFT.
After Fuzzy grouping, one Data viewer is used, which gets popup while execution. Here in the Data view we can see that, Name_Clean is Jayanth for both ‘jayant’ and ‘jayanth’. Similarly Name_Clean is ‘sidhart’ for both ‘sidharth’ and ‘sidhart’.

When we continue execution, then one more data viewer is in the connector between Aggregate transformation, and Excel Destination. That data view gets pop up
That shows the result after aggregate, as Sidhart has total 107 sales, and jayanth has 154 sales
This is how the whole task executes, and stores the result in Excel destination


Solution: Reporting Services permissions are insufficient for performing operation (rsAccessDenied)

Here's what to do: 

  1. Make sure you have access configured to the URL http://localhost/reports using the SQL Reporting Services Configuration. To do this:
    1. Open Reporting Services Configuration Manager -> then connect to the report server instance  -> then click on Report Manager URL.
    2. In the Report Manager URL page, click the Advanced button -> then in the Multiple Identities for Report Manager, click Add.
    3. In the Add a Report Manager HTTP URL popup box, select Host Header and type in: localhost
    4. Click OK to save your changes.
  2. Now start/ run Internet Explorer using Run as Administator... 
      
    • NOTE: If you don't see the 'Site Settings' link in the top left corner while at http://localhost/reports it is probably because you aren't running IE as an Administator or you haven't assigned your computers 'domain\username' to the reporting services roles, see how to do this in the next few steps.
  3. Then go to: http://localhost/reports   (you may have to login with your Computer's username and password)
  4. You should now be directed to the Home page of SQL Server Reporting Services here:http://localhost/Reports/Pages/Folder.aspx
  5. From the Home page, click the Properties tab, then click New Role Assignment
  6. In the Group or user name textbox, add the 'domain\username' which was in the error message (in my case, I added:My-PC\MyUserName for the 'domain\username', in your case you can find the domain\username for your computer in the rsAccessDenied error message).
  7. Now check all the checkboxes; Browser, Content Manager, My Reports, Publisher, Report Builder, and then click OK.
  8. You're domain\username should now be assigned to the Roles that will give you access to deploy your reports to the Report Server.  If you're using Visual Studio or SQL Server Business Intelligence Development Studio to deploy your reports to your local reports server, you should now be able to.
  9. Hopefully, that helps you solve your Reports Server rsAccessDenied error message...
Just to let you know this tutorial was done on a Windows 7 computer with SQL Server Reporting Services 2008.

Popular Posts

Recent Posts

Unordered List

Text Widget