Simple example

In the following we will create a simple example to find duplicates using SQLPhonetics.NET for SSIS.

In this example, we first load data from a table. The data comprises the following columns:

  • Firstname
  • Lastname
  • Street
  • Postalcode
  • City
  • Country

 

To perform the duplicate check, first a connection to a database server on which the component can store data during runtime must be established. The closer this server is to the SSIS runtime, the faster the data processing. In our example, we create an ADO.NET connection called SQLPhonetics.NET TempDB. Additionally, you can configure the Windows size, the number of threads and the partial matching. In addition to these settings, you can also export the entire profile as a JSON file. Further information on the individual settings can be found in the corresponding chapters.

To perform a simple search, we create a new search parameter under Search Parameter using the button Add. In the following dialog you can define for each available input column to what percentage (threshold) the comparative data records must be identical in order to be a duplicate. You can select one of more than 10 algorithms for each column. This allows you to use an algorithm that best matches the respective data. For example, the Levenshtein algorithm works very well for postal codes, but the Cologne algorithm works very well for last names. Define as many different search parameters as possible to detect different errors in the data sets. You will never be able to define a golden rule that finds all duplicates at once.

In the next step, you must define the sort keys in the Sort Keys tab. Using the sort keys, you define together with the Windows Size in which range data records are to be compared with each other. Define at least two different sort keys for each search parameter to get a good result. A sort key should always contain the fields that you want to compare with the search parameters.

Once you have made all the settings, you can close the component editor. If you now open the Data Flow Path Editor in your data flow, you see 2 additional columns for your data. The column Parameter set 0 indicates all duplicates for a group that were found with this parameter set. The MatchingGroup column shows the duplicates for all parameter sets.