André Kienitz

At our customer we found duplicated data in a table where duplicated data should not be there. In our analysis we found out that the table was loaded with the following insert statement:

select t1.*, t2.*
from mytable1 t1
left outer join
(select t2.col1, t2.col2, rank() over (partition by t2.col1 order by t2.col2) as rnk
from mytable2 t2
where rnk = 1)
on t1.col1 = t2.col1

Here the rank function is used in order to get unique data from table mytable2. Unfortunately this does not work as the rank function is not unique. That means that you can have more than one row with the value rnk = 1. For example, when the lowest value for col2 in mytable2 appears three times, then we have three rows with rank value 1.

SQL Server link for rank function

Instead of using the rank function you can use the Row_Number function. The Row_Number function is unique.
Thus the correct sql statement is:

select t1.*, t2.*
from mytable1 t1
left outer join
(select t2.col1, t2.col2, row_number() over (partition by t2.col1 order by t2.col2) as rnk
from mytable2 t2
where rnk = 1)
on t1.col1 = t2.col1

When the lowest value for col2 in mytable2 appears three times, then there is only one row with row_number 1. The other two columns get the row_number 2 and 3.

SQL Server link for row_number function