Dave's Blog: Finding Duplicate Records In A Database Table

Say you have a good-sized table in MySQL with about 100,000 records in it and you want to find all records that contain duplicate values in one or more of their columns. This has nothing to do with keys or indices, just any old value in one or more specified columns...

I banged my head against the keyboard for a while on this one trying to find one query that would quickly get me the result, but I couldn't seem to find one. I couldn't figure out a way to do it that didn't involve scanning the entire table looking for duplicate data for each row. So for 100,000 records, that meant 100,000² scans. Which was garbage.

Instead I settled on a divide-and-conquer strategy to solve the problem. I first used a GROUP BY/HAVING query to find the number of duplicates...

SELECT CONCAT(column_1, ' ', column_2, ' ', column_3) AS searcher, COUNT(CONCAT(column_1, ' ', column_2, ' ', column_3)) AS n
FROM table GROUP BY searcher HAVING n>1 ORDER BY n ASC

... assuming we want all records that have duplicate values in column_1, column_2, and column_3. So if 2 or more records in the table have the same values in each of these columns, we'll get the column values and the number of records that contain them from the above query.

Here's what some duplicate records might look like in our table...

Our query would return a result that looked like this:

(bb, cc, dd), 3
(11, 22, 33), 2
(a1, s1, d1), 2

... for each group of duplicated values in our specified columns. So we know how many duplicates occur and which values are duplicated.

This doesn't get us exactly what we want. But it gets us closer. It shows all instances in the database where the columns of interest are duplicated, and it shows us how many times they're duplicated. So we've narrowed the problem down a bit.

Now all we have to do is run a query for each row returned by the query above, selecting the records that contain those values and displaying them in some way. It still takes longer to run than I'd like, but it still gets the result in a manner of seconds.

Dave's Blog

Friday, July 17, 2009

Finding Duplicate Records In A Database Table

No comments:

Post a Comment

Blog Archive