I banged my head against the keyboard for a while on this one trying to find one query that would quickly get me the result, but I couldn't seem to find one. I couldn't figure out a way to do it that didn't involve scanning the entire table looking for duplicate data for each row. So for 100,000 records, that meant 100,0002 scans. Which was garbage.
Instead I settled on a divide-and-conquer strategy to solve the problem. I first used a GROUP BY/HAVING query to find the number of duplicates...
SELECT CONCAT(column_1, ' ', column_2, ' ', column_3) AS searcher, COUNT(CONCAT(column_1, ' ', column_2, ' ', column_3)) AS n
FROM table GROUP BY searcher HAVING n>1 ORDER BY n ASC... assuming we want all records that have duplicate values in column_1, column_2, and column_3. So if 2 or more records in the table have the same values in each of these columns, we'll get the column values and the number of records that contain them from the above query.
Here's what some duplicate records might look like in our table...

Our query would return a result that looked like this:
(bb, cc, dd), 3
(11, 22, 33), 2
(a1, s1, d1), 2
... for each group of duplicated values in our specified columns. So we know how many duplicates occur and which values are duplicated.
This doesn't get us exactly what we want. But it gets us closer. It shows all instances in the database where the columns of interest are duplicated, and it shows us how many times they're duplicated. So we've narrowed the problem down a bit.
Now all we have to do is run a query for each row returned by the query above, selecting the records that contain those values and displaying them in some way. It still takes longer to run than I'd like, but it still gets the result in a manner of seconds.
No comments:
Post a Comment