Friday, July 17, 2009

Finding Duplicate Records In A Database Table

Say you have a good-sized table in MySQL with about 100,000 records in it and you want to find all records that contain duplicate values in one or more of their columns. This has nothing to do with keys or indices, just any old value in one or more specified columns...

I banged my head against the keyboard for a while on this one trying to find one query that would quickly get me the result, but I couldn't seem to find one. I couldn't figure out a way to do it that didn't involve scanning the entire table looking for duplicate data for each row. So for 100,000 records, that meant 100,0002 scans. Which was garbage.

Instead I settled on a divide-and-conquer strategy to solve the problem. I first used a GROUP BY/HAVING query to find the number of duplicates...

SELECT CONCAT(column_1, ' ', column_2, ' ', column_3) AS searcher, COUNT(CONCAT(column_1, ' ', column_2, ' ', column_3)) AS n
FROM table GROUP BY searcher HAVING n>1 ORDER BY n ASC


... assuming we want all records that have duplicate values in column_1, column_2, and column_3. So if 2 or more records in the table have the same values in each of these columns, we'll get the column values and the number of records that contain them from the above query.

Here's what some duplicate records might look like in our table...



Our query would return a result that looked like this:

(bb, cc, dd), 3
(11, 22, 33), 2
(a1, s1, d1), 2

... for each group of duplicated values in our specified columns. So we know how many duplicates occur and which values are duplicated.

This doesn't get us exactly what we want. But it gets us closer. It shows all instances in the database where the columns of interest are duplicated, and it shows us how many times they're duplicated. So we've narrowed the problem down a bit.

Now all we have to do is run a query for each row returned by the query above, selecting the records that contain those values and displaying them in some way. It still takes longer to run than I'd like, but it still gets the result in a manner of seconds.

How would you store a tree in a relational database table?

I was asked this in an interview a few years ago. I choked on it. Completely froze and gave a very wrong answer. The interviewers didn't push me on it. They just kinda nodded their heads and moved on with slightly confused looks on their faces. I left really annoyed with myself and worked out the right answer to the question about 17 minutes after I'd left the place. The sad thing is that it was such an easy question. All I had to do was draw a tree and work out the solution. Instead I didn't bother thinking at all about how we use and traverse trees. I was so nervous that I spewed out some nonsense.

The answer I came up with in the car afterwards, and I think this is a correct answer is this:

You have to first decide on the method tree traversal: depth- or breadth-first. Say we choose breadth-first. Next we traverse the tree and store each node in the database table. Say the table has 4 columns: an auto-increment index, node_name, node_value, and parent_node. So each row in our table contains (index, node_name, node_value, parent_node)

Now we traverse the tree, first visiting the root node and storing it. Assume it's named p0. So we store (index, p0, p0-value, null). We then visit the root node's children from left to right and for each of these we store (index+1, pn, pn-value, p0) where index+1 is the auto-increment index, pn is the node name, pn-value is the node value, and p0 is the parent node (currently root). Repeat until the we have breadth-first traversed and stored the entire tree.

To reconstruct the tree, select the root node from the database (parent_node = null) and add it to the tree. Then select all nodes in the database with the root node as parent, in order of their index, and add them to the tree as children of root. Next select all nodes in the database with a parent equal to the left-most child of the root node, and add them to the tree in order of their indices. We have to breadth-first traverse the tree as we add nodes to it to complete it.

Why This Is Here

You Should Write Blogs by Steve Yegge

Why I Blog by Andrew Sullivan