问题描述:

I have an SQL query which looks at date-of-birth, last name and a soundex of first name to identify duplicates. The following query finds some 8,000 rows (which I assume means there are around 8,000 duplicate records).

select dob,last_name,soundex(first_name),count(*)

from clients

group by dob,last_name,soundex(first_name)

having count(*) >1

Almost all of the results have a count of 2, a few have a count of 3 where obviously the record existed twice in one of the two databases which were merged.

The next step I need to take is to mark one of the rows, doesn't really matter, with a duplicate flag and to mark each row with the opposite rows key. Is there a way of doing this using SQL?

网友答案:

This should do what you are after, the UPDATE in one go.

UPDATE FROM clients c
INNER JOIN
(
  select dob,last_name,soundex(first_name),MIN(id) as keep
  from clients
  group by dob,last_name,soundex(first_name)
  having count(*) >1
) k
ON c.dob=k.dob AND c.last_name=k.last_name AND soundex(c.first_name)=soundex(k.first_name)
SET duplicateid = NULLIF(k.keep, c.id),
    hasduplicate = (k.keep = c.id)

It assumes you have 3 columns not stated in the question

  • id: primary key
  • duplicateid: points to the dup being kept
  • hasduplicate: boolean, marks the one to keep
网友答案:

Well, you could use SELECT DISTINCT, and then mark a single row as "not duplicate" -- then search for rows that are "not duplicate" to find the duplicate.

网友答案:

Here is a query that will give you not only the duplicates, but also the first id inserted (assuming Id is the sequential primary-key column) and the newest id.

OTTOMH

select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
from clients
group by dob,last_name,soundex(first_name)
having count(*) >1

You can use this in a JOIN to do your update

UPDATE Clients
SET OppositeRowId = DuplicateRows.NewestId
FROM
(
    select dob, last_name, soundex(first_name) firstnamesoundex, min (Id) OldestId, max (Id) NewestId, Count (*) NumRows
    from clients
    group by dob,last_name,soundex(first_name)
    having count(*) >1
) DuplicateRows
WHERE
DuplicateRows.OldestId = Clients.Id

All of this assumes that you have one duplicate. If you have more than one, you are going to have to try something different.

相关阅读:
Top