Table Deduplication with Postgresql

`ctid`

Every table has several system columns that are implicitly defined by the system.
ctid, the physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row’s ctid will change each time it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.

In this case, ctid can be used for deduplication.

Create test table

test_table=# create table test(
test_table(# id int,
test_table(# name varchar);
CREATE TABLE

Insert test data

test_table=# insert into test values(1, 'David');
INSERT 0 1
test_table=# insert into test values(1, 'Wang');
INSERT 0 1
test_table=# insert into test values(1, 'Fatti');
INSERT 0 1
test_table=# insert into test values(2, 'Haupt');
INSERT 0 1
test_table=# insert into test values(2, 'Ost');
INSERT 0 1
test_table=# insert into test values(2, 'Ost');
INSERT 0 1
test_table=# insert into test values(3, 'Buch');
INSERT 0 1
test_table=# insert into test values(3, 'Musik');
INSERT 0 1

Check raw data

test_table=# select *,ctid from test;
 id | name  | ctid  
----+-------+-------
  1 | David | (0,1)
  1 | Wang  | (0,2)
  1 | Fatti | (0,3)
  2 | Haupt | (0,4)
  2 | Ost   | (0,5)
  2 | Ost   | (0,6)
  3 | Buch  | (0,7)
  3 | Musik | (0,8)
(8 rows)

Finding duplication

Here we want to delete rows with same id. Finding data by unique id:

test_table=# select distinct id, count(*) from test group by id having count(*) > 1;
 id | count 
----+-------
  1 |     3
  2 |     3
  3 |     2
(3 rows)

It means we have 3 rows with id = 1, 3 rows with id = 2, 2 rows with id = 3.

Finding rows needed to kept

test_table=# select ctid, * from test where ctid in (select min(ctid) from test group by id);
 ctid  | id | name  
-------+----+-------
 (0,1) |  1 | David
 (0,4) |  2 | Haupt
 (0,7) |  3 | Buch
(3 rows)

Because ctid means phsical address of each rows, min(ctid) gets the first row by query. select min(ctid) from emp group by id gets phsical addresses of the first rows by identified id.

Delete duplication

Simply, we can delete the rows not in select min(ctid) from emp group by id. Identified id rows will be remained.

1 2	test_table=# delete from test where ctid not in (select min(ctid) from test group by id); DELETE 5

Check result

test_table=# select * from test;
 id | name  
----+-------
  1 | David
  2 | Haupt
  3 | Buch
(3 rows)

Refrence

[1] PostgreSQL 8.2.23 Documentation
[2] PostgreSQL删除重复数据
[3] Removing duplicate rows in PostgreSQL
[4] PostgreSQL SELECT DISTINCT