Table Deduplication with Postgresql
ctid
Every
table
has several system columns that are implicitly defined by the system.ctid
, the physical location of the row version within itstable
. Note that although the ctid can be used to locate the row version very quickly, a row’s ctid will change each time it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.
In this case, ctid
can be used for deduplication.
Create test table
1 | test_table=# create table test( |
Insert test data
1 | test_table=# insert into test values(1, 'David'); |
Check raw data
1 | test_table=# select *,ctid from test; |
Finding duplication
Here we want to delete rows with same id
. Finding data by unique id
:
1 | test_table=# select distinct id, count(*) from test group by id having count(*) > 1; |
It means we have 3 rows with id = 1
, 3 rows with id = 2
, 2 rows with id = 3
.
Finding rows needed to kept
1 | test_table=# select ctid, * from test where ctid in (select min(ctid) from test group by id); |
Because ctid
means phsical address of each rows, min(ctid)
gets the first row by query. select min(ctid) from emp group by id
gets phsical addresses of the first rows by identified id
.
Delete duplication
Simply, we can delete the rows not in select min(ctid) from emp group by id
. Identified id
rows will be remained.
1 | test_table=# delete from test where ctid not in (select min(ctid) from test group by id); |
Check result
1 | test_table=# select * from test; |
Refrence
[1] PostgreSQL 8.2.23 Documentation
[2] PostgreSQL删除重复数据
[3] Removing duplicate rows in PostgreSQL
[4] PostgreSQL SELECT DISTINCT