Databases - Practical PostgreSQL - Removing Duplicate Rows with DISTINCT

Databases - Practical PostgreSQL
Previous Page	Home	Next Page

Removing Duplicate Rows with DISTINCT

The optional DISTINCT keyword excludes duplicate rows from the result set. If supplied without the ON clause, a query that specifies DISTINCT will exclude any row whose target columns have already been retrieved identically. Only columns in the SELECT's target list will be evaluated.

For example, the books table has 15 rows, each with an author_id. Some authors may have several entries in the books table, causing there to be several rows with the same author_id. Supplying the DISTINCT clause, as shown in the first query in Example 4-31, ensures that the result set will not have two identical rows.

Example 4-31. Using DISTINCT

booktown=# 
SELECT DISTINCT author_id

booktown-# 
       FROM books;

 author_id
-----------
        16
       115
      1212
      1644
      1809
      1866
      2001
      2031
      4156
      7805
      7806
     15990
     25041
(13 rows)

booktown=# 
SELECT DISTINCT ON (author_id)

booktown-# 
       author_id, title

booktown-# 
       FROM books;

 author_id |         title
-----------+-----------------------
        16 | Little Women
       115 | The Tell-Tale Heart
      1212 | Practical PostgreSQL
      1644 | Dynamic Anatomy
      1809 | The Cat in the Hat
      1866 | Dune
      2001 | 2001: A Space Odyssey
      2031 | Goodnight Moon
      4156 | The Shining
      7805 | Programming Python
      7806 | Perl Cookbook
     15990 | Franklin in the Dark
     25041 | The Velveteen Rabbit
(13 rows)

As you can see, the first query in Example 4-31 returns only 13 rows from the books table, even though there are 15 total rows within it. Two authors with two books each end up being displayed only once.

The second query in Example 4-31 uses a different form of DISTINCT, which specifies the columns (or expressions) to be checked for redundancies. In this case, 13 rows are still returned, as the ON clause specifies to use the author_id column as the basis for determining if a row is redundant or not. Without the ON clause, the second query would return all 15 rows, because the DISTINCT clause would cause PostgreSQL to look for rows that are completely unique.

The titles that are omitted from the resultant data set by ON are arbitrarily determined by PostgreSQL, unless an ORDER BY clause is specified. If the ORDER BY clause is used with DISTINCT, you can specify the order in which columns are selected; hence, you can select which rows will be considered distinct first. See the Section called Sorting Rows with ORDER BY " for information about sorting rows.

If you are interested in grouping rows which have non-unique criteria, rather than omitting all rows but one, see the description of the GROUP BY clause in the Section called Grouping Rows with GROUP BY ."

Databases - Practical PostgreSQL
Previous Page	Home	Next Page