The SQL Chronicle for Pervasive PSQL

Subqueries Everywhere

 Date: December 2009

Most developers already know how to use Pervasive PSQL’s SELECT statement, or how to write queries. And I believe them when they say that. But do they really understand all the features, and do they really use all the capabilities? By talking to many developers over the years I’ve discovered that in most cases that’s not the case. Most developers restrict themselves to using a subset of this statement, and therefore they are limiting themselves. As a result the programs become unnecessarily complex, and they are not as efficient as they could be. It would be comparable to buying a house, and restricting yourself by using just a few of the rooms. That wouldn’t make sense. The better a developer understands SQL, the more efficient his or her programs will become and the better Pervasive PSQL is exploited. This and coming blogs will make developers more aware of the features of SQL as implemented in Pervasive PSQL. This first one is devoted to subqueries.

Subqueries have been part of Pervasive PSQL's SQL dialect for a long time now. But in the beginning subqueries could only be used within the WHERE and HAVING clauses of the SELECT, UPDATE, and DELETE statements. Nowhere else could you specify a subquery. In fact, its use was constrained to selecting rows from intermediate results. Currently, it is possible to specify subqueries almost anywhere, also in the SELECT clause. Besides making the SQL dialect more orthogonal, it also simplifies writing certain queries. Through a few examples, this article shows the power of using subqueries in the SELECT clause.

By the way, the word orthogonal in the context of programming languages means, amongst other things, that the number of odd exceptions is reduced to a minimum.

To illustrate the use of subqueries in the SELECT clause, we use the three tables presented in Figure 1. In these three tables the columns PLAYERS.PNO, PENALTIES.PENNO, and TEAMS.TEAMNO represent the primary keys. The two columns PENALTIES.PNO and TEAMS.PNO are foreign keys pointing to the primary key of the PLAYERS table. Note: This example is derived from the standard example used throughout the book The SQL Guide to Pervasive PSQL.

 Figure 3: The Three Sample Tables

Let’s start this discussion with an example.

Example 1: For each player get the name and the number of his or her penalties. Without subqueries the query would look as follows:

SELECT   PLAYERS.PNO, PLAYERS.NAME, COUNT(PENALTIES.PENNO)
FROM     PLAYERS LEFT OUTER JOIN PENALTIES
         ON (PLAYERS.PNO = PENALTIES.PNO)
GROUP BY PLAYERS.PNO, PLAYERS.NAME

The result is:

PNO  NAME       COUNT(PENALTIES.PENNO)
---  ---------  ----------------------
  1  Metheny                         2
  2  Pastorius                       1
  3  Zawinul                         0

If you use subqueries within the SELECT clause instead, the query would look differently:

SELECT   PLAYERS.PNO, PLAYERS.NAME,
        (SELECT   COUNT(PENALTIES.PENNO)
         FROM     PENALTIES
         WHERE    PENALTIES.PNO = PLAYERS.PNO)
FROM     PLAYERS

What is special about this solution is that the main query does not contain a join, but its SELECT clause contains a subquery. This subquery is correlated because it contains a reference to the main query (PLAYERS.PNO). That means we have to assume that for each row in the PLAYERS table Pervasive PSQL processes this subquery. And because it contains a correlation, it might return a different result for each row. This subquery counts for each player separately his or her number of penalties.

In the solution with the join, we have to be sure we specify the correct join, in this case a left outer join. If we hadn’t done that, but had used an inner join instead, we would not have seen player 3 in the end result, because that player doesn’t have penalties. In the second solution, this is not an issue. The subquery will automatically return the right result. If a player has no penalties, the subquery will return the value zero, which is exactly what we want.

Is this query really easier to formulate than the previous one? Probably not, they are comparable in complexity. However, an advantage of the second solution is that the query resembles the original question somewhat more, and that makes it a little easier to write and read. But that is not a very strong argument for using the second solution. However, there are queries that are only easy to formulate if this construct is used. Let’s demonstrate that with another example.

Example 2: For each player get the name, the number of penalties, plus the number of times this player appears in the TEAMS table.

If we would extend the first query to cater for this extension, we would get this solution:

SELECT   PLAYERS.PNO, PLAYERS.NAME, COUNT(PENALTIES.AMOUNT), COUNT(TEAMS.TEAMNO)
FROM    (PLAYERS LEFT OUTER JOIN PENALTIES
         ON (PLAYERS.PNO = PENALTIES.PNO)) 
             LEFT OUTER JOIN TEAMS
             ON (PLAYERS.PNO = TEAMS.PNO)
GROUP BY PLAYERS.PNO, PLAYERS.NAME

The result is:

PNO  NAME       COUNT(PENALTIES.AMOUNT)  COUNT(TEAMS.TEAMNO)
---  ---------  -----------------------  -------------------
  1  Metheny                          2                    0
  2  Pastorius                        2                    2
  3  Zawinul                          0                    1

The two joins make this query hard to read. But more importantly, this result might look correct, but it isn’t. The second row shows that player 2 has two penalties, which is not true; he has only one penalty. The problem is that player 2 appears once in PENALTIES and twice in TEAMS. The join results in a multiplication of rows: 1 times 2, and as a result the number of penalties is too high.

With subqueries the query would look different and would return a correct result:

SELECT   PLAYERS.PNO, PLAYERS.NAME,
        (SELECT   COUNT(*)
         FROM     PENALTIES
         WHERE    PENALTIES.PNO = PLAYERS.PNO),
        (SELECT   COUNT(*)
         FROM     TEAMS
         WHERE    TEAMS.PNO = PLAYERS.PNO)
FROM     PLAYERS

The result is:

PNO  NAME       COUNT(PENALTIES.AMOUNT)  COUNT(TEAMS.TEAMNO)
---  ---------  -----------------------  -------------------
  1  Metheny                          2                    0
  2  Pastorius                        1                    2
  3  Zawinul                          0                    1

This result is correct, you can check that yourself if you want to. In addition, the solution is quite straightforward. This is a clear example of where the use of subqueries in the SELECT clause is the only correct solution. We could not have written this query using joins.

There are other examples as well that show that when you allow the use of subqueries in the SELECT clause, it makes life easier for developers.

Example 3: For each penalty get the difference between the amount and the average amount. 

SELECT   PENNO, PNO,
         ABS(AMOUNT - (SELECT AVG(AMOUNT) FROM PENALTIES)) AS DIFF
FROM     PENALTIES
The result is:
PENNO  PNO  DIFF
-----  ---  ----
    1   1     10
    2   1      0
    3   2     10
 
Example 4: For each penalty show the amount plus the cumulative amount. 
 
SELECT   PENNO, AMOUNT,
        (SELECT SUM(AMOUNT)
         FROM  PENALTIES AS SUB_PEN
         WHERE  SUB_PEN.PENNO <= MAIN_PEN.PENNO) AS CUM_TOTAL
FROM     PENALTIES AS MAIN_PEN

The result is:

PENNO  AMOUNT  CUM_TOTAL
-----  ------  ---------
    1      60         60
    2      50        110
    3      40        150

Summary: Using subqueries in the SELECT clause really does extend the power of Pervasive PSQL’s SQL dialect. It reduces the complexity of certain queries, and it can lead to SELECT statements that are easy to write, understand, and maintain. If you haven’t been using this feature yet, I strongly recommend you to get comfortable with it.

Comments

No Comments

About Rick vd Lans

Rick is an independent consultant, speaker, and author, specializing in SQL, database technology, data warehousing, and business intelligence. He is managing director and founder of R20/Consultancy (www.r20.nl). He is the author of several books on SQL, including the popular Introduction to SQL which is also available in Chinese, Dutch, Italian, and German. In 2009 he published The SQL Guide to Pervasive PSQL, the first book in years on PSQL. He is an internationally acclaimed speaker who has lectured worldwide for the last 25 years. Currently, he is promoting a new architecture for data warehousing called the Data Delivery Platform. He has been a member of the Dutch ISO standardization committee for SQL for seven years. Rick may be contacted by sending an email to info@r20.nl.