This Blog

Syndication

Archives

The SQL Chronicle for Pervasive PSQL

  • Extra Conditions in the Join

    January 2009

    The condition in the FROM clause of a SELECT statement is primarily meant to be used for joining tables. However, other conditions are allowed here even if they don’t really belong to the join. What you should know though is that moving a condition from the WHERE clause to the join condition (or vice versa) can actually affect the result of a query. This article explains how Pervasive PSQL handles these extra conditions.

    Let’s begin with an example. We will use the following two tables, TEAMS and PENALTIES. Both have a column called PLAYERNO on which those tables can be joined. These tables are part of the standard sample used in the book The SQL Guide to Pervasive PSQL. Note that this article is based on Section 7.11 of this book.

    The next SELECT statement contains a left outer join plus an additional condition in the WHERE clause.

    SELECT   TEAMS.PLAYERNO, TEAMS.TEAMNO, PENALTIES.PAYMENTNO
    FROM     TEAMS LEFT OUTER JOIN PENALTIES
             ON TEAMS.PLAYERNO = PENALTIES.PLAYERNO
    WHERE    DIVISION = 'second'

    The end result of the query is:

    PLAYERNO  TEAMNO  PAYMENTNO
    --------  ------  ---------
          27       2          3
          27       2          8
    

    How does Pervasive PSQL process this statement? Logically, the following four steps are executed in sequence.

    Step 1: A Cartesian product of the two tables is created. In other words, each row from the TEAMS table is linked to all the rows from the PENALTIES table. In this example that means that the intermediate result of this step consists of 2 times 8 is 16 rows.

    Step 2: In this step, all the conditions in the join condition are applied to the intermediate result of step 1. The condition in this query contains only one condition: TEAMS.PLAYERNO = PENALTIES.PLAYERNO. The intermediate result of this operation contains all the rows of the TEAMS table of which the player number appears in the PENALTIES table. This is the result (on the left are the columns of the TEAMS table, and on the right are those of the PENALTIES table):

    TEAMNO PLAYERNO DIVISION PAYNO PLAYERNO PAYMENT_DATE AMOUNT
    ------ -------- -------- ----- -------- ------------ ------
         1        6 first        1        6 1980-12-08   100.00
         2       27 second       3       27 1983-09-10   100.00
         2       27 second       8       27 1984-11-12    75.00
    

    Step 3: In this step, Pervasive PSQL checks whether an outer join has been specified and whether certain rows are missing and have to be returned. In this example, if teams disappear from this join, they must be returned because of the left outer join specification. However, all the teams in the TEAMS table (1 and 2) are in the intermediate result. So, nothing is missing and nothing has to be returned.

    Step 4: In this step, the above result is passed to the next clause and that clause is processed which, in this example, is the WHERE clause. The WHERE clause in this SELECT statement contains only one condition. It leads to removing all the rows in which the DIVISION column is not equal to ‘second’. The result is:

    TEAMNO PLAYERNO DIVISION PAYNO PLAYERNO PAYMENT_DATE AMOUNT
    ------ -------- -------- ----- -------- ------------ ------
         2       27 second       3       27 1983-09-10   100.00
         2       27 second       8       27 1984-11-12    75.00
    

    Only the two rows above are passed on to the next clause of the SELECT statement, which in this example is the SELECT clause.

    So, logically, this is how we should think about how Pervasive PSQL processes this query. Internally, of course, it will optimize the whole process to avoid, for example, having to execute resource intensive Cartesian products.

    Now the question is if it would make a difference whether we move the condition specified in the WHERE clause to the join condition? Will we get a different result? If we move the condition to the join condition, the following statement arises:

    SELECT   TEAMS.PLAYERNO, TEAMS.TEAMNO, PENALTIES.PAYMENTNO
    FROM     TEAMS LEFT OUTER JOIN PENALTIES
             ON TEAMS.PLAYERNO = PENALTIES.PLAYERNO
             AND DIVISION = 'second'
    

    Let’s go through those four steps again.

    Step 1: Create a Cartesian product of the tables involved. This will lead to the same intermediate result as in the previous statement. It will consist of 16 rows.

    Step 2: Process all the conditions in the join condition. This leads to the following intermediate result:

    TEAMNO  PLAYERNO  DIVISION  PAYNO  PLAYERNO  PAYMENT_DATE AMOUNT
    ------  --------  --------  -----  --------  ------------ ------
         2        27  second        3        27  1983-09-10   100.00
         2        27  second        8        27  1984-11-12    75.00
    

    As you can see, this result is different: only two rows instead of three.

    Step 3: Check for missing rows. Pervasive PSQL checks whether rows from the TEAMS table (because that is the table on the left of the left outer join) are missing from this intermediate result. Team 1 does not appear in this intermediate result because it does not play in the second division. Those rows have to be returned again. As a result, team 1 is added again, which leads to the following result (the question mark represents a null value):

    TEAMNO  PLAYERNO  DIVISION  PAYNO  PLAYERNO  PAYMENT_DATE AMOUNT
    ------  --------  --------  -----  --------  ------------ ------
         2        27  second        3        27  1983-09-10   100.00
         2        27  second        8        27  1984-11-12    75.00
         1         6  first         ?         ?  ?                 ?
    

    Step 4: Because of the absence of a WHERE clause, no rows are removed, and the intermediate result above is passed to the next clause. The final result of the query looks like this:

    PLAYERNO  TEAMNO  PAYMENTNO
    --------  ------  ---------
           6       1          ?
          27       2          3
          27       2          8
    

    The result of this statement clearly differs from that of the previous statement. Conclusion, it does matter where we specify a condition.

    Let’s use another example. The next SELECT statement contains a full outer join plus an additional condition in the FROM clause.

    SELECT   TEAMS.PLAYERNO, TEAMS.TEAMNO, PENALTIES.PAYMENTNO
    FROM     TEAMS FULL OUTER JOIN PENALTIES
             ON TEAMS.PLAYERNO = PENALTIES.PLAYERNO
             AND TEAMS.PLAYERNO > 1000
    

    The result is:

    PLAYERNO  TEAMNO  PAYMENTNO
    --------  ------  ---------
           ?       ?          3
           ?       ?          8
           ?       ?          1
           ?       ?          6
           ?       ?          2
           ?       ?          5
           ?       ?          7
           ?       ?          4
           6       1          ?
          27       2          ?
    

    After processing steps 1 and 2 the join has been processed and the intermediate result is empty. The reason is that there are no player numbers greater than 1000. Then, during step 3, Pervasive PSQL checks whether rows exist in the tables TEAMS and PENALTIES that don’t appear in the result. Well, all the teams and all the penalties are missing, so they have to be added again. This leads to the above end result.

    Conclusion: If an outer join is used, it does matter whether particular conditions are placed in the join condition or in the WHERE clause. Therefore, consider carefully where you want to place them. Note: this does not apply to inner joins but only to outer joins.

     

  • Subqueries Everywhere

     Date: December 2009

    Most developers already know how to use Pervasive PSQL’s SELECT statement, or how to write queries. And I believe them when they say that. But do they really understand all the features, and do they really use all the capabilities? By talking to many developers over the years I’ve discovered that in most cases that’s not the case. Most developers restrict themselves to using a subset of this statement, and therefore they are limiting themselves. As a result the programs become unnecessarily complex, and they are not as efficient as they could be. It would be comparable to buying a house, and restricting yourself by using just a few of the rooms. That wouldn’t make sense. The better a developer understands SQL, the more efficient his or her programs will become and the better Pervasive PSQL is exploited. This and coming blogs will make developers more aware of the features of SQL as implemented in Pervasive PSQL. This first one is devoted to subqueries.

    Subqueries have been part of Pervasive PSQL's SQL dialect for a long time now. But in the beginning subqueries could only be used within the WHERE and HAVING clauses of the SELECT, UPDATE, and DELETE statements. Nowhere else could you specify a subquery. In fact, its use was constrained to selecting rows from intermediate results. Currently, it is possible to specify subqueries almost anywhere, also in the SELECT clause. Besides making the SQL dialect more orthogonal, it also simplifies writing certain queries. Through a few examples, this article shows the power of using subqueries in the SELECT clause.

    By the way, the word orthogonal in the context of programming languages means, amongst other things, that the number of odd exceptions is reduced to a minimum.

    To illustrate the use of subqueries in the SELECT clause, we use the three tables presented in Figure 1. In these three tables the columns PLAYERS.PNO, PENALTIES.PENNO, and TEAMS.TEAMNO represent the primary keys. The two columns PENALTIES.PNO and TEAMS.PNO are foreign keys pointing to the primary key of the PLAYERS table. Note: This example is derived from the standard example used throughout the book The SQL Guide to Pervasive PSQL.

     Figure 3: The Three Sample Tables

    Let’s start this discussion with an example.

    Example 1: For each player get the name and the number of his or her penalties. Without subqueries the query would look as follows:

    SELECT   PLAYERS.PNO, PLAYERS.NAME, COUNT(PENALTIES.PENNO)
    FROM     PLAYERS LEFT OUTER JOIN PENALTIES
             ON (PLAYERS.PNO = PENALTIES.PNO)
    GROUP BY PLAYERS.PNO, PLAYERS.NAME
    

    The result is:

    PNO  NAME       COUNT(PENALTIES.PENNO)
    ---  ---------  ----------------------
      1  Metheny                         2
      2  Pastorius                       1
      3  Zawinul                         0
    

    If you use subqueries within the SELECT clause instead, the query would look differently:

    SELECT   PLAYERS.PNO, PLAYERS.NAME,
            (SELECT   COUNT(PENALTIES.PENNO)
             FROM     PENALTIES
             WHERE    PENALTIES.PNO = PLAYERS.PNO)
    FROM     PLAYERS
    

    What is special about this solution is that the main query does not contain a join, but its SELECT clause contains a subquery. This subquery is correlated because it contains a reference to the main query (PLAYERS.PNO). That means we have to assume that for each row in the PLAYERS table Pervasive PSQL processes this subquery. And because it contains a correlation, it might return a different result for each row. This subquery counts for each player separately his or her number of penalties.

    In the solution with the join, we have to be sure we specify the correct join, in this case a left outer join. If we hadn’t done that, but had used an inner join instead, we would not have seen player 3 in the end result, because that player doesn’t have penalties. In the second solution, this is not an issue. The subquery will automatically return the right result. If a player has no penalties, the subquery will return the value zero, which is exactly what we want.

    Is this query really easier to formulate than the previous one? Probably not, they are comparable in complexity. However, an advantage of the second solution is that the query resembles the original question somewhat more, and that makes it a little easier to write and read. But that is not a very strong argument for using the second solution. However, there are queries that are only easy to formulate if this construct is used. Let’s demonstrate that with another example.

    Example 2: For each player get the name, the number of penalties, plus the number of times this player appears in the TEAMS table.

    If we would extend the first query to cater for this extension, we would get this solution:

    SELECT   PLAYERS.PNO, PLAYERS.NAME, COUNT(PENALTIES.AMOUNT), COUNT(TEAMS.TEAMNO)
    FROM    (PLAYERS LEFT OUTER JOIN PENALTIES
             ON (PLAYERS.PNO = PENALTIES.PNO)) 
                 LEFT OUTER JOIN TEAMS
                 ON (PLAYERS.PNO = TEAMS.PNO)
    GROUP BY PLAYERS.PNO, PLAYERS.NAME
    

    The result is:

    PNO  NAME       COUNT(PENALTIES.AMOUNT)  COUNT(TEAMS.TEAMNO)
    ---  ---------  -----------------------  -------------------
      1  Metheny                          2                    0
      2  Pastorius                        2                    2
      3  Zawinul                          0                    1
    

    The two joins make this query hard to read. But more importantly, this result might look correct, but it isn’t. The second row shows that player 2 has two penalties, which is not true; he has only one penalty. The problem is that player 2 appears once in PENALTIES and twice in TEAMS. The join results in a multiplication of rows: 1 times 2, and as a result the number of penalties is too high.

    With subqueries the query would look different and would return a correct result:

    SELECT   PLAYERS.PNO, PLAYERS.NAME,
            (SELECT   COUNT(*)
             FROM     PENALTIES
             WHERE    PENALTIES.PNO = PLAYERS.PNO),
            (SELECT   COUNT(*)
             FROM     TEAMS
             WHERE    TEAMS.PNO = PLAYERS.PNO)
    FROM     PLAYERS
    

    The result is:

    PNO  NAME       COUNT(PENALTIES.AMOUNT)  COUNT(TEAMS.TEAMNO)
    ---  ---------  -----------------------  -------------------
      1  Metheny                          2                    0
      2  Pastorius                        1                    2
      3  Zawinul                          0                    1
    

    This result is correct, you can check that yourself if you want to. In addition, the solution is quite straightforward. This is a clear example of where the use of subqueries in the SELECT clause is the only correct solution. We could not have written this query using joins.

    There are other examples as well that show that when you allow the use of subqueries in the SELECT clause, it makes life easier for developers.

    Example 3: For each penalty get the difference between the amount and the average amount. 

    SELECT   PENNO, PNO,
             ABS(AMOUNT - (SELECT AVG(AMOUNT) FROM PENALTIES)) AS DIFF
    FROM     PENALTIES
    
    The result is:
    PENNO  PNO  DIFF
    -----  ---  ----
        1   1     10
        2   1      0
        3   2     10
    
     
    Example 4: For each penalty show the amount plus the cumulative amount. 
     
    SELECT   PENNO, AMOUNT,
            (SELECT SUM(AMOUNT)
             FROM  PENALTIES AS SUB_PEN
             WHERE  SUB_PEN.PENNO <= MAIN_PEN.PENNO) AS CUM_TOTAL
    FROM     PENALTIES AS MAIN_PEN
    

    The result is:

    PENNO  AMOUNT  CUM_TOTAL
    -----  ------  ---------
        1      60         60
        2      50        110
        3      40        150
    

    Summary: Using subqueries in the SELECT clause really does extend the power of Pervasive PSQL’s SQL dialect. It reduces the complexity of certain queries, and it can lead to SELECT statements that are easy to write, understand, and maintain. If you haven’t been using this feature yet, I strongly recommend you to get comfortable with it.

More Posts