The SQL Chronicle for Pervasive PSQL

  • Changing the Order of Columns in a Table

    Date: June 2010

    When new columns are added, they’re always placed at the end of the table. It’s possible to move existing columns within a table using an ALTER TABLE statement. For example, we can move the column that occupies the first position in a table to the last position. This article explains how columns can be moved, how we can see what the position of a column is, and what the difference is between the logical and the physical position of a column.

    To illustrate all this, we use the following two tables. Both are described in my book The SQL Guide to Pervasive PSQL:

    The TEAMS table:

    TEAMNO  PLAYERNO  DIVISION
    ------  --------  --------
         1         6  first
         2        27  second
    

    The MATCHES table:

    MATCHNO  TEAMNO  PLAYERNO  WON  LOST
    -------  ------  --------  ---  ----
          1       1         6    3     1
          2       1         6    2     3
          3       1         6    3     0
          4       1        44    3     2
          5       1        83    0     3
          6       1         2    1     3
          7       1        57    3     0
          8       1         8    0     3
          9       2        27    3     2
         10       2       104    3     2
         11       2       112    2     3
         12       2       112    1     3
         13       2         8    0     3
    

    Let’s begin by explaining what we mean by the first and the last position. Each column has both a logical and a physical position. The logical position is the position of the column shown when you use * in the SELECT clause. For example, if we process the following statement:

    SELECT   *
    FROM     TEAMS
    

    the columns in the TEAMS table are displayed in the order TEAMNO, PLAYERNO, and DIVISION, because this is the order in which the columns were entered in the CREATE TABLE statement.

    In essence, the logical position of a column is its default position when SELECT * is used. This position doesn’t depend on its physical position. The physical position of a column is the actual position of the column in the file. A column’s logical and physical position may be different. When a table has just been created, the logical position will always be the same as its physical position. The ALTER TABLE statement can change both types of positions which may result in the case where the two positions for a column are no longer the same.

    Example 1: Move the column PLAYERNO in the MATCHES table from the third to the fifth logical position. Next, show that the PLAYERNO column has indeed been moved.

    ALTER TABLE MATCHES
       ( PSQL_MOVE COLUMN PLAYERNO TO 5 )
    
    SELECT   *
    FROM     MATCHES
    WHERE    MATCHNO = 1
    

    The result is:

    MATCHNO  TEAMNO  WON  LOST  PLAYERNO
    -------  ------  ---  ----  --------
          1       1    0     1         6
    

    Explanation: The result shows that PLAYERNO has been moved to the end. The column position you provide must be a whole number greater than 0 and smaller than or equal to the position of the last column. In the previous ALTER TABLE statement we were not allowed to use position 6 or higher, because the MATCHES table has just five columns. However, the column hasn’t been moved physically. In the file, all player numbers are still in the second position. In the book, a view is created that can be used to query column related data. The definition of this view looks like this:

    CREATE   VIEW COLUMNS (TABLE_NAME, COLUMN_NAME, COLUMN_NO, DATA_TYPE,
             CHAR_LENGTH, PRECISION, SCALE, NULLABLE) AS
    SELECT   UPPER(TAB.XF$NAME), UPPER(COL.XE$NAME),
            (SELECT COUNT(*)
             FROM   X$FIELD AS F
             WHERE  F.XE$FILE = TAB.XF$ID
             AND    F.XE$OFFSET <= COL.XE$OFFSET
             AND    F.XE$DATATYPE NOT IN (227,255)), 
             CASE XE$DATATYPE
                WHEN 0 THEN 'STRING'
                WHEN 1 THEN
                   CASE XE$SIZE
                      WHEN 1 THEN 'TINYINT'
                      WHEN 2 THEN 'SMALLINT'
                      WHEN 4 THEN 'INTEGER'
                      WHEN 8 THEN 'BIGINT'
                      ELSE 'UNKNOWN INTEGER'
                   END
                WHEN 2 THEN
                   CASE XE$SIZE
                      WHEN 4 THEN 'REAL'
                      WHEN 8 THEN 'DOUBLE'
                      ELSE 'UNKNOWN REAL'
                   END
                WHEN 3 THEN 'DATE'
                WHEN 4 THEN 'TIME'
                WHEN 5 THEN 'DECIMAL'
                WHEN 8 THEN 'NUMERIC'
                WHEN 9 THEN
                   CASE XE$SIZE
                      WHEN 4 THEN 'BFLOAT4'
                      WHEN 8 THEN 'BFLOAT8'
                      ELSE 'UNKNOWN BFLOAT'
                   END
                WHEN 11 THEN 'VARCHAR'
                WHEN 14 THEN
                   CASE XE$SIZE
                      WHEN 1 THEN 'UNSIGNED1 BINARY'
                      WHEN 2 THEN 'UNSIGNED2 BINARY'
                      WHEN 4 THEN 'UNSIGNED4 BINARY'
                      WHEN 8 THEN 'UNSIGNED8 BINARY'
                      ELSE 'UNKNOWN UNSIGNED BINARY'
                   END
                WHEN 15 THEN
                   CASE XE$SIZE
                      WHEN 2 THEN 'AUTOINC2'
                      WHEN 4 THEN 'AUTOINC4'
                      ELSE 'UNKNOWN AUTOINC'
                   END
                WHEN 16 THEN 'BIT'
                WHEN 17 THEN 'NUMERICSTS'
                WHEN 18 THEN 'NUMERICSA'
                WHEN 19 THEN 'CURRENCY'
                WHEN 20 THEN 'TIMESTAMP'
                WHEN 21 THEN 'CLOB/BLOB'
                WHEN 25 THEN 'WSTRING'
                WHEN 26 THEN 'WZTRING'
                WHEN 27 THEN 'GUID'
                WHEN 30 THEN 'DATETIME'
                ELSE CONCAT('DATATYPE UNKNOWN = ',
                            CAST(XE$DATATYPE AS CHAR(3)))
             END,
             XE$SIZE, XE$SIZE, XE$DEC,
             CASE
                WHEN XE$FLAGS & 4 = 4   THEN 'YES'
                ELSE 'NO'
             END
    FROM     X$FIELD AS COL, X$FILE AS TAB
    WHERE    COL.XE$FILE = TAB.XF$ID
    AND      TAB.XF$FLAGS & 16 <> 16
    AND      COL.XE$DATATYPE NOT IN (227,255)
    

    If we query this view, we will see that PLAYERNO is still in that second position. The reason is that this COLUMNS view displays the physical position of each column and not the logical one. If we want to change the physical position of a column we have to add the term PSQL_PHYSICAL to the ALTER TABLE statement; see the next example.

    Example 2: Move the column PLAYERNO in the TEAMS table from the second to the last physical position.

    ALTER TABLE TEAMS
       ( PSQL_MOVE COLUMN PLAYERNO TO PSQL_PHYSICAL 3 )
    
    SELECT   *
    FROM     TEAMS
    

    The result is:

    TEAMNO  DIVISION  PLAYERNO
    ------  --------  --------
         1  first            6
         2  second          27
    

    Explanation: This result obviously demonstrates that if we change the physical position, we also change the logical position. If you don’t want the logical position to change, you will need a second ALTER TABLE statement to move it back logically. Note that adding the term PSQL_POSITION has no effect on the statement’s result.

    Example 3: Show the physical and logical positions of all the columns in the MATCHES table after the move described in Example 2.

    SELECT   UPPER(TAB.XF$NAME) AS TABLE_NAME, UPPER(COL.XE$NAME) AS COLUMN_NAME,
            (SELECT COUNT(*)
             FROM   X$FIELD AS F
             WHERE  F.XE$FILE = TAB.XF$ID
             AND    F.XE$OFFSET <= COL.XE$OFFSET
             AND    F.XE$DATATYPE NOT IN (227,255)) AS PHYSICAL_ORDER,
             IFNULL ((SELECT A.XA$ATTRS
                  FROM   X$ATTRIB AS A
                  WHERE  A.XA$TYPE='L'
                  AND    A.XA$ID = COL.XE$ID),
                 (SELECT COUNT(*)
                  FROM   X$FIELD AS F
                  WHERE  F.XE$FILE = TAB.XF$ID
                  AND    F.XE$OFFSET <= COL.XE$OFFSET
                  AND    F.XE$DATATYPE NOT IN (227,255))) AS LOGICAL_ORDER
    FROM     X$FIELD AS COL, X$FILE AS TAB
    WHERE    COL.XE$FILE = TAB.XF$ID
    AND      TAB.XF$FLAGS & 16 <> 16
    AND      COL.XE$DATATYPE NOT IN (227,255)
    AND      UPPER(TAB.XF$NAME) = 'MATCHES'
    

    The result is:

    TABLE_NAME  COLUMN_NAME  PHYSICAL_ORDER  LOGICAL_ORDER
    ----------  -----------  --------------  -------------
    MATCHES     MATCHNO                   1              1
    MATCHES     TEAMNO                    2              2
    MATCHES     PLAYERNO                  3              5
    MATCHES     WON                       4              3
    MATCHES     LOST                      5              4
    

    Explanation: The last three rows in the result show that the logical positions of the last three columns are different than their physical positions.

    To summarize, columns can be moved in a table. If we do that we have to be aware that there is a difference between logical and physical positions.

  • Accessing the DDF Files Through Views

     Date: March 2010

    Like most database servers, Pervasive PSQL keeps track of data on database objects, such as tables, columns, views, and indexes, in so-called DDF files (sometimes called catalog or system tables). These files are automatically created by PSQL during installation and/or when a database is created. This type of data is often called (technical) meta data.

    Sometimes, we need to access the meta data in our applications. It might be that we want to know the names of the columns of a specific table, or the data type and maximum length of a particular column. But how do we query those DDF files? There are four different approaches to query them. First, you can use Pervasive Control Center to view the properties of the database objects. The advantage of this approach is that you don’t have to understand the structure of each DDF file. PCC understands it. Retrieving the meta data is purely a matter of point and click. The disadvantage of this approach is that you can’t use it if you want to query the meta data from within an application.

    The second approach is through a set of system stored procedures supported by Pervasive PSQL. These procedures allow us to display meta data from the DDF files. For example, with the following statement we get a list of all the tables in the current database:

    CALL PSP_TABLES (,,'USER TABLE')

    The result might look like this:

    TABLE_QUALIFIER  TABLE_OWNER  TABLE_NAME         TABLE_TYPE  REMARKS  FILE_LOCATION
    ---------------  -----------  -----------------  ----------  -------  ---------------------
    TENNIS           ?            PLAYERS            USER TABLE  ?        PLAYERS.mkd
    TENNIS           ?            TEAMS              USER TABLE  ?        TEAMS.mkd
    TENNIS           ?            MATCHES            USER TABLE  ?        MATCHES.mkd
    TENNIS           ?            PENALTIES          USER TABLE  ?        PENALTIES.mkd
    TENNIS           ?            COMMITTEE_MEMBERS  USER TABLE  ?        COMMITTEE_MEMBERS.mkd
    

    The disadvantage of this approach is that you can’t query the results of the procedures. These procedures work as predefined queries and the returned results are final results. For example, if we only want to see tables with at least four columns, or with a name that starts with the letter D, that’s not possible. So, it’s not a very flexible approach.

    The third approach is the die-hard approach, where we write our own SELECT statements and query the DDF files directly. In principle, querying the DDF files is like querying ordinary tables. However, their internal structures are somewhat complex and rather cryptic. A lot of codes are used to describe certain properties. In other words, writing queries on DDF files is hard, and you have to understand the internal structures of the DDF files very well.

    The fourth approach is that we define views on the DDF files. Those views will present a more straightforward structure to the applications. For my book The SQL Guide to Pervasive PSQL, I created a number of those views. In a simple and transparent way those views give access to the actual, underlying DDF files. In addition, because they are views, we can write any query on those views to get whatever we want.

    For example, with the following long and complex statement we create a view that shows the properties of all the columns of each table in the current database:

    CREATE   VIEW COLUMNS (TABLE_NAME, COLUMN_NAME, COLUMN_NO, DATA_TYPE,
             CHAR_LENGTH, PRECISION, SCALE, NULLABLE) AS
    SELECT   UPPER(TAB.XF$NAME), UPPER(COL.XE$NAME),
            (SELECT COUNT(*)
             FROM   X$FIELD AS F
             WHERE  F.XE$FILE = TAB.XF$ID
             AND    F.XE$OFFSET <= COL.XE$OFFSET
             AND    F.XE$DATATYPE NOT IN (227,255)), 
             CASE XE$DATATYPE
                WHEN 0 THEN 'STRING'
                WHEN 1 THEN
                   CASE XE$SIZE
                      WHEN 1 THEN 'TINYINT'
                      WHEN 2 THEN 'SMALLINT'
                      WHEN 4 THEN 'INTEGER'
                      WHEN 8 THEN 'BIGINT'
                      ELSE 'UNKNOWN INTEGER'
                   END
                WHEN 2 THEN
                   CASE XE$SIZE
                      WHEN 4 THEN 'REAL'
                      WHEN 8 THEN 'DOUBLE'
                      ELSE 'UNKNOWN REAL'
                   END
                WHEN 3 THEN 'DATE'
                WHEN 4 THEN 'TIME'
                WHEN 5 THEN 'DECIMAL'
                WHEN 8 THEN 'NUMERIC'
                WHEN 9 THEN
                   CASE XE$SIZE
                      WHEN 4 THEN 'BFLOAT4'
                      WHEN 8 THEN 'BFLOAT8'
                      ELSE 'UNKNOWN BFLOAT'
                   END
                WHEN 11 THEN 'VARCHAR'
                WHEN 14 THEN
                   CASE XE$SIZE
                      WHEN 1 THEN 'UNSIGNED1 BINARY'
                      WHEN 2 THEN 'UNSIGNED2 BINARY'
                      WHEN 4 THEN 'UNSIGNED4 BINARY'
                      WHEN 8 THEN 'UNSIGNED8 BINARY'
                      ELSE 'UNKNOWN UNSIGNED BINARY'
                   END
                WHEN 15 THEN
                   CASE XE$SIZE
                      WHEN 2 THEN 'AUTOINC2'
                      WHEN 4 THEN 'AUTOINC4'
                      ELSE 'UNKNOWN AUTOINC'
                   END
                WHEN 16 THEN 'BIT'
                WHEN 17 THEN 'NUMERICSTS'
                WHEN 18 THEN 'NUMERICSA'
                WHEN 19 THEN 'CURRENCY'
                WHEN 20 THEN 'TIMESTAMP'
                WHEN 21 THEN 'CLOB/BLOB'
                WHEN 25 THEN 'WSTRING'
                WHEN 26 THEN 'WZTRING'
                WHEN 27 THEN 'GUID'
                WHEN 30 THEN 'DATETIME'
                ELSE CONCAT('DATATYPE UNKNOWN = ', CAST(XE$DATATYPE AS CHAR(3)))
             END,
             XE$SIZE, XE$SIZE, XE$DEC,
             CASE
                WHEN XE$FLAGS & 4 = 4   THEN 'YES'
                ELSE 'NO'
             END
    FROM     X$FIELD AS COL, X$FILE AS TAB
    WHERE    COL.XE$FILE = TAB.XF$ID
    AND      TAB.XF$FLAGS & 16 <> 16
    AND      COL.XE$DATATYPE NOT IN (227,255)
    

    Let’s create a new table to show what a query on the above view looks like.

    CREATE TABLE PLAYERS (
       PLAYERNO      INTEGER NOT NULL,
       NAME          CHAR(15) NOT NULL,
       INITIALS      CHAR(3) NOT NULL,
       BIRTH_DATE    DATE,
       SEX           CHAR(1) NOT NULL,
       JOINED        SMALLINT NOT NULL,
       STREET        VARCHAR(30) NOT NULL,
       HOUSENO       CHAR(4),
       POSTCODE      CHAR(6),
       TOWN          VARCHAR(30) NOT NULL,
       PHONENO       CHAR(13),
       LEAGUENO      CHAR(4),
       PRIMARY KEY   (PLAYERNO))
    

    Now, query the COLUMNS view:

    SELECT   *
    FROM     COLUMNS
    WHERE    TABLE_NAME = 'PLAYERS'
    

    The result is:

    TABLE_NAME  COLUMN_NAME  COLUMN_NO  DATA_TYPE  CHAR_LENGTH  PRECISION  SCALE  NULLABLE
    ----------  -----------  ---------  ---------  -----------  ---------  -----  --------
    PLAYERS       SEX                  5  STRING               1          1      0  NO 
    PLAYERS	     BIRTH_DATE           4  DATE                 4          4      0  YES
    PLAYERS	     JOINED               6  SMALLINT             2          2      0  NO 
    PLAYERS	     INITIALS             3  STRING               3          3      0  NO 
    PLAYERS	     NAME                 2  STRING              15         15      0  NO 
    PLAYERS	     PLAYERNO             1  INTEGER              4          4      0  NO 
    PLAYERS	     STREET               7  VARCHAR             31         31      0  NO 
    PLAYERS       HOUSENO              8  STRING               4          4      0  YES
    PLAYERS	     POSTCODE             9  STRING               6          6      0  YES
    PLAYERS	     TOWN                10  VARCHAR             31         31      0  NO 
    PLAYERS	     PHONENO             11  STRING              13         13      0  YES
    PLAYERS	     LEAGUENO            12  STRING               4          4      0  YES
    

    Here is another example: How many integer columns does the PLAYERS table have?

    SELECT   COUNT(*)
    FROM     COLUMNS
    WHERE    TABLE_NAME = 'PLAYERS'
    AND      DATA_TYPE IN ('TINYINT', 'SMALLINT', 'INTEGER', 'BIGINT')
    

    The result is:

    COUNT(*)
    -------
          2
    

    With the first three approaches this query would not have been possible. And can you imagine how complex this query would have been if we had written it directly on the DDF files? Now, it’s not that complex at all, and the query is quite readable.

    An additional advantage of the fourth approach over the one where developers write their own queries directly on the DDF files, is that not all the developers have to understand the structures of the DDF files, only the ones who write the view. And the chance that someone makes a mistake with understanding how it works is minimized.

    The view definition shown above is just an example. You can change it anyway you want. For example, if you prefer different names for the data types, or if you don’t want to show the table and column names in capitals, just change the CREATE VIEW statement. But hopefully this view is an acceptable starting point.

    If your applications need to access the meta data in the DDF files in a simple way, we recommend you create your own views, move the logic that understands the DDF files from the applications to the view definitions.

    Finally, if you can improve or extend the definition of the above view, or of one of the view definitions in my book, please let me know. You can send you improvements to sql@r20.nl.

  • Extra Conditions in the Join

    January 2009

    The condition in the FROM clause of a SELECT statement is primarily meant to be used for joining tables. However, other conditions are allowed here even if they don’t really belong to the join. What you should know though is that moving a condition from the WHERE clause to the join condition (or vice versa) can actually affect the result of a query. This article explains how Pervasive PSQL handles these extra conditions.

    Let’s begin with an example. We will use the following two tables, TEAMS and PENALTIES. Both have a column called PLAYERNO on which those tables can be joined. These tables are part of the standard sample used in the book The SQL Guide to Pervasive PSQL. Note that this article is based on Section 7.11 of this book.

    The next SELECT statement contains a left outer join plus an additional condition in the WHERE clause.

    SELECT   TEAMS.PLAYERNO, TEAMS.TEAMNO, PENALTIES.PAYMENTNO
    FROM     TEAMS LEFT OUTER JOIN PENALTIES
             ON TEAMS.PLAYERNO = PENALTIES.PLAYERNO
    WHERE    DIVISION = 'second'

    The end result of the query is:

    PLAYERNO  TEAMNO  PAYMENTNO
    --------  ------  ---------
          27       2          3
          27       2          8
    

    How does Pervasive PSQL process this statement? Logically, the following four steps are executed in sequence.

    Step 1: A Cartesian product of the two tables is created. In other words, each row from the TEAMS table is linked to all the rows from the PENALTIES table. In this example that means that the intermediate result of this step consists of 2 times 8 is 16 rows.

    Step 2: In this step, all the conditions in the join condition are applied to the intermediate result of step 1. The condition in this query contains only one condition: TEAMS.PLAYERNO = PENALTIES.PLAYERNO. The intermediate result of this operation contains all the rows of the TEAMS table of which the player number appears in the PENALTIES table. This is the result (on the left are the columns of the TEAMS table, and on the right are those of the PENALTIES table):

    TEAMNO PLAYERNO DIVISION PAYNO PLAYERNO PAYMENT_DATE AMOUNT
    ------ -------- -------- ----- -------- ------------ ------
         1        6 first        1        6 1980-12-08   100.00
         2       27 second       3       27 1983-09-10   100.00
         2       27 second       8       27 1984-11-12    75.00
    

    Step 3: In this step, Pervasive PSQL checks whether an outer join has been specified and whether certain rows are missing and have to be returned. In this example, if teams disappear from this join, they must be returned because of the left outer join specification. However, all the teams in the TEAMS table (1 and 2) are in the intermediate result. So, nothing is missing and nothing has to be returned.

    Step 4: In this step, the above result is passed to the next clause and that clause is processed which, in this example, is the WHERE clause. The WHERE clause in this SELECT statement contains only one condition. It leads to removing all the rows in which the DIVISION column is not equal to ‘second’. The result is:

    TEAMNO PLAYERNO DIVISION PAYNO PLAYERNO PAYMENT_DATE AMOUNT
    ------ -------- -------- ----- -------- ------------ ------
         2       27 second       3       27 1983-09-10   100.00
         2       27 second       8       27 1984-11-12    75.00
    

    Only the two rows above are passed on to the next clause of the SELECT statement, which in this example is the SELECT clause.

    So, logically, this is how we should think about how Pervasive PSQL processes this query. Internally, of course, it will optimize the whole process to avoid, for example, having to execute resource intensive Cartesian products.

    Now the question is if it would make a difference whether we move the condition specified in the WHERE clause to the join condition? Will we get a different result? If we move the condition to the join condition, the following statement arises:

    SELECT   TEAMS.PLAYERNO, TEAMS.TEAMNO, PENALTIES.PAYMENTNO
    FROM     TEAMS LEFT OUTER JOIN PENALTIES
             ON TEAMS.PLAYERNO = PENALTIES.PLAYERNO
             AND DIVISION = 'second'
    

    Let’s go through those four steps again.

    Step 1: Create a Cartesian product of the tables involved. This will lead to the same intermediate result as in the previous statement. It will consist of 16 rows.

    Step 2: Process all the conditions in the join condition. This leads to the following intermediate result:

    TEAMNO  PLAYERNO  DIVISION  PAYNO  PLAYERNO  PAYMENT_DATE AMOUNT
    ------  --------  --------  -----  --------  ------------ ------
         2        27  second        3        27  1983-09-10   100.00
         2        27  second        8        27  1984-11-12    75.00
    

    As you can see, this result is different: only two rows instead of three.

    Step 3: Check for missing rows. Pervasive PSQL checks whether rows from the TEAMS table (because that is the table on the left of the left outer join) are missing from this intermediate result. Team 1 does not appear in this intermediate result because it does not play in the second division. Those rows have to be returned again. As a result, team 1 is added again, which leads to the following result (the question mark represents a null value):

    TEAMNO  PLAYERNO  DIVISION  PAYNO  PLAYERNO  PAYMENT_DATE AMOUNT
    ------  --------  --------  -----  --------  ------------ ------
         2        27  second        3        27  1983-09-10   100.00
         2        27  second        8        27  1984-11-12    75.00
         1         6  first         ?         ?  ?                 ?
    

    Step 4: Because of the absence of a WHERE clause, no rows are removed, and the intermediate result above is passed to the next clause. The final result of the query looks like this:

    PLAYERNO  TEAMNO  PAYMENTNO
    --------  ------  ---------
           6       1          ?
          27       2          3
          27       2          8
    

    The result of this statement clearly differs from that of the previous statement. Conclusion, it does matter where we specify a condition.

    Let’s use another example. The next SELECT statement contains a full outer join plus an additional condition in the FROM clause.

    SELECT   TEAMS.PLAYERNO, TEAMS.TEAMNO, PENALTIES.PAYMENTNO
    FROM     TEAMS FULL OUTER JOIN PENALTIES
             ON TEAMS.PLAYERNO = PENALTIES.PLAYERNO
             AND TEAMS.PLAYERNO > 1000
    

    The result is:

    PLAYERNO  TEAMNO  PAYMENTNO
    --------  ------  ---------
           ?       ?          3
           ?       ?          8
           ?       ?          1
           ?       ?          6
           ?       ?          2
           ?       ?          5
           ?       ?          7
           ?       ?          4
           6       1          ?
          27       2          ?
    

    After processing steps 1 and 2 the join has been processed and the intermediate result is empty. The reason is that there are no player numbers greater than 1000. Then, during step 3, Pervasive PSQL checks whether rows exist in the tables TEAMS and PENALTIES that don’t appear in the result. Well, all the teams and all the penalties are missing, so they have to be added again. This leads to the above end result.

    Conclusion: If an outer join is used, it does matter whether particular conditions are placed in the join condition or in the WHERE clause. Therefore, consider carefully where you want to place them. Note: this does not apply to inner joins but only to outer joins.

     

  • Subqueries Everywhere

     Date: December 2009

    Most developers already know how to use Pervasive PSQL’s SELECT statement, or how to write queries. And I believe them when they say that. But do they really understand all the features, and do they really use all the capabilities? By talking to many developers over the years I’ve discovered that in most cases that’s not the case. Most developers restrict themselves to using a subset of this statement, and therefore they are limiting themselves. As a result the programs become unnecessarily complex, and they are not as efficient as they could be. It would be comparable to buying a house, and restricting yourself by using just a few of the rooms. That wouldn’t make sense. The better a developer understands SQL, the more efficient his or her programs will become and the better Pervasive PSQL is exploited. This and coming blogs will make developers more aware of the features of SQL as implemented in Pervasive PSQL. This first one is devoted to subqueries.

    Subqueries have been part of Pervasive PSQL's SQL dialect for a long time now. But in the beginning subqueries could only be used within the WHERE and HAVING clauses of the SELECT, UPDATE, and DELETE statements. Nowhere else could you specify a subquery. In fact, its use was constrained to selecting rows from intermediate results. Currently, it is possible to specify subqueries almost anywhere, also in the SELECT clause. Besides making the SQL dialect more orthogonal, it also simplifies writing certain queries. Through a few examples, this article shows the power of using subqueries in the SELECT clause.

    By the way, the word orthogonal in the context of programming languages means, amongst other things, that the number of odd exceptions is reduced to a minimum.

    To illustrate the use of subqueries in the SELECT clause, we use the three tables presented in Figure 1. In these three tables the columns PLAYERS.PNO, PENALTIES.PENNO, and TEAMS.TEAMNO represent the primary keys. The two columns PENALTIES.PNO and TEAMS.PNO are foreign keys pointing to the primary key of the PLAYERS table. Note: This example is derived from the standard example used throughout the book The SQL Guide to Pervasive PSQL.

     Figure 3: The Three Sample Tables

    Let’s start this discussion with an example.

    Example 1: For each player get the name and the number of his or her penalties. Without subqueries the query would look as follows:

    SELECT   PLAYERS.PNO, PLAYERS.NAME, COUNT(PENALTIES.PENNO)
    FROM     PLAYERS LEFT OUTER JOIN PENALTIES
             ON (PLAYERS.PNO = PENALTIES.PNO)
    GROUP BY PLAYERS.PNO, PLAYERS.NAME
    

    The result is:

    PNO  NAME       COUNT(PENALTIES.PENNO)
    ---  ---------  ----------------------
      1  Metheny                         2
      2  Pastorius                       1
      3  Zawinul                         0
    

    If you use subqueries within the SELECT clause instead, the query would look differently:

    SELECT   PLAYERS.PNO, PLAYERS.NAME,
            (SELECT   COUNT(PENALTIES.PENNO)
             FROM     PENALTIES
             WHERE    PENALTIES.PNO = PLAYERS.PNO)
    FROM     PLAYERS
    

    What is special about this solution is that the main query does not contain a join, but its SELECT clause contains a subquery. This subquery is correlated because it contains a reference to the main query (PLAYERS.PNO). That means we have to assume that for each row in the PLAYERS table Pervasive PSQL processes this subquery. And because it contains a correlation, it might return a different result for each row. This subquery counts for each player separately his or her number of penalties.

    In the solution with the join, we have to be sure we specify the correct join, in this case a left outer join. If we hadn’t done that, but had used an inner join instead, we would not have seen player 3 in the end result, because that player doesn’t have penalties. In the second solution, this is not an issue. The subquery will automatically return the right result. If a player has no penalties, the subquery will return the value zero, which is exactly what we want.

    Is this query really easier to formulate than the previous one? Probably not, they are comparable in complexity. However, an advantage of the second solution is that the query resembles the original question somewhat more, and that makes it a little easier to write and read. But that is not a very strong argument for using the second solution. However, there are queries that are only easy to formulate if this construct is used. Let’s demonstrate that with another example.

    Example 2: For each player get the name, the number of penalties, plus the number of times this player appears in the TEAMS table.

    If we would extend the first query to cater for this extension, we would get this solution:

    SELECT   PLAYERS.PNO, PLAYERS.NAME, COUNT(PENALTIES.AMOUNT), COUNT(TEAMS.TEAMNO)
    FROM    (PLAYERS LEFT OUTER JOIN PENALTIES
             ON (PLAYERS.PNO = PENALTIES.PNO)) 
                 LEFT OUTER JOIN TEAMS
                 ON (PLAYERS.PNO = TEAMS.PNO)
    GROUP BY PLAYERS.PNO, PLAYERS.NAME
    

    The result is:

    PNO  NAME       COUNT(PENALTIES.AMOUNT)  COUNT(TEAMS.TEAMNO)
    ---  ---------  -----------------------  -------------------
      1  Metheny                          2                    0
      2  Pastorius                        2                    2
      3  Zawinul                          0                    1
    

    The two joins make this query hard to read. But more importantly, this result might look correct, but it isn’t. The second row shows that player 2 has two penalties, which is not true; he has only one penalty. The problem is that player 2 appears once in PENALTIES and twice in TEAMS. The join results in a multiplication of rows: 1 times 2, and as a result the number of penalties is too high.

    With subqueries the query would look different and would return a correct result:

    SELECT   PLAYERS.PNO, PLAYERS.NAME,
            (SELECT   COUNT(*)
             FROM     PENALTIES
             WHERE    PENALTIES.PNO = PLAYERS.PNO),
            (SELECT   COUNT(*)
             FROM     TEAMS
             WHERE    TEAMS.PNO = PLAYERS.PNO)
    FROM     PLAYERS
    

    The result is:

    PNO  NAME       COUNT(PENALTIES.AMOUNT)  COUNT(TEAMS.TEAMNO)
    ---  ---------  -----------------------  -------------------
      1  Metheny                          2                    0
      2  Pastorius                        1                    2
      3  Zawinul                          0                    1
    

    This result is correct, you can check that yourself if you want to. In addition, the solution is quite straightforward. This is a clear example of where the use of subqueries in the SELECT clause is the only correct solution. We could not have written this query using joins.

    There are other examples as well that show that when you allow the use of subqueries in the SELECT clause, it makes life easier for developers.

    Example 3: For each penalty get the difference between the amount and the average amount. 

    SELECT   PENNO, PNO,
             ABS(AMOUNT - (SELECT AVG(AMOUNT) FROM PENALTIES)) AS DIFF
    FROM     PENALTIES
    
    The result is:
    PENNO  PNO  DIFF
    -----  ---  ----
        1   1     10
        2   1      0
        3   2     10
    
     
    Example 4: For each penalty show the amount plus the cumulative amount. 
     
    SELECT   PENNO, AMOUNT,
            (SELECT SUM(AMOUNT)
             FROM  PENALTIES AS SUB_PEN
             WHERE  SUB_PEN.PENNO <= MAIN_PEN.PENNO) AS CUM_TOTAL
    FROM     PENALTIES AS MAIN_PEN
    

    The result is:

    PENNO  AMOUNT  CUM_TOTAL
    -----  ------  ---------
        1      60         60
        2      50        110
        3      40        150
    

    Summary: Using subqueries in the SELECT clause really does extend the power of Pervasive PSQL’s SQL dialect. It reduces the complexity of certain queries, and it can lead to SELECT statements that are easy to write, understand, and maintain. If you haven’t been using this feature yet, I strongly recommend you to get comfortable with it.

More Posts