In the previous article I have described how tables are joined together in SQL. In this article I will get into more advanced SQL stuff. In SQL, there are many ways how one problem can be solved. In procedural languages (like PHP), there are also many ways how the same procedure can be written. But these are mostly cosmetic changes, which do not lead to a completely different code. Because SQL is declarative language, there are many conceptually different ways how a task can be described. And each of these ways may or may not lead to a different sequence of steps executed on the data, which may or may not lead to identical results.
Because of this, it is crucial for you to understand what the SQL query is really doing and why it returns the data it returns. In this article, I will try to explain some of the common approaches and the differences between them. Honestly, this is quite difficult and requires a lot of thinking. In the second part of the article I will explain what aggregation in SQL is and what it is good for.
A sub-query is a SQL query executed within another query. Because SQL is a non-procedural language, there is really no way to chain results from multiple queries together or to apply a sequence of queries to data (apart from using temporary tables, which is ugly). In SQL, sub-queries are used for that. There are multiple options where a sub-query can be used:
IN
operator, the sub-select must return only a single column.Sub-queries can be used in any SQL command (SELECT, UPDATE, DELETE and with limitations even in INSERT). However, they are most commonly used within a SELECT query. Also note that it is not possible to use UPDATE, DELETE or INSERT commands as sub-queries, because they do not return a value.
Let’s say, you want to list all persons in the database who have an email.
If you have followed the previous article about joins, you should be able to write
it using two JOIN
s. Try to do it by yourself as an exercise:
One of the two above solutions is roughly what you should have arrived at. I haven’t precisely specified
the columns you should select or the ordering, because it does not really matter. The important part
is that you have joined together tables person
with contact
and then added contact_type
(joins to contact
)
or you can join the contact
and contact_type
first and then join person
to them (again joins to contact
).
Because I want only persons with email, a simple inner join should be used and no left or right is
necessary. The two above queries are equivalent, the second one is not so verbose, but it is still
unambiguously explicit.
An alternative to the above approach is to use a sub-query to contain the %mail
condition.
If you want to understand a nested query, you have to start from the inside. The inner-most query is this one:
You can run the sub-query as a standalone query. On the sample database, it will
give you a table with a single column id_contact_type
and a single value 4
.
id_contact_type |
---|
4 |
Now that you understand the innermost query, you can jump one level up:
The above query takes the contact
table and joins it with the type_email
table which is defined by the inner query.
Notice that each sub-query must be assigned an alias – AS type_email
in this case – which enables you
to reference the inner query in the outer query. The outer query cannot reference tables from the inner query
– contact_type
table is not accessible in the outer query above.
You can see the contents of the type_email
table above. Because I have used an INNER JOIN, the query will
contain only the rows that have the contact.id_contact_type
value equal to those in the type_email
table
and since that is only one value, it is essentially the same as writing contact.id_contact_type = 4
.
Therefore the above query essentially selects all contacts that have the type matching %email
.
Now that you know what the inner queries do, you can again look at the entire query:
The principle is the same, I have assigned alias contact_email
to the inner query, which contains
all emails in the database. The result table is INNER
joined with persons, which yields
only the persons having an email.
Although this approach probably looks much more complicated then the first one, it is used widely and for complicated queries it is in fact easier (because it allows you to run parts of the query isolated).
The IN operator is used as part of the WHERE
clause to check if a value of
a column is in a set of values.
Again, there are two sub-queries which you can run standalone. The innermost query is the same as in the previous example with sub-queries. However this entire query is simpler as there are no joins. This in turn means that it is not possible to display results from multiple tables. This is a limitation of this approach.
Notice that to be usable in an IN
operator, the sub-query must return only a single
column (also it wouldn’t make much sense to search for id_contact_type
in name
column).
Also notice that I have used the DISTINCT
keyword in both sub-queries to ensure that the
result is a set. Depending on the database server (and version), this may or may not be
necessary.
The fourth approach uses the EXISTS
operator which is used in the WHERE
clause and
checks if (for each row) a row exists in the sub-query.
Because this probably looks a little complicated, lets limit this to the middle query:
The query selects from the contact
table rows for which a corresponding
row exists in the sub-query and meets the condition person.id_person = contact.id_person
.
The inner query selects 1
(which is a column full of ones), because the EXISTS
operator
checks only if a row is returned (not for the contents of the row). So it does not
matter what the inner query selects, but it must select something.
The the inner query has the condition that matches the email and
contact.id_contact_type = contact_type.id_contact_type
. This probably looks strange, because
the query selects from the table contact_type
, yet it references the table contact
as well.
This is called a correlated sub-query. In SQL, a query cannot reference tables inside a
sub-query (it can only use its result), but a sub-query can reference its parent query.
A correlated sub-query must be evaluated together with its parent query.
The EXISTS
operator can always be replaced by some other approach. I would also discourage you
from using correlated sub-queries as they are usually both inefficient and incomprehensible.
But there are some rare cases where they must be used, so you should know that something like
that is possible.
The examples above show how the same query (retrieving persons with e-mails) can be written using four distinctively different approaches:
IN
operator and sub-queries to select sets.EXISTS
operator and correlated sub-queries.You have already learned the first one in the article about joins.
The second approach with sub-queries is a lot more complicated (and abstract), but it is really
important. There are many problems which cannot be solved without using sub-queries.
The third approach using the IN
operator use-cases can be solved by the first two approaches, but
it is usually much simpler, efficient and elegant (if it can be used). So you should really learn
to use it, because it is really worth it.
The fourth approach using EXISTS
is not really that important as its use-cases can be solved by the
other approaches. Besides, I don’t like it. Of course, all the approaches can be freely combined,
so a personal favorite of mine would probably be:
If you want to display data from multiple tables, you have to use JOIN
. Therefore the last two
approaches do not allow you to select a contact value. Also notice that you can
freely combine any of the approaches together.
In the previous example I have shown how a single problem can be solved with multiple approaches in SQL. In this example I want to show you that the choice of an approach is not at all cosmetic and has quite deep implications.
Lets start with the following query, which selects all persons who have an address and that address is in Brno city, i.e. all persons living in Brno.
The following query returns exactly the same result. It uses a sub-query
which returns a table named location_brno
which contains only addresses in Brno
and then joins this table to persons.
The following query is made out of the 1A query. I have added LEFT
to the
join and a condition OR (city IS NULL)
. If I add only the LEFT
, the
result of the query will not change at all.
The LEFT
join will add people who do not have any address, but those people
also do not have city = 'Brno'
therefore they would not be returned.
One option is to add the condition city IS NULL
to return persons without a city.
Therefore the following query will return all persons living in Brno plus
those who have no address.
The following query is made out of 1B query. It uses the sub-query and
I have added a LEFT
to its join. Now this behaves differently to the
2A query right above. The sub-query contains only addresses in
Brno and there is no search condition in the parent query. This means that the
LEFT
join will select all persons and show their addresses provided that
they are in Brno.
An alternative approach is to put the search condition inside the join condition:
The above query is a nice example of using a join in which the join condition is not a simple equality. The above query returns the same result as Query 2B because it selects all persons from the left table and adds only locations matching to each person. The matching locations must meet the condition that they are both assigned to the person and also have city equal Brno. Therefore queries 2B and 3A are equivalent.
Although there is nothing technically wrong with this approach, I wouldn’t recommend it to you. Because the join condition and the search condition are merged together, it has disadvantages of joining tables with the cartesian product. Specifically, it makes things harder to debug and more prone to very serious errors. A mistake in the search condition or in the logical operators can easily lead to nonsense results.
In this example I want to show you how subtle changes in the requirement may lead to radical changes in the SQL query and vice versa. You have to be very careful when writing SQL queries – you need to understand the requirement very well and be very precise when implementing it.
It also shows one unfortunate property of SQL. It is easy to run into dead-ends. For example, there is no way to turn the query 1A into returning the 2B result (without using a peculiar join condition). So if you had a requirement 1, which turned into requirement 2 and you initially solved it with query 1A, better rewrite the whole thing.
In SQL aggregation is an operation which merges two or more rows (entities) into one. It has no relation to Aggregation in OOP. Aggregation usually needs an aggregation function (SUM, MAX, AVG, …) which performs the merge.
By definition, aggregation causes loss of information. On the other hand, it enables to gain an overview / high-level information, which would otherwise be difficult to extract. It is used when you are not interested in all details:
Aggregation typically contains a what and a how. When a what can be measured, it is called a metric. An example of aggregation function, which is not a metric is e.g. string_agg. A how defines how that metric should be computed.
The operation or view which allows you to view the details that lead to a particular value of a metric is called drill-down. For example, if you want to know why on March 19, Kate had sales of $9684, you drill-down and look at the list of receipts for that day and that salesperson. All this leads to Business Intelligence, which is far beyond the scope of this book. All I want to show now is what aggregation is used for.
All database servers have some commonly used aggregation functions build in:
COUNT
, SUM
, MAX
, MIN
, AVG
, STDDEV
. Then each database server can have many other
database functions, depending on what was implemented in them. For example, for PostgreSQL,
you can find a current list in the manual.
An aggregation function takes a column (or column_expression) as an argument. For example you can run the following queries on the sample database:
SELECT COUNT(*) FROM person
– number of rows in the person
table (49).SELECT COUNT(height) FROM person
– number of values in the height
column (40).SELECT COUNT(DISTINCT height) FROM person
– number of unique values in the height
column (26).Note that combining DISTINCT
and *
makes no sense. All of the above queries return a single
row – they aggregate the entire table. This is not that useful, because the
metric (or what) is too coarse grained.
To control how the metric is computed,
you need grouping.
SELECT [ ALL | DISTINCT ] column_expression, ... FROM table_expression [ WHERE search_condition ] [ GROUP BY column_expression [, ... ] ] [ HAVING search_condition ] [ ORDER BY { column_expression [ ASC | DESC ] } [, column_expression [ASC | DESC ], ... ]
The GROUP BY
clause is used to divide the input table (everything in the FROM
clause)
into groups. The aggregation function is then applied to each group (the metric is
computed for each group separately). The result of aggregation contains one row per group
(the metric has a single value for each group).
The grouping is defined by the values of a column or columns (or column_expressions). Each distinct value (or combination of values) identifies one group. Although the column is typically ID of an entity, it can be anything.
To get a number of emails for each person having a contact, I will use
the contact
as the source table:
id_contact | id_person | id_contact_type | contact |
---|---|---|---|
2 | 1 | 4 | john@example.com |
3 | 1 | 4 | johnny@example.com |
8 | 1 | 4 | master@example.com |
4 | 2 | 4 | jill@example.com |
6 | 2 | 4 | jj@example.com |
7 | 6 | 4 | tom@example.com |
The metric is number of contacts (therefore I will use the
COUNT
function). The how of the metric is for each person
, therefore
I need to define a separate group for each person, which I can do by
GROUP BY id_person
. Since id_person
is unique for each person, each
group containing a single person id, is guaranteed to contain data
related only to that person.
The complete query is:
The COUNT
function is applied to each group identifier by id_person
values.
It returns a single value for each group, so the output of the query is:
id_person | COUNT |
---|---|
1 | 3 |
2 | 2 |
6 | 1 |
Now there may be some mental twists you need to do here to get into aggregation.
First, notice that I have not used COUNT(contact)
to count contacts. What
I am counting is defined by the source data, not by what I specified in the
argument of the COUNT
function.
This is an important thing to be kept in your mind. For example if you wanted to
select number of days it takes for me to reply to an email
. First, you would have to
obtain a table where each row represents a day you haven’t replied to a
particular email (which is a difficult task!) or use a different
aggregation function (which would be a better solution in this case).
Second (and hopefully this is obvious to you), you need to be aware of
what is unique and where it is unique. It is therefore important to
know what keys
are defined on the tables you are using. For example
in the contact
table, the id_person
column is not unique
(because each person can have multiple contacts), but it is still unique for
each person. Doing a group by on the id_contact
column in the contact
table makes no sense.
Things get more complicated when aggregating columns with
possible NULLs.
For example:
Select the number of emails for each person (including those persons not having any email)
.
A Non-solution is the following query:
The above query is a non-solution because it does not return persons without an
email. Eventhough it uses a LEFT JOIN
it won’t select persons without email
(id_contact_type = 4) because they do not satisfy that condition
(they don’t have any contact which would match it).
To fix this, I need to move the search condition somewhere else, and if you followed the previous example, you should be able to come with both solutions yourself.
Because the sub-query approach is better, I will stick to it. However the query is still incorrect. It now lists all persons (with each person listed only once), but none of the persons have count = 0. This is simply incorrect because there are persons who do not have an email. This — by the way — is a very nice exercise too — select all persons which have no email.
The above aggregation query returns 1 for each person because it uses COUNT(*)
and that
counts the number of rows (for each group), which is one. To count the number of contacts, you
need to count the number of values in a column with contacts. You need to use any column
of the contact
table which is guaranteed to be NOT NULL
. Then, the COUNT
function will
return 1 only for persons which have a contact (no values in the contact
table columns are
are null for them) and 0 only for persons which do not have any
contact (all values in the contact
table columns are null for them). All of them are, and therefore
you can use any column of the contact
table. The true solution therefore is:
Notice that the solution heavily depends on how the columns are defined in the database, so you must be able to understand the table definition. Also counting (or summing) anything properly requires that you wrap your head around the aggregation principle. Hopefully the next example will help you with that.
Consider the following two queries. Can you describe what they return?
The first query returns the number of contacts for each person that has a contact (e.g. John has 4 contacts). The second query returns the number of contacts for each contact type (e.g. There are 14 emails in the database).
When writing aggregation queries, you will often run into the following error:
Column XY must appear in the GROUP BY clause or be used in an aggregate function.
You need to understand the aggregation principle fully to resolve the error correctly. Consider for example the following query:
It will return a table similar to:
id_contact_type | count |
---|---|
1 | 21 |
2 | 24 |
3 | 27 |
4 | 37 |
Now what will happen if you add id_person
column to that table?
What do you expect to be in the id_person
column? Each row of the above table
aggregates multiple persons together, therefore there is no single value that could
be shown in the id_person
column. Therefore, you will get the error:
Error in query: ERROR: column "contact.id_person" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT id_person, id_contact_type, COUNT(contact) FROM
The solution of the problem depends on what you really want to do. If you want to show
the number of contacts for each person and the contact type, you need to add the
id_person
column to the GROUP BY
clause:
This will however yield an entirely different result:
id_person | id_contact_type | count |
---|---|---|
1 | 4 | 1 |
1 | 2 | 1 |
2 | 4 | 1 |
2 | 2 | 1 |
… | … | … |
47 | 4 | 3 |
… | … | … |
If you really want to display all persons for each contact type in a single line,
you need to use some kind of aggregation to do that. For example on PostgreSQL, you
can use the array_agg
function:
Which will give you a table like this:
id_person | id_contact_type | count |
---|---|---|
{8,9,10,11,12,13,14,15,16,17,20,21,22,23,24,28,29,30,32,33,34} | 1 | 21 |
{1,2,3,4,5,6,7,8,9,10,21,22,23,24,25,26,27,28,29,30,31,32,33,34} | 2 | 24 |
{8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34} | 3 | 27 |
{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,47,47,47} | 4 | 37 |
Another possibility is that you want to display every person and her contact and display a total number of contacts for each of the contact type.
Which will give you a table like this:
id_person | id_contact_type | count |
---|---|---|
1 | 4 | 37 |
1 | 2 | 24 |
2 | 4 | 37 |
2 | 2 | 24 |
… | … | … |
As you can see, there are many completely different ways how a query can be written and what it returns. Choosing the right solution really depends on what your goals are. You will probably often realize that what you initially intended does not really make much sense and you need to go back to the drawing board.
Now consider the query below. Does it make sense? Will it execute?
The answer depends on the database server you use. A more intelligent one
will execute the query without an error. A true solution to this query is to
use the GROUP BY person.id_person, person.first_name
.
Again it is very important to understand the database structure. Both the id_person
and first_name
columns come from the person
table. id_person
is a key and
therefore it is unique for each person. If you combine it with any other
column from the persons table, you cannot obtain any more unique combination than that.
If the database server is clever enough, it will simply assume, that the
person.first_name
is part of the GROUP BY
statement as well. If it is not, you have to
tell it explicitly. In either case, the query result remains the same and you can display
additional columns to the group by. Keep in mind that it depends on the columns though,
you cannot use this approach e.g. for the contact.contact
column.
The HAVING
clause of an SQL query contains the search condition applied on the
aggregated table.
SELECT [ ALL | DISTINCT ] column_expression, ... FROM table_expression [ WHERE search_condition ] [ GROUP BY column_expression [, ... ] ] [ HAVING search_condition ] [ ORDER BY { column_expression [ ASC | DESC ] } [, column_expression [ASC | DESC ], ... ]
Consider the following query, which selects the number of emails (contact type 4) for all persons that have an email, and returns only those persons that have more than one email.
The above query will give you:
id_person | nickname | count |
---|---|---|
47 | Francis | 3 |
You cannot put the condition COUNT(contact.id_contact) > 1
in the WHERE
clause
because the aggregation has not been done there yet. Analogously, you cannot put the
contact.id_contact_type = '4'
condition to the HAVING
clause because the
column is not available there anymore.
If you find the HAVING
clause confusing, you can always work around it with a subquery.
The following query is equivalent to the above one:
If you read the definition properly, you may have noticed that
GROUP BY
clause allows you to use a column expression
(not only a column name).
The following query takes advantage of that:
Which will give you:
height_range | count |
---|---|
15 | 1 |
16 | 12 |
17 | 14 |
18 | 10 |
19 | 3 |
NULL | 9 |
Because height
is an integer column and 10
is also an integer, the expression
height / 10
is an integer division. This means that 150 / 10 = 15
and
156 / 10 = 15
as well. If you group by all the people into groups by this
value, then you effectively divide them by their height into groups:
Then you can see how many people fall into each group, which in fact is a
histogram of people’s height. You can pimp the result by using
the CASE - WHEN
statement:
There are two use cases for limiting the number of query results – pagination and showing top x results. Other uses are usually wrong, because they fail to deliver some information to end-user. In either case you must always use ORDER BY when limiting the number of rows.
On PostgreSQL use the keywords LIMIT
and OFFSET
to select a portion
of results – e.g. to display the second page of results paged by 10 records:
See the corresponding part of the walkthrough for an example of implementation of entire pagination in PHP. However, you should be aware that pagination is subject to certain Criticism.
On MySQL server, the same results can be obtained by the following query:
On Microsoft SQL Server and Oracle DB server, the same results can be obtained by the following query:
Other applications of limiting the number of query results are queries such as “TOP 10 best customers” or “TOP 10 smallest persons”. There is a certain confusion about how many results should such a query return. There are three options (assuming the top 10 smallest persons problem):
Each approach may return a different number of results (assuming the sample database):
The above returns 10 rows. Out of the 3 persons with height 168 (Tuan Fuchs, Gilda Summer, Alisha Householder with height 168, only one is shown (randomly selected).
The above query returns 12 rows, highest height is 168.
The above query return 13 rows, highest height is 169.
In this article I have covered some slightly advanced SQL topics – sub-queries and aggregation. However, you have to keep in mind that joins, sub-queries and aggregation are essential parts of SELECT queries. Then there is plenty of other interesting stuff which I completely skipped here. You need to be aware that SQL offers many different approaches to writing queries. I have shown some examples of similar queries and different approaches to same tasks. In SQL, it is very important that you understand very well what you want to do. Then you have to specify that very precisely in the query.
Unfortunately, there is nothing which will help you with analyzing the requirements and working out the way in which you write an SQL query. You need to train your brain to do it for you. This makes writing SQL much more difficult than programming a boilerplate code in PHP. On the other hand, it also makes it a very valued skill. Also please do believe me that rewriting a non-trivial SQL query into procedural language (like PHP) is not a good idea. It takes incredible amount of code which leads to an incredible amount of bugs.