ERROR: column “mystery” must appear in the GROUP BY clause or be used in an aggregate function
I want to share my thoughts about how to think straight about why we need to apply aggregate function when we used
GROUP BY clause in a SQL statement.
I assume that you’ve already had a basic understanding of
GROUP BY and “aggregate functions” in SQL. If you know these two terms and ever used them but are fuzzy about them, I hope this post could help.
A simple visualization of
It’s useful to image we are splitting a table, which consists of rows and columns, horizontally into different groups by common values with the given column.
If we have this table, and we want to apply
GROUP BY to
The common values from
GROUP BY column_name
We can find the common values of
col_one , they are
fuel . This is a bit like
DISTINCT . And it is very straightforward,
GROUP BY a column is just to pick out distinct values of that column, each distinct value corresponds to one group.
Split table horizontally by different groups
Now let’s visually split the table horizontally based on the groups we get.
But we mentioned a group is a distinct value, we have group
apple not group
apple apple apple . So actually what we get is like this:
Add SELECT list into the picture
Now we get 3 groups:
fuel . If we execute SQL like:
SELECT col_one FROM table_name GROUP BY col_one , it will return 3 rows, each row contains a group name. But typically we want to include other columns while applying grouping. We want to know what groups we have, we also want to know extra information about each group, maybe we want to know how many apples we have, or the total weight of all the apples etc.
Say if we want to include
col_two into our result and this column has some text data that includes some color information.
The paradox here is we get 3 group rows from
col_one , but 6 rows from
col_two . If we execute SQL like
SELECT col_one, col_two FROM table_name GROUP BY col_one; we would get an error
ERROR: column “table_name.col_two” must appear in the GROUP BY clause or be used in an aggregate function .
The inconsistency between the number of groups and the number of rows of
col_two leads to this problem. Let’s take the
apple group as an example:
The 3 rows of data
'purple' need to be some way compressed into a single row(value) to fit in the
apple group. And this is also true to other groups. By only doing this can we form a meaningful result from the query.
Use aggregate function as “compressor”
The error message we got earlier gave us some hint about the solution — aggregate function.
In database management, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value.
When one group ties up with multiple rows within one or more selected columns. We can apply aggregate function to each column to get a “summary value”. Essentially it’s just a function that can take one or more arguments(in this case they’re the rows) and return a single value. By doing this the return value of an aggregation function based on multiple rows corresponds to their group can now together form a meaningful row of data.
A note about the execution order of
GROUP BY and aggregate functions: grouping happens before aggregation so that aggregation functions only handle rows of values inside each independent group.
Back to the example of
apple group. What aggregate functions can we apply to the 3 rows relates to the group? There’re many aggregate functions.
If we apply
col_two we may get a SQL query like:
SELECT col_one, count(col_two) FROM table_name GROUP BY col_one; . And the
count() function will return the number of rows of the given column. In this case the number of rows is limited by the group, and it’s
Not only can we apply aggregation functions about math such as count the number of rows(
count) or find the maximum value across the rows(
max), we can also use string aggregation functions like
string_agg(col_name, separator) . This function concatenates strings from multiple rows with the given separator.
The SQL of this could be:
SELECT col_one, string_agg(col_two, ', ') FROM table_name GROUP BY col_one; . Note that
string_agg requires two arguments. Also note that in both examples the second column name has changed to the name of the function.
Think about a table visually
I found it helpful to think visually when facing SQL queries with
GROUP BY in. Just like we often splitting and joining tables vertically such as normalizing a database or joining multiple tables. I think many people have applied visual imagination when learning or working with SQL. It’s a useful way.
Consistency is the key
Our simple example only involves 1 extra columns in the select list. As we add more and more columns into the select list, we just need to carefully check if there is inconsistency of number of rows between the group and the selected columns.
More to explore
There are certainly more complicated case about grouping and aggregation in SQL for example what if we want to grouping by multiple columns. But I believe this visual, step-by-step way of thinking about grouping can be served as a good starting point.
Hope this post can be helpful.