ERROR: column “mystery” must appear in the GROUP BY clause or be used in an aggregate function

xullnn
5 min readFeb 6, 2021

I want to share my thoughts about how to think straight about why we need to apply aggregate function when we used GROUP BY clause in a SQL statement.

I assume that you’ve already had a basic understanding of GROUP BY and “aggregate functions” in SQL. If you know these two terms and ever used them but are fuzzy about them, I hope this post could help.

A simple visualization of GROUP BY

It’s useful to image we are splitting a table, which consists of rows and columns, horizontally into different groups by common values with the given column.

If we have this table, and we want to apply GROUP BY to col_one :

The common values from GROUP BY column_name

We can find the common values of col_one , they are apple , cat and fuel . This is a bit like DISTINCT . And it is very straightforward, GROUP BY a column is just to pick out distinct values of that column, each distinct value corresponds to one group.

Split table horizontally by different groups

Now let’s visually split the table horizontally based on the groups we get.

But we mentioned a group is a distinct value, we have group apple not group apple apple apple . So actually what we get is like this:

Add SELECT list into the picture

Now we get 3 groups: apple , cat and fuel . If we execute SQL like: SELECT col_one FROM table_name GROUP BY col_one , it will return 3 rows, each row contains a group name. But typically we want to include other columns while applying grouping. We want to know what groups we have, we also want to know extra information about each group, maybe we want to know how many apples we have, or the total weight of all the apples etc.

Say if we want to include col_two into our result and this column has some text data that includes some color information.

The paradox here is we get 3 group rows from col_one , but 6 rows from col_two . If we execute SQL like SELECT col_one, col_two FROM table_name GROUP BY col_one; we would get an error ERROR: column “table_name.col_two” must appear in the GROUP BY clause or be used in an aggregate function .

The inconsistency between the number of groups and the number of rows of col_two leads to this problem. Let’s take the apple group as an example:

The 3 rows of data 'red' , 'green' and 'purple' need to be some way compressed into a single row(value) to fit in the apple group. And this is also true to other groups. By only doing this can we form a meaningful result from the query.

Use aggregate function as “compressor”

The error message we got earlier gave us some hint about the solution — aggregate function.

In database management, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value.

— wikipedia

When one group ties up with multiple rows within one or more selected columns. We can apply aggregate function to each column to get a “summary value”. Essentially it’s just a function that can take one or more arguments(in this case they’re the rows) and return a single value. By doing this the return value of an aggregation function based on multiple rows corresponds to their group can now together form a meaningful row of data.

A note about the execution order of GROUP BY and aggregate functions: grouping happens before aggregation so that aggregation functions only handle rows of values inside each independent group.

Back to the example of apple group. What aggregate functions can we apply to the 3 rows relates to the group? There’re many aggregate functions.

one example

If we apply count() to col_two we may get a SQL query like: SELECT col_one, count(col_two) FROM table_name GROUP BY col_one; . And the count() function will return the number of rows of the given column. In this case the number of rows is limited by the group, and it’s 3 .

another example

Not only can we apply aggregation functions about math such as count the number of rows(count) or find the maximum value across the rows(max), we can also use string aggregation functions like string_agg(col_name, separator) . This function concatenates strings from multiple rows with the given separator.

The SQL of this could be: SELECT col_one, string_agg(col_two, ', ') FROM table_name GROUP BY col_one; . Note that string_agg requires two arguments. Also note that in both examples the second column name has changed to the name of the function.

Summary

Think about a table visually

I found it helpful to think visually when facing SQL queries with GROUP BY in. Just like we often splitting and joining tables vertically such as normalizing a database or joining multiple tables. I think many people have applied visual imagination when learning or working with SQL. It’s a useful way.

Consistency is the key
Our simple example only involves 1 extra columns in the select list. As we add more and more columns into the select list, we just need to carefully check if there is inconsistency of number of rows between the group and the selected columns.

More to explore

There are certainly more complicated case about grouping and aggregation in SQL for example what if we want to grouping by multiple columns. But I believe this visual, step-by-step way of thinking about grouping can be served as a good starting point.

Hope this post can be helpful.

--

--