# ERROR: column “mystery” must appear in the GROUP BY clause or be used in an aggregate function

I want to share my thoughts about how to think straight about why we need to apply aggregate function when we used `GROUP BY`

clause in a SQL statement.

I assume that you’ve already had a basic understanding of `GROUP BY`

and “aggregate functions” in SQL. If you know these two terms and ever used them but are fuzzy about them, I hope this post could help.

# A simple visualization of `GROUP BY`

It’s useful to image we are splitting a table, which consists of rows and columns, horizontally into different groups by common values with the given column.

If we have this table, and we want to apply `GROUP BY`

to `col_one`

:

**The common values from ****GROUP BY column_name**

We can find the common values of `col_one`

, they are `apple`

, `cat`

and `fuel`

. This is a bit like `DISTINCT`

. And it is very straightforward, `GROUP BY`

a column is just to pick out distinct values of that column, each distinct value corresponds to one group.

**Split table horizontally by different groups**

Now let’s visually split the table horizontally based on the groups we get.

But we mentioned a group is a distinct value, we have group `apple`

not group `apple apple apple`

. So actually what we get is like this:

# Add SELECT list into the picture

Now we get 3 groups: `apple`

, `cat`

and `fuel`

. If we execute SQL like: `SELECT col_one FROM table_name GROUP BY col_one`

, it will return 3 rows, each row contains a group name. But typically we want to include other columns while applying grouping. We want to know what groups we have, we also want to know extra information about each group, maybe we want to know how many apples we have, or the total weight of all the apples etc.

Say if we want to include `col_two`

into our result and this column has some text data that includes some color information.

The paradox here is we get 3 group rows from `col_one`

, but 6 rows from `col_two`

. If we execute SQL like `SELECT col_one, col_two FROM table_name GROUP BY col_one;`

we would get an error `ERROR: column “table_name.col_two” must appear in the GROUP BY clause or be used in an aggregate function`

.

The inconsistency between the number of groups and the number of rows of `col_two`

leads to this problem. Let’s take the `apple`

group as an example:

The 3 rows of data `'red'`

, `'green'`

and `'purple'`

need to be some way compressed into a single row(value) to fit in the `apple`

group. And this is also true to other groups. By only doing this can we form a meaningful result from the query.

# Use aggregate function as “compressor”

The error message we got earlier gave us some hint about the solution — *aggregate function*.

In database management, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value.

— wikipedia

When one group ties up with multiple rows within one or more selected columns. We can apply aggregate function to each column to get a “summary value”. Essentially it’s just a function that can take one or more arguments(in this case they’re the rows) and return a single value. By doing this the return value of an aggregation function based on multiple rows corresponds to their group can now together form a meaningful row of data.

A note about the execution order of `GROUP BY`

and aggregate functions: grouping happens before aggregation so that aggregation functions only handle rows of values inside each independent group.

Back to the example of `apple`

group. What aggregate functions can we apply to the 3 rows relates to the group? There’re many aggregate functions.

**one example**

If we apply `count()`

to `col_two`

we may get a SQL query like: `SELECT col_one, count(col_two) FROM table_name GROUP BY col_one;`

. And the `count()`

function will return the number of rows of the given column. In this case the number of rows is limited by the group, and it’s `3`

.

**another example**

Not only can we apply aggregation functions about math such as count the number of rows(`count`

) or find the maximum value across the rows(`max`

), we can also use string aggregation functions like `string_agg(col_name, separator)`

. This function concatenates strings from multiple rows with the given separator.

The SQL of this could be: `SELECT col_one, string_agg(col_two, ', ') FROM table_name GROUP BY col_one;`

. Note that `string_agg`

requires two arguments. Also note that in both examples the second column name has changed to the name of the function.

# Summary

**Think about a table visually**

I found it helpful to think visually when facing SQL queries with `GROUP BY`

in. Just like we often splitting and joining tables vertically such as normalizing a database or joining multiple tables. I think many people have applied visual imagination when learning or working with SQL. It’s a useful way.

**Consistency is the key**

Our simple example only involves 1 extra columns in the select list. As we add more and more columns into the select list, we just need to carefully check if there is inconsistency of number of rows between the group and the selected columns.

**More to explore**

There are certainly more complicated case about grouping and aggregation in SQL for example what if we want to grouping by multiple columns. But I believe this visual, step-by-step way of thinking about grouping can be served as a good starting point.

Hope this post can be helpful.