Many times you have data with null
values and you want to fill them with non-null values.
This is how to do forward and back filling null
values on ClickHouse. It involves using some array functions, and we'll explain step-by-step how to do it and how they work.
This was also posted on our blog.
Last updated
SELECT * FROM num_str ORDER BY n
n
Int16
|
v
Nullable(String)
|
---|---|
1 | A |
2 | B |
3 | None |
4 | D |
5 | None |
6 | None |
7 | G |
This is how you'd do it on ClickHouse.
Correlated subqueries as the one used on Postgres aren't available on ClickHouse, but we can do it using array functions.
In the next nodes you can see what's happening in each of the parts of this query.
SELECT tuples.1 n, tuples.2 v, tuples.3 v_ffill, tuples.4 v_bfill FROM (SELECT arrayJoin( arrayZip( groupArray(n) AS n, arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v, arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill, arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill ) ) tuples FROM (SELECT *, ifNull(v, 'wadus') v_nulls_replaced FROM num_str ORDER BY n ASC) )
n
Int16
|
v
Nullable(String)
|
v_ffill
String
|
v_bfill
String
|
---|---|---|---|
1 | A | A | A |
2 | B | B | B |
3 | None | B | D |
4 | D | D | D |
5 | None | D | G |
6 | None | D | G |
7 | G | G | G |
The core part of the previous query are the functions arrayFill
and arrayReverseFill
. They take arrays, and the way to generate an array from a column on CH is with the groupArray
function.
All of the aggregate functions on ClickHouse skip null values, so the workaround is replacing them by a value we know is not present in the column. Make sure that you sort the data properly here.
SELECT *, ifNull(v, 'wadus') v_nulls_replaced FROM num_str ORDER BY n ASC
n
Int16
|
v
Nullable(String)
|
v_nulls_replaced
String
|
---|---|---|
1 | A | A |
2 | B | B |
3 | None | wadus |
4 | D | D |
5 | None | wadus |
6 | None | wadus |
7 | G | G |
This is the most important part, where the actual forward and back filling takes place. arrayFill
and arrayReverseFill
take arrays as inputs and return arrays. In the next nodes we convert those arrays into individual rows.
SELECT groupArray(n) AS n, arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v, arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill, arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill FROM replacing_null_values
n
Array(Int16)
|
v
Array(Nullable(String))
|
v_ffill
Array(String)
|
v_bfill
Array(String)
|
---|---|---|---|
[1, 2, 3, 4, 5, 6, 7] | ['A', 'B', None, 'D', None, None, 'G'] | ['A', 'B', 'B', 'D', 'D', 'D', 'G'] | ['A', 'B', 'D', 'D', 'G', 'G', 'G'] |
arrayZip
combines N arrays with M elements into a single array with M tuples containing N elements each
SELECT arrayZip(n, v, v_bfill, v_ffill) zipped_array FROM arrays
zipped_array
Array(Tuple(Int16, Nullable(String), String, String))
|
---|
[[1, 'A', 'A', 'A'], [2, 'B', 'B', 'B'], [3, None, 'D', 'B'], [4, 'D', 'D', 'D'], [5, None, 'G', 'D'], [6, None, 'G', 'D'], [7, 'G', 'G', 'G']] |
arrayJoin
is the ClickHouse equivalent of unnest
in Postgres. It takes an array of N elements and expands it to N rows with one element each.
SELECT arrayJoin(zipped_array) tuples FROM using_arrayZip
tuples
Tuple(Int16, Nullable(String), String, String)
|
---|
[1, 'A', 'A', 'A'] |
[2, 'B', 'B', 'B'] |
[3, None, 'D', 'B'] |
[4, 'D', 'D', 'D'] |
[5, None, 'G', 'D'] |
[6, None, 'G', 'D'] |
[7, 'G', 'G', 'G'] |
This is the final result. In it, we just have to extract each one of the values from the tuples created in the previous step.
SELECT tuples.1 n, tuples.2 v, tuples.3 v_ffill, tuples.4 v_bfill FROM using_arrayJoin
n
Int16
|
v
Nullable(String)
|
v_ffill
String
|
v_bfill
String
|
---|---|---|---|
1 | A | A | A |
2 | B | B | B |
3 | None | D | B |
4 | D | D | D |
5 | None | G | D |
6 | None | G | D |
7 | G | G | G |
If you just need to fill the null values with a constant value, either setting it manually or getting it from the data (here v_mode_fill
fills the null values with the most common value of thet column), this is how you'd do it.
SELECT n, v, ifNull(v, 'Z') v_simple_fill, ifNull(v, (SELECT topK(1)(v)[1] from raw_data) as mode) as v_mode_fill FROM raw_data
n
Int16
|
v
Nullable(String)
|
v_simple_fill
String
|
v_mode_fill
String
|
---|---|---|---|
1 | A | A | A |
2 | B | B | B |
3 | None | Z | A |
4 | D | D | D |
5 | None | Z | A |
6 | None | Z | A |
7 | G | G | G |
Tinybird lets you use a notebook-like interface like this to develop real-time endpoints on large amounts of data, powered by ClickHouse.
If you'd like to use it, sign up for a free account here
SELECT 'Thanks!'
'Thanks!'
String
|
---|
Thanks! |