Forward and backfilling data on CH

Examples_Workspace / Forward and backfilling data on CH

Many times you have data with null values and you want to fill them with non-null values.

This is how to do forward and back filling null values on ClickHouse. It involves using some array functions, and we'll explain step-by-step how to do it and how they work.

This was also posted on our blog.

Last updated 2 years ago

raw_data

This is a sample dataset, with a numeric column and a string column with some null values.

We want to fill nulls with the previous and next non-null values, something you'd do in Pandas with the .fillna method or on Postgres like this
```
SELECT * FROM num_str
ORDER BY n
```
88.00B processed, 7 rows x 2 columns. This query took 1.2ms
· View chart
n

Int16

v

Nullable(String)

1 A

2 B

3 None

4 D

5 None

6 None

7 G

n Int16	v Nullable(String)
1	A
2	B
3	None
4	D
5	None
6	None
7	G

final_result

This is how you'd do it on ClickHouse.

Correlated subqueries as the one used on Postgres aren't available on ClickHouse, but we can do it using array functions.

In the next nodes you can see what's happening in each of the parts of this query.

SELECT
  tuples.1 n,
  tuples.2 v,
  tuples.3 v_ffill,
  tuples.4 v_bfill
FROM
  (SELECT
      arrayJoin(
        arrayZip(
          groupArray(n) AS n,
          arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v,
          arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill,
          arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill
        )
      ) tuples
  FROM
     (SELECT 
        *,
        ifNull(v, 'wadus') v_nulls_replaced
      FROM num_str 
      ORDER BY n ASC)
  )

88.00B processed, 7 rows x 4 columns. This query took 2.37ms

· View chart

n Int16	v Nullable(String)	v_ffill String	v_bfill String
1	A	A	A
2	B	B	B
3	None	B	D
4	D	D	D
5	None	D	G
6	None	D	G
7	G	G	G

replacing_null_values

The core part of the previous query are the functions arrayFill and arrayReverseFill. They take arrays, and the way to generate an array from a column on CH is with the groupArray function.

All of the aggregate functions on ClickHouse skip null values, so the workaround is replacing them by a value we know is not present in the column. Make sure that you sort the data properly here.
```
SELECT
    *,
    ifNull(v, 'wadus') v_nulls_replaced
FROM num_str 
ORDER BY n ASC
```
88.00B processed, 7 rows x 3 columns. This query took 1.19ms
· View chart
n

Int16

v

Nullable(String)

v_nulls_replaced

String

1 A A

2 B B

3 None wadus

4 D D

5 None wadus

6 None wadus

7 G G

n Int16	v Nullable(String)	v_nulls_replaced String
1	A	A
2	B	B
3	None	wadus
4	D	D
5	None	wadus
6	None	wadus
7	G	G

arrays

This is the most important part, where the actual forward and back filling takes place. arrayFill and arrayReverseFill take arrays as inputs and return arrays. In the next nodes we convert those arrays into individual rows.

SELECT
    groupArray(n) AS n,
    arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v,
    arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill,
    arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill
FROM replacing_null_values

88.00B processed, 7 rows x 4 columns. This query took 1.26ms

· View chart

n Array(Int16)	v Array(Nullable(String))	v_ffill Array(String)	v_bfill Array(String)
[1, 2, 3, 4, 5, 6, 7]	['A', 'B', None, 'D', None, None, 'G']	['A', 'B', 'B', 'D', 'D', 'D', 'G']	['A', 'B', 'D', 'D', 'G', 'G', 'G']

using_arrayZip

arrayZip combines N arrays with M elements into a single array with M tuples containing N elements each

SELECT arrayZip(n, v, v_bfill, v_ffill) zipped_array
FROM arrays

88.00B processed, 7 rows x 1 columns. This query took 1.12ms

· View chart

zipped_array Array(Tuple(Int16, Nullable(String), String, String))
[[1, 'A', 'A', 'A'], [2, 'B', 'B', 'B'], [3, None, 'D', 'B'], [4, 'D', 'D', 'D'], [5, None, 'G', 'D'], [6, None, 'G', 'D'], [7, 'G', 'G', 'G']]

using_arrayJoin

arrayJoin is the ClickHouse equivalent of unnest in Postgres. It takes an array of N elements and expands it to N rows with one element each.

SELECT arrayJoin(zipped_array) tuples FROM using_arrayZip

88.00B processed, 7 rows x 1 columns. This query took 1.33ms

· View chart

tuples Tuple(Int16, Nullable(String), String, String)
[1, 'A', 'A', 'A']
[2, 'B', 'B', 'B']
[3, None, 'D', 'B']
[4, 'D', 'D', 'D']
[5, None, 'G', 'D']
[6, None, 'G', 'D']
[7, 'G', 'G', 'G']

result_extracting_values_from_tuples

This is the final result. In it, we just have to extract each one of the values from the tuples created in the previous step.

SELECT
  tuples.1 n,
  tuples.2 v,
  tuples.3 v_ffill,
  tuples.4 v_bfill
 FROM using_arrayJoin

88.00B processed, 7 rows x 4 columns. This query took 1.27ms

· View chart

n Int16	v Nullable(String)	v_ffill String	v_bfill String
1	A	A	A
2	B	B	B
3	None	D	B
4	D	D	D
5	None	G	D
6	None	G	D
7	G	G	G

bonus_fill_with_constant

If you just need to fill the null values with a constant value, either setting it manually or getting it from the data (here v_mode_fill fills the null values with the most common value of thet column), this is how you'd do it.

SELECT
    n,
    v,
    ifNull(v, 'Z') v_simple_fill,
    ifNull(v, (SELECT topK(1)(v)[1] from raw_data) as mode) as v_mode_fill
FROM raw_data

88.00B processed, 7 rows x 4 columns. This query took 2.62ms

· View chart

n Int16	v Nullable(String)	v_simple_fill String	v_mode_fill String
1	A	A	A
2	B	B	B
3	None	Z	A
4	D	D	D
5	None	Z	A
6	None	Z	A
7	G	G	G

the_end

Tinybird lets you use a notebook-like interface like this to develop real-time endpoints on large amounts of data, powered by ClickHouse.

If you'd like to use it, sign up for a free account here
```
SELECT 'Thanks!'
```
1.00B processed, 1 rows x 1 columns. This query took 0.14ms
· View chart
'Thanks!'

String

Thanks!

Examples_Workspace / Forward and backfilling data on CH

raw_data

final_result

replacing_null_values

arrays

using_arrayZip

using_arrayJoin

result_extracting_values_from_tuples

bonus_fill_with_constant

the_end