Examples_Workspace / Forward and backfilling data on CH
Many times you have data with null
values and you want to fill them with non-null values.
This is how to do forward and back filling null
values on ClickHouse. It involves using some array functions, and we'll explain step-by-step how to do it and how they work.
This was also posted on our blog.
Last updated
-
raw_data
SELECT * FROM num_str ORDER BY n
88.00B processed, 7 rows x 2 columns. This query took 1.2msnInt16vNullable(String)1 A 2 B 3 None 4 D 5 None 6 None 7 G -
final_result
This is how you'd do it on ClickHouse.
Correlated subqueries as the one used on Postgres aren't available on ClickHouse, but we can do it using array functions.
In the next nodes you can see what's happening in each of the parts of this query.
SELECT tuples.1 n, tuples.2 v, tuples.3 v_ffill, tuples.4 v_bfill FROM (SELECT arrayJoin( arrayZip( groupArray(n) AS n, arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v, arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill, arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill ) ) tuples FROM (SELECT *, ifNull(v, 'wadus') v_nulls_replaced FROM num_str ORDER BY n ASC) )
88.00B processed, 7 rows x 4 columns. This query took 2.37msnInt16vNullable(String)v_ffillStringv_bfillString1 A A A 2 B B B 3 None B D 4 D D D 5 None D G 6 None D G 7 G G G -
replacing_null_values
The core part of the previous query are the functions
arrayFill
andarrayReverseFill
. They take arrays, and the way to generate an array from a column on CH is with thegroupArray
function.All of the aggregate functions on ClickHouse skip null values, so the workaround is replacing them by a value we know is not present in the column. Make sure that you sort the data properly here.
SELECT *, ifNull(v, 'wadus') v_nulls_replaced FROM num_str ORDER BY n ASC
88.00B processed, 7 rows x 3 columns. This query took 1.19msnInt16vNullable(String)v_nulls_replacedString1 A A 2 B B 3 None wadus 4 D D 5 None wadus 6 None wadus 7 G G -
arrays
This is the most important part, where the actual forward and back filling takes place.
arrayFill
andarrayReverseFill
take arrays as inputs and return arrays. In the next nodes we convert those arrays into individual rows.SELECT groupArray(n) AS n, arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v, arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill, arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill FROM replacing_null_values
88.00B processed, 7 rows x 4 columns. This query took 1.26msnArray(Int16)vArray(Nullable(String))v_ffillArray(String)v_bfillArray(String)[1, 2, 3, 4, 5, 6, 7] ['A', 'B', None, 'D', None, None, 'G'] ['A', 'B', 'B', 'D', 'D', 'D', 'G'] ['A', 'B', 'D', 'D', 'G', 'G', 'G'] -
using_arrayZip
arrayZip
combines N arrays with M elements into a single array with M tuples containing N elements eachSELECT arrayZip(n, v, v_bfill, v_ffill) zipped_array FROM arrays
88.00B processed, 7 rows x 1 columns. This query took 1.12mszipped_arrayArray(Tuple(Int16, Nullable(String), String, String))[[1, 'A', 'A', 'A'], [2, 'B', 'B', 'B'], [3, None, 'D', 'B'], [4, 'D', 'D', 'D'], [5, None, 'G', 'D'], [6, None, 'G', 'D'], [7, 'G', 'G', 'G']] -
using_arrayJoin
arrayJoin
is the ClickHouse equivalent ofunnest
in Postgres. It takes an array of N elements and expands it to N rows with one element each.SELECT arrayJoin(zipped_array) tuples FROM using_arrayZip
88.00B processed, 7 rows x 1 columns. This query took 1.33mstuplesTuple(Int16, Nullable(String), String, String)[1, 'A', 'A', 'A'] [2, 'B', 'B', 'B'] [3, None, 'D', 'B'] [4, 'D', 'D', 'D'] [5, None, 'G', 'D'] [6, None, 'G', 'D'] [7, 'G', 'G', 'G'] -
result_extracting_values_from_tuples
This is the final result. In it, we just have to extract each one of the values from the tuples created in the previous step.
SELECT tuples.1 n, tuples.2 v, tuples.3 v_ffill, tuples.4 v_bfill FROM using_arrayJoin
88.00B processed, 7 rows x 4 columns. This query took 1.27msnInt16vNullable(String)v_ffillStringv_bfillString1 A A A 2 B B B 3 None D B 4 D D D 5 None G D 6 None G D 7 G G G -
bonus_fill_with_constant
If you just need to fill the null values with a constant value, either setting it manually or getting it from the data (here
v_mode_fill
fills the null values with the most common value of thet column), this is how you'd do it.SELECT n, v, ifNull(v, 'Z') v_simple_fill, ifNull(v, (SELECT topK(1)(v)[1] from raw_data) as mode) as v_mode_fill FROM raw_data
88.00B processed, 7 rows x 4 columns. This query took 2.62msnInt16vNullable(String)v_simple_fillStringv_mode_fillString1 A A A 2 B B B 3 None Z A 4 D D D 5 None Z A 6 None Z A 7 G G G -
the_end
Tinybird lets you use a notebook-like interface like this to develop real-time endpoints on large amounts of data, powered by ClickHouse.
If you'd like to use it, sign up for a free account here
SELECT 'Thanks!'
1.00B processed, 1 rows x 1 columns. This query took 0.14ms'Thanks!'StringThanks!