Federated Querying using DuckDB on blockchain data

KowalskiDeFi
9 min read · Mar 20, 2024


A brief venture into the DuckDB world; this is hopefully part 1 of N investigations and deep dives. Please take everything with a pinch of salt, given this is my first time using the technology. This is a brain dump rather than a super specific, targeted article. I am not being paid to write any of my opinions; they come purely from using DuckDB and thinking about ways to use it in the future.

Intro to Federated Queries

Federated queries are a powerful feature that allows users to query data across multiple databases or data sources as if they were a single, unified database. This capability is particularly useful in scenarios where data is distributed across different systems, and users need to perform complex queries that involve data from multiple sources.

Figure 1: A productive DuckDB setup for large (100M+ size) analytical workloads (reference: https://thedataquarry.com/posts/embedded-db-1/)

The syntax for federated queries is similar to standard SQL join queries, with the addition of specifying the location of the tables being joined. This location information can include the dataset ID, table name, and column names, which are essential for accurately referencing the data from different sources. For example, in a scenario where you have a dataset with client account information and another dataset with additional account details, you can use a federated query to combine these tables based on a common identifier, such as an account ID.

To simplify the query and avoid having to write out the full dataset and table names each time, you can use table aliases. This allows you to reference the tables more concisely within your query. Additionally, if the datasets you are querying are part of the same project, you can achieve implicit federation by simply referencing the tables without needing to specify their dataset location.
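
As a rough sketch of the idea in DuckDB (the engine used throughout this article), a federated join could look like the snippet below. The file names, the accounts and account_details tables, and the account_id join key are all hypothetical placeholders.

import duckdb

conn = duckdb.connect()  # in-memory DuckDB instance

# Attach an external SQLite database as its own catalog (file name is a placeholder).
conn.query("INSTALL sqlite;")
conn.query("LOAD sqlite;")
conn.query("ATTACH 'clients.sqlite' AS clients_db (TYPE sqlite);")

# Join the SQLite table with a Parquet file as if they lived in one database,
# using table aliases to keep the query concise.
result = conn.query("""
    SELECT a.account_id, a.client_name, d.account_tier
    FROM clients_db.accounts AS a
    JOIN read_parquet('account_details.parquet') AS d
      ON a.account_id = d.account_id;
""")
print(result)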

Federated queries offer several advantages, including:

  • Simplifying data access by providing a unified view of data from multiple sources.
  • Allowing for real-time data access without the need for caching or data movement.
  • Reducing the need for physical data consolidation, which can be costly and disruptive.

However, there are also challenges associated with federated queries, such as:

  • Difficulty in knowing where the underlying data is stored.
  • Potential performance issues compared to traditional data warehousing solutions.
  • The inability to write to the external data sources from the federated query engine, which limits operations to read-only access.

In summary, federated queries provide a flexible and powerful way to access and combine data from multiple sources, offering significant benefits in terms of data access and analysis. However, it’s important to be aware of the potential challenges and limitations and to consider these factors when designing and implementing federated query strategies.

DuckDB for Blockchain Analysis

At the bare minimum, DuckDB offers a simple and convenient wrapper to quickly fetch data from your DB and run some aggregations in your favourite programming language. As of today, you can install and use DuckDB in all major programming languages and on all major platforms.

Figure 2: Bare-minimum usage: use DuckDB as a wrapper to connect to and query your data, then process that data in your favourite language.

Using DuckDB, you can create native tables from your DB like so:

conn.query("CREATE TABLE duck_logs AS FROM db.ethereum_mainnet.logs;")

Then combine your table with data from other sources such as Parquet, Parquet + Iceberg, MySQL, raw JSON, SQLite, and many more formats.
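
A quick sketch of some of those reads, with placeholder file paths:

# Hypothetical file paths; each external source is queried in place, with no ETL step.
conn.query("SELECT count(*) FROM read_parquet('exports/logs/*.parquet');")
conn.query("SELECT * FROM read_json_auto('labels.json') LIMIT 5;")
conn.query("SELECT * FROM read_csv_auto('prices.csv') LIMIT 5;")

# Iceberg tables are readable through the iceberg extension (path is a placeholder).
conn.query("INSTALL iceberg;")
conn.query("LOAD iceberg;")
conn.query("SELECT count(*) FROM iceberg_scan('exports/iceberg/logs');")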

A very common use case in the blockchain space is joining different feeds of data (price data, labels, user actions from a UI) with indexed, on-chain data. All of this data can be produced by completely different streams and distributed in a bunch of different formats. This is where federated-query-enabling tools such as DuckDB shine, by letting the user create read references to all of these files and do the join in memory.

Since JSON or label CSV data are usually not heavy to load into memory, it does not matter whether they come from a spreadsheet, a data warehouse, or some other format. DuckDB lets you do a quick in-memory join without building extra pipelines that store the same data redundantly and introduce duplication and other data-quality issues, not to mention the time and effort required to build custom pipelines that consolidate all of the other sources into a unified one.
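
A minimal sketch of such a join, assuming a hypothetical labels.csv with contract_address and project_name columns, next to the attached PostgreSQL logs table used throughout this post:

# Enrich the on-chain logs with a lightweight CSV of contract labels,
# entirely in memory, and hand the result to pandas for further processing.
labelled = conn.query("""
    SELECT lg.block_time, lg.contract_address, lbl.project_name
    FROM db.ethereum_mainnet.logs AS lg
    LEFT JOIN read_csv_auto('labels.csv') AS lbl
      ON lg.contract_address = lbl.contract_address
    WHERE lg.block_time >= '2024-03-06'
    LIMIT 100000;
""").df()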

It is for this reason that I think it makes sense, for most common blockchain analysis applications, to use DuckDB instead of Spark for performing analytical aggregations on the fly! The ability to do all of that in-house, with just a few lines of code, and with equivalent if not better performance than Spark [7][8], for all your aggregation needs, is unparalleled!

What I am proposing here is not an original idea at all, but is still controversial! So let’s break it down…

The case for most blockchain workloads not requiring Spark

For datasets that do not reach petabytes in size, it may not be time-efficient to distribute the data across multiple nodes for processing. Given the significant computational power of a modern single machine, routing such calculations through Spark's distributed processing can end up being the slower option.

Spark is designed to efficiently handle large-scale data processing tasks, including those that involve petabytes of data, by optimizing memory usage, code execution, and data shuffling across the cluster. This design allows Spark to perform complex computations more quickly and effectively than traditional single-node processing, especially when dealing with large volumes of data.

Figure 3: Findings for Spark vs DuckDB + DBT, showing that Spark is, at the median, roughly 2x slower (reference: https://medium.com/datamindedbe/use-dbt-and-duckdb-instead-of-spark-in-data-pipelines-9063a31ea2b5)

A concrete example of something that cannot be run faster on DuckDB + DBT than on a Spark + DBT job is any aggregation over Solana data, where you are dealing with close to a petabyte (or more) of data.

Blockchain data is, in general, full of cryptographic hashes, meaning there are a lot of hash-like fields/columns that become points of interest. Because of that, such data should (ideally) live in some form of database, so that these hash-like fields can be indexed and point queries against them return with very low latency.

But… there is also a need for analytical queries on the same data, so it also needs to live in a columnar, analytics-friendly store. The most common approach to analytics on blockchain data is to store it in Parquet format, partitioned by some global index (or non-composite partition key) such as block_number or block_date, and to query it with Apache Spark, Databricks, Flink, Google Dataflow, or similar engines.

Whilst this may be fine, and you may be happy paying a premium for these services on top of your freshly exported Parquet data, smaller companies or data analysts with very specific analytics needs may not have the same appetite or burn-rate leeway to experiment with such things.

The solution in all of these situations is to use DuckDB: create DuckDB native tables (references) and then perform all the joins for your aggregation needs in memory, on a very performant EC2 instance or, even better, on actual hardware, to be even more efficient in both performance and cost.
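
A hedged sketch of that workflow, assuming the logs were exported to block_date-partitioned Parquet under a placeholder path:

import duckdb

conn = duckdb.connect()

# Create a native DuckDB table from block_date-partitioned Parquet exports;
# the path and partition layout are placeholders for your own export job.
conn.query("""
    CREATE TABLE logs AS
    FROM read_parquet('exports/logs/block_date=*/*.parquet', hive_partitioning = true);
""")

# Run the analytical aggregation fully in memory on the single machine.
daily_counts = conn.query("""
    SELECT block_date, contract_address, count(*) AS n_logs
    FROM logs
    WHERE block_date >= '2024-03-01'
    GROUP BY block_date, contract_address
    ORDER BY n_logs DESC;
""").df()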

Best Practices and My Findings

This section focuses on PostgreSQL read performance using the direct connector and DuckDB reference tables.

Reading the DuckDB documentation on best practices, there is a section on The Effect of Ordering describing how ordering DATETIME columns yields faster query times.

In my experience, this is not the case: I have tried a btree index on both the block_date DATETIME field and the block_time TIMESTAMP field, and ordering made the total query fetch time slower instead of faster. This might be different for columnar operations, such as AVG, or EXTRACT of seconds/minutes from DATETIME or TIMESTAMP fields.

Comparing 1:1 with a simple SELECT * query, not ordering the result is faster than ordering by the DATETIME or TIMESTAMP field:

%timeit conn.query("SELECT * FROM db.ethereum_mainnet.logs ORDER BY block_time LIMIT 100000;")
# 458 ms ± 8.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit conn.query("SELECT * FROM db.ethereum_mainnet.logs ORDER BY block_date LIMIT 100000;")
# 445 ms ± 8.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit conn.query("SELECT * FROM db.ethereum_mainnet.logs LIMIT 100000;")
# 303 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With indices:

Indexes:
"logs_pkey" PRIMARY KEY, btree (block_number, transaction_hash, log_index)
"logs_block_date_idx" btree (block_date)
"logs_contract_address_idx" btree (contract_address)
"logs_date_idx" btree (date(block_time))
"logs_topic_1_idx" btree ((topics[1]) NULLS FIRST)

Filtering on DATETIME or TIMESTAMP fields has the expected effect. Since logs_date_idx and logs_block_date_idx are effectively equivalent, filtering on either of them gives the same performance:

%timeit conn.query("SELECT * FROM db.ethereum_mainnet.logs WHERE block_time >= '2024-03-06' LIMIT 100000;")
85 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit conn.query("SELECT * FROM db.ethereum_mainnet.logs WHERE block_date >= '2024-03-06' LIMIT 100000;")
84.3 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But if you have the luxury (size-wise) of storing a direct btree index on block_time, you can get slightly better performance.
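
The index in question is created on the PostgreSQL side (the index name below is a placeholder), and re-running the same query afterwards gives:

CREATE INDEX logs_block_time_idx ON ethereum_mainnet.logs USING btree (block_time);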

%timeit conn.query("SELECT * FROM db.ethereum_mainnet.logs WHERE block_time >= '2024-03-06' LIMIT 100000;")
79.7 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This is all quite standard PostgreSQL performance tuning. After all, it seems the performance best practices that DuckDB offers relate to the in-memory tables created using DuckDB, and may not apply when performing queries on top of the main data source itself.

When creating a DuckDB native table from your PostgreSQL connection, performance seems to be pretty much the same, as described above:

conn.query("CREATE TABLE logs AS FROM db.ethereum_mainnet.logs;")
%timeit -n 7 conn.query("SELECT * FROM logs ORDER BY block_time DESC;")
141 µs ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 7 loops each)
%timeit -n 7 conn.query("SELECT * FROM logs;")
98.5 µs ± 36.5 µs per loop (mean ± std. dev. of 7 runs, 7 loops each)

This means there is no inherent performance gain, at least not practically, from using native DuckDB tables at the moment. Comparing with the earlier section, the reader can see there is even a slight degradation in performance when creating a DuckDB table and querying on top of it.

Summary

DuckDB is a new, lightweight, and innovative way to take away the need to reach for traditional distributed data tools as the solution to everything. It shines in providing an easy-to-use interface (at the bare minimum) for analytical use cases on top of large data, whilst also offering much cleaner, up-to-date SQL syntax and better performance when reading Parquet and Parquet + Iceberg files. Having said that, DuckDB is brand new, and in its current stable state (v0.10.1) it is still missing quite a lot of easy-to-add features.

Even so, I am personally looking forward to using this in a production-grade environment and to moving my own personal (outside work) workloads to DuckDB for any complex aggregations, with Python handling in-memory de-duplication and appends.

References

  1. DuckDB Alternatives and competitive landscape: https://thedataquarry.com/posts/embedded-db-1/
  2. DuckDB PostgreSQL Extension: https://duckdb.org/docs/extensions/postgres.html
  3. DuckDB Best Practices: https://duckdb.org/docs/guides/performance/indexing
  4. DuckDB DBT: https://docs.getdbt.com/docs/core/connect-data-platform/duckdb-setup
  5. DuckDB Install page: https://duckdb.org/docs/installation/?version=stable&environment=cli&platform=linux&download_method=package_manager
  6. DuckDB x Iceberg API: https://towardsdatascience.com/boost-your-cloud-data-applications-with-duckdb-and-iceberg-api-67677666fbd3
  7. DuckDB “vs” Spark: https://motherduck.com/blog/making-pyspark-code-faster-with-duckdb/
  8. DuckDB + DBT vs Spark (or Spark DBT): https://medium.com/datamindedbe/use-dbt-and-duckdb-instead-of-spark-in-data-pipelines-9063a31ea2b5
