Interoperability between DuckDB and GeoPandas

Introduction

DuckDB provides easy integration with Pandas, but what about GeoPandas? In this post, we will explore how to use DuckDB with GeoPandas and vice-versa for geospatial data processing.

Pandas DataFrame from DuckDB

DuckDB Provides simple to use to_df() and df() methods to convert DuckDB tables to Pandas DataFrames.

import duckdb

conn = duckdb.connect()
conn.sql("""CREATE TABLE t1 AS
    SELECT 42 AS i, 84 AS j;""")

df = conn.table("t1").to_df()

# you can also query before converting to DataFrame
filtered_df = conn.sql("SELECT * FROM t1 WHERE i > 40;").df()

DuckDB from Pandas DataFrame

This is much more straightforward, you can simply use the dataframe object name directly in the sql query.

conn.sql("SELECT * FROM filtered_df WHERE i > 40;")

But, GeoPandas DataFrames are not supported out of the box.

The issue arises due to the Geometry column not being detected by DuckDB. So, converting to WKB or WKT format before loading into DuckDB or GeoPandas is the way to go.

GeoPandas DataFrame from DuckDB

def load_gdf_into_duckdb(conn:duckdb.DuckDBPyConnection, gdf:gpd.GeoDataFrame, duck_table:str) -> str:
    df = gdf.to_wkb()
    
    conn.sql("CREATE TABLE df_table_blob AS (SELECT * FROM df);")
    conn.sql(f"CREATE TABLE {duck_table} AS (SELECT * EXCLUDE(geometry), ST_GeomFromWKB(geometry) AS geom FROM df_table_blob);")

    return duck_table

DuckDB table from GeoPandas DataFrame

def gdf_from_duckdb(conn:duckdb.DuckDBPyConnection, duck_table:str) -> gpd.GeoDataFrame:
    temp_df = conn.sql(f"SELECT * EXCLUDE(geom), ST_AsText(geom) AS geometry FROM {duck_table};").df()

    gdf = gpd.GeoDataFrame(temp_df)
    gdf.geometry = gpd.GeoSeries.from_wkt(temp_df["geometry"])

    return gdf

GeoPandas seems to have some issues loading WKB generated by DuckDB, so WKT is used here.

Happy to see DuckDB and GeoPandas working together! 🦆🐼

Let me know if you have a better way to do this.

Cheers! 🍻