Interoperability between DuckDB and GeoPandas
Introduction
DuckDB provides easy integration with Pandas, but what about GeoPandas? In this post, we will explore how to use DuckDB with GeoPandas and vice-versa for geospatial data processing.
Pandas DataFrame from DuckDB
DuckDB Provides simple to use to_df()
and df()
methods to convert DuckDB tables to Pandas DataFrames.
import duckdb
conn = duckdb.connect()
conn.sql("""CREATE TABLE t1 AS
SELECT 42 AS i, 84 AS j;""")
df = conn.table("t1").to_df()
# you can also query before converting to DataFrame
filtered_df = conn.sql("SELECT * FROM t1 WHERE i > 40;").df()
DuckDB from Pandas DataFrame
This is much more straightforward, you can simply use the dataframe object name directly in the sql query.
conn.sql("SELECT * FROM filtered_df WHERE i > 40;")
But, GeoPandas DataFrames are not supported out of the box.
The issue arises due to the Geometry column not being detected by DuckDB. So, converting to WKB or WKT format before loading into DuckDB or GeoPandas is the way to go.
GeoPandas DataFrame from DuckDB
def load_gdf_into_duckdb(conn:duckdb.DuckDBPyConnection, gdf:gpd.GeoDataFrame, duck_table:str) -> str:
df = gdf.to_wkb()
conn.sql("CREATE TABLE df_table_blob AS (SELECT * FROM df);")
conn.sql(f"CREATE TABLE {duck_table} AS (SELECT * EXCLUDE(geometry), ST_GeomFromWKB(geometry) AS geom FROM df_table_blob);")
return duck_table
DuckDB table from GeoPandas DataFrame
def gdf_from_duckdb(conn:duckdb.DuckDBPyConnection, duck_table:str) -> gpd.GeoDataFrame:
temp_df = conn.sql(f"SELECT * EXCLUDE(geom), ST_AsText(geom) AS geometry FROM {duck_table};").df()
gdf = gpd.GeoDataFrame(temp_df)
gdf.geometry = gpd.GeoSeries.from_wkt(temp_df["geometry"])
return gdf
GeoPandas seems to have some issues loading WKB generated by DuckDB, so WKT is used here.
Happy to see DuckDB and GeoPandas working together! 🦆🐼
Let me know if you have a better way to do this.
Cheers! 🍻