DuckDB ↔ Arrow Compatibility: A Status Page
A working status page for the Arrow-related issues and PRs I've filed against duckdb/duckdb — the UNION saga, schema/data disagreements in nested appenders, type fidelity through round-trips, and the bigger question of how defensive the C Data Interface should be.
Most of my work at Query.Farm lives at the boundary between DuckDB and the Apache Arrow ecosystem — the Airport extension for Arrow Flight, GeoArrow round-trips, ENUM and extension-type interop, and a long tail of edge cases in the Arrow C Data Interface. That work has produced a fair number of issues and PRs against duckdb/duckdb over the past few months.
This post is not a complaint. DuckDB’s Arrow surface is broad, the team has been responsive, and most of the simple correctness fixes were merged within days. What I want to do here is give the community — and anyone else building on the DuckDB ↔ Arrow boundary — a single page they can scan to see the state of things. Think of it as an unofficial status page, and a snapshot of what I learned along the way.
How this list happened
A lot of these issues come from one weekend in early April when I sat down with the Arrow C Data Interface code in DuckDB and read it carefully — partly to understand it for my own work on Airport, partly because I’d been hitting the kind of “wrong answer, no error” symptoms that suggest something deeper than a single bug. The result was a cluster of issues filed on April 5, plus a handful before and after as I worked through specific code paths.
The pattern that emerged: the well-traveled Arrow paths in DuckDB are solid. But several of the less-used types — UNION, Run-End Encoded, dictionary-with-nulls, sliced arrays, GeoArrow CRS metadata, ENUM round-trips, extension types inside containers — had silent-correctness bugs of the same general shape. Schema and data agree as long as you stay on the hot path; they drift apart at the corners.
The UNION canary
If there’s one type that pulled the most bugs out of the codebase, it’s UNION. UNION is the canary because it’s complex (sparse vs. dense layouts, type-id-to-child mapping, per-row tags) and underused (DuckDB-to-DuckDB Arrow flows always emit identity mappings, so DuckDB’s own integration tests rarely exercise the edge cases). The combination is exactly what you’d expect to harbor latent bugs — and it does.
Four distinct issues fell out:
- #21846 (fixed in #21848) — UNION `type_ids` buffer ignored `chunk_offset`. Every other type in `ColumnArrowToDuckDB` used `GetEffectiveOffset()` to compute the correct read position. UNION read from position 0 every pass. A 4096-row Arrow chunk got scanned in two 2048-row passes; the second pass read the wrong type tag for every row but the correct child value, so each row got the wrong union variant — silently. Same problem for sliced arrays with `array.offset != 0`.
- #21850 (fixed in #21851) — UNION appender wrote out-of-bounds when `from > 0`. Triggered when `ArrowUtil::TryFetchChunk` resumed a partially consumed DuckDB chunk. Temporary child vectors were sized `to - from` but indexed with the raw `input_idx` starting at `from`. Every other appender handled this correctly with 0-based indexing; UNION was the lone exception, writing past the end of the heap allocation.
- #21842 (PR #21843, in review) — UNION format string type-id mapping was parsed but ignored. The Arrow spec says a format string like `+us:5,7,9` maps type ID 5 to child 0, 7 to child 1, 9 to child 2. Why non-identity IDs? Typically because variants were added and removed over time, and surviving variants keep their original IDs so existing data stays readable without rewriting every row. DuckDB's own writer always emits identity mappings (`+us:0,1,2,...`), so DuckDB-to-DuckDB round-trips were fine — but ingesting unions from PyArrow, Arrow C++, DataFusion, or Polars would either crash with "tag out of range" or silently assign values to the wrong variant if the IDs happened to be valid indices in the wrong order. The parsing code even had a TODO acknowledging it: `// TODO: what are these type ids actually for?` (A minimal repro follows this list.)
- #22444 (PR #22445, in review) — BOOLEAN child of UNION written bit-packed but exported as `arrow.bool8` (byte-packed). This is the one I find most instructive, and I'll come back to it below — the symptom was a UNION bug, but the root cause turned out to be a class of bugs across every container appender.
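To make the type-id mapping concrete, here's a minimal repro sketch for #21842 from the producer's side: a PyArrow sparse union whose type codes are deliberately non-identity. The values are made up; the only thing that matters is `type_codes=[5, 7]`.

```python
# Sparse union with non-identity type codes: 5 selects child 0 ("i"),
# 7 selects child 1 ("s"). A reader that treats the codes as raw child
# indices will reject or misroute every row.
import duckdb
import pyarrow as pa

type_ids = pa.array([5, 7, 5, 7], type=pa.int8())
ints = pa.array([1, 0, 3, 0], type=pa.int32())         # rows 0 and 2
strs = pa.array(["", "b", "", "d"], type=pa.string())  # rows 1 and 3
union = pa.UnionArray.from_sparse(
    type_ids, [ints, strs], field_names=["i", "s"], type_codes=[5, 7]
)
tbl = pa.table({"u": union})

# On builds without the #21843 fix this errors ("tag out of range") or
# silently picks the wrong variant; fixed builds route each row to the
# right child.
print(duckdb.sql("SELECT u FROM tbl").fetchall())
```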
While I was in there, I also opened #21898 — Arrow dense union read/write support. DuckDB only spoke sparse unions on the wire. Dense unions (+ud:) are a first-class part of the Arrow spec and are what other engines emit by default for skewed data. The PR adds read support and an opt-in arrow_output_dense_union write setting. (It also folds in the type-id mapping fix, since dense unions need it anyway.)
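For contrast, here's the shape of a dense union from the producer side, sketched in PyArrow with illustrative values. Each row carries a type code plus an offset into that variant's ragged child, so skewed data doesn't pay for unused slots; once #21898 lands, the opt-in `SET arrow_output_dense_union = true` asks DuckDB to emit this layout too.

```python
# Dense union: per-row type codes plus per-row offsets into ragged
# children. Child "i" holds only the three int rows; child "s" holds
# only the one string row.
import pyarrow as pa

type_ids = pa.array([0, 0, 1, 0], type=pa.int8())
offsets = pa.array([0, 1, 0, 2], type=pa.int32())
ints = pa.array([10, 20, 30], type=pa.int32())
strs = pa.array(["x"], type=pa.string())
dense = pa.UnionArray.from_dense(
    type_ids, offsets, [ints, strs], field_names=["i", "s"]
)
print(dense.to_pylist())  # [10, 20, 'x', 30]
```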
That’s four distinct correctness bugs and one missing feature — all in one type. It’s a good demonstration that “DuckDB supports Arrow unions” was true in the marketing-bullet sense and only mostly true in practice.
Schema and data disagree: a class of nested-appender bugs
The most recent UNION issue — #22444 — looked like another small UNION fix when I opened it. A BOOLEAN child of a UNION, with arrow_lossless_conversion = true, came out corrupted: [True, True, True, True] going in, [True, False, False, False] coming out. The schema declared the child as arrow.bool8 (byte-packed, 1 byte per row), but the appender wrote it bit-packed (1 bit per row). Row 0 happened to look right by accident because the bit-packed byte 0xff reads back as True byte-wise too; everything after row 0 read zero bytes and came out False.
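The repro fits in a few lines of Python, assuming a build without the #22445 fix: `union_value(b := true)` builds a single-member UNION, and the lossless-conversion setting is what routes the child through arrow.bool8.

```python
# BOOLEAN child of a UNION under lossless conversion: the schema says
# arrow.bool8 (one byte per row), but the appender writes bits.
import duckdb

con = duckdb.connect()
con.execute("SET arrow_lossless_conversion = true")
tbl = con.sql("SELECT union_value(b := true) AS u FROM range(4)").arrow()

# Expected: [True, True, True, True]
# Affected builds: [True, False, False, False], since only row 0's byte
# overlaps the bit-packed data.
print(tbl.column("u").to_pylist())
```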
Tracing it, though, the bug wasn’t really about UNION. The top-level ArrowAppender constructor handled extension types correctly: when arrow_lossless_conversion = true declared a column as arrow.bool8, the data path routed values through the extension’s duckdb_to_arrow callback and wrote bytes. But every nested container appender — STRUCT, LIST, FIXED_SIZE_LIST, MAP, UNION — called ArrowAppender::InitializeChild without the extension info, so children fell back to the plain logical-type appender. The schema kept saying arrow.bool8; the data kept being bit-packed. UNION was just the case I happened to find first.
The fix in #22445 centralizes extension handling so it propagates to every nesting level. While writing tests for it I also caught a related out-of-bounds write in LIST(BOOLEAN) columns whose total element count exceeds STANDARD_VECTOR_SIZE — the centralized fix exposed it, and the tests now cover it.
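The list case is easy to state as input: any LIST(BOOLEAN) chunk whose flattened element count exceeds one vector's worth. A sketch of the kind of input that exercises the path, assuming a standard build where STANDARD_VECTOR_SIZE is 2048 and the lossless-conversion bool8 path is on:

```python
# 1024 list rows x 3 booleans = 3072 flattened child values in a single
# chunk, more than STANDARD_VECTOR_SIZE (2048), so the child appender
# has to grow past its initial allocation.
import duckdb

con = duckdb.connect()
con.execute("SET arrow_lossless_conversion = true")
tbl = con.sql("SELECT [true, false, true] AS l FROM range(1024)").arrow()
print(len(tbl), tbl.column("l").type)
```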
The shape of this bug is the same as several others I’ve reported: schema and data agree on the hot path, then drift apart at the corners. That mental model — “if schema and data are computed by different code paths, they will eventually disagree” — is the single best heuristic I’ve found for predicting where the next Arrow bug is hiding.
Type fidelity through round-trips
A separate category of issues isn’t about correctness inside a single read or write — it’s about types surviving a round-trip. DuckDB → Arrow → DuckDB should give you back what you put in. Often it does. Several specific cases don’t, yet:
- #21084 (PR #21087) — ENUM loses identity through Arrow. ENUM exports as a plain dictionary with no `ARROW:extension:name` metadata, so it comes back as `VARCHAR`. Other DuckDB-specific types (HUGEINT, UUID, TIME_TZ, BIT, JSON) are correctly annotated with extension metadata under `arrow_lossless_conversion = true`; ENUM is the exception. The PR adds an `arrow.duckdb.enum` extension that preserves the ENUM identity with zero-copy index reads on the way back in. (A repro sketch follows this list.)
- #13947 — Nullability of Arrow fields not respected. A column registered as non-nullable via PyArrow shows up as nullable in `duckdb_columns()` and on the way back out. This one's been open since 2024.
- #22082 — Tensor-lite + `arrow.fixed_shape_tensor` interop. Builds on DuckDB's existing nested ARRAY type to add `FLOAT[2,3]` syntax sugar, basic linear-algebra functions, and round-trips with the standard Arrow `arrow.fixed_shape_tensor` extension (including the optional `permutation` field for column-major data on import). Not a bug fix — a fidelity feature for the ML side of the Arrow ecosystem.
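The ENUM gap is easy to see from Python. A repro sketch, assuming a build without the #21087 fix and lossless conversion enabled:

```python
# ENUM out, VARCHAR back: the exported dictionary field carries no
# ARROW:extension:name, so the round-trip loses the type.
import duckdb

con = duckdb.connect()
con.execute("SET arrow_lossless_conversion = true")
con.execute("CREATE TYPE mood AS ENUM ('happy', 'sad')")
tbl = con.sql("SELECT 'happy'::mood AS m").arrow()

print(tbl.schema.field("m").metadata)                   # no extension metadata
print(con.sql("SELECT typeof(m) FROM tbl").fetchone())  # VARCHAR, not mood
```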
The thread that ties these together: Arrow’s extension-type mechanism is the right tool for preserving DuckDB-specific semantics through interop, and DuckDB uses it for some types but not yet for all of them. Each of these is the same shape of fix.
GeoArrow CRS: a small but sharp safety bug
#21853 (PR #21854, in review) is a double free in ArrowGeometry::WriteCRS. The function called yyjson_mut_doc_free(doc) on a yyjson_mut_doc* it didn’t own — the caller (PopulateSchema) already owned and freed it. Trigger: a malformed PROJJSON CRS string. Outcome: heap corruption.
This is a small bug, but it's an interesting one because GeoArrow CRS data often comes in from external producers (PostGIS exports, GDAL, other tools), and a malformed PROJJSON from a third-party data source isn't an exotic scenario. There's also a memory leak in a separate error path in the same function, which the PR fixes too.
The meta-observation
#21849 is less of a bug report and more of a design conversation. While reviewing the Arrow interface code I noticed it’s largely written from the standpoint of trusting the producer:
- `ArrowBufferData()` — no bounds check, no null check
- `schema.children[0]` dereferenced without checking `n_children`
- String view `buffer_index` only validated with `D_ASSERT` (compiled out in release)
- Schema metadata lengths not validated (a negative `int32` wraps to a huge `uint64`)
- Structural invariants checked with `D_ASSERT`
DuckDB’s Parquet reader is the opposite: defensively coded against malicious or malformed input, because Parquet files frequently come from untrusted sources. The Arrow side has historically been treated as a higher-trust boundary — and for first-party producers (DuckDB → DuckDB, in-process PyArrow), that’s a reasonable default. But Arrow is increasingly the wire format for cross-process and cross-organization data: Arrow Flight endpoints, third-party producers in Rust, Java, Go, browser JS via Arrow JS. Once that data crosses a trust boundary, the Parquet-level discipline starts to make sense for Arrow too.
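You don't have to wait for that to protect an ingest path you control: PyArrow can fully validate a table before it crosses into DuckDB. A stopgap sketch; it only covers data you route through PyArrow, and it is not a substitute for boundary validation in DuckDB itself.

```python
# Table.validate(full=True) walks buffers, offsets, and lengths (not
# just metadata) and raises pyarrow.lib.ArrowInvalid on malformed input.
import pyarrow as pa

def checked(table: pa.Table) -> pa.Table:
    table.validate(full=True)  # O(n), so gate it on untrusted sources
    return table

# usage: duckdb.sql("FROM t") where t = checked(untrusted_table)
```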
This isn’t a one-line patch and it isn’t urgent in the same way a silent-corruption bug is. But several of the bugs higher up in this post — the UNION type-id range issue, the array-offset issue, the GeoArrow double free — are exactly the kind of thing a stricter validation pass would catch at the boundary, before it became a wrong answer or a heap corruption.
What’s where, as of today
A condensed view, since “status page” should mean something:
Fixed and merged — on recent nightlies / next release:
- #21082 / #21083 — buffer overread in Arrow dictionary scan with NULLs
- #21844–#21845 / #21847 — REE INT64 run_ends silently corrupted
- #21846 / #21848 — UNION `type_ids` ignored `chunk_offset`
- #21850 / #21851 — UNION appender out-of-bounds when `from > 0`
- #21852 — bugs in `ArrowQueryResult`
- #21855, #21856 — dead-code cleanup in Arrow stream/batch wrappers
- #18002 / #18005 — `ArrowBool8` validity check
- #14003 — `ArrowArrayWrapper::GetNextChunk()` made virtual
In review (PRs open):
- #21843 — UNION type-id-to-child-index mapping
- #21854 — GeoArrow CRS double free + leak
- #21087 — ENUM extension metadata for round-trips
- #21898 — dense union read/write
- #22082 — tensor-lite + `arrow.fixed_shape_tensor`
- #22445 — extension propagation in nested appenders
Open issues without a PR:
- #21849 — Arrow interface validation vs. Parquet (design conversation)
- #13947 — Python: nullability of Arrow fields not respected
- #17990 — extension-source metadata across DuckDB versions (so tools can know which release ships `arrow` in core vs. community)
Closing thoughts
A few honest observations:
- The hot path is solid. None of these bugs would be hit by someone reading a Parquet file into DuckDB, doing a `GROUP BY`, and exporting to Arrow. They live in the corners — UNION, REE, dictionary-with-nulls, sliced arrays, GeoArrow CRS, ENUM round-trips, extension types inside containers.
- Silent corruption is the recurring shape. Several of the merged fixes produced wrong answers without crashing. That's the worst kind of bug for an analytical engine, and it's the strongest argument for #21849: the Arrow side should validate inputs at the same level the Parquet side does.
- The team has been responsive. Most of the simple fixes merged within a week. The harder PRs (dense unions, ENUM round-trip, nested-appender extension propagation, tensor-lite) need more review time, which is fair — they touch shared infrastructure.
- Cross-engine Arrow interop is the hardest part. A non-trivial fraction of these only surface when DuckDB consumes Arrow from a producer that isn’t DuckDB. If you write Arrow producers in a language that isn’t C++, you’re more likely to hit them. This is exactly the demographic that benefits most from DuckDB being Arrow-native, so it’s worth investing in.
If you’re building on DuckDB’s Arrow surface and run into something not on this list, please file it. The fastest path to a fix is a minimal repro plus a pointer at the relevant function. And if you’re hitting one of the open items above, a 👍 on the issue helps prioritize.
I’ll keep this post updated as the open items move. The intent is for it to get shorter over time, not longer.