GeoSilo vs GeoArrow + Byte Stream Split: Two Approaches to Geometry Compression

A detailed comparison of GeoSilo's delta-integer encoding against GeoArrow's columnar float64 layout with Parquet BYTE_STREAM_SPLIT, covering compression ratios, precision tradeoffs, and when each approach wins.

Geometry data in DuckDB and GeoParquet is typically stored as WKB — raw IEEE 754 float64 coordinate pairs. This is simple and lossless, but wastes space because adjacent vertices in a polygon share nearly identical coordinate values. Two very different strategies have emerged for fixing this: GeoArrow with Parquet’s BYTE_STREAM_SPLIT encoding, and GeoSilo, a DuckDB extension I built at Query.Farm.

This post also covers how both relate to TWKB (Tiny Well-Known Binary), the prior art in compact geometry encoding.

How each approach works

GeoArrow + BYTE_STREAM_SPLIT + ZSTD

This is a columnar, lossless pipeline:

  1. GeoArrow flattens all vertices from all geometries into contiguous double[] arrays — one for X, one for Y (the “separated” layout mandated by GeoParquet).
  2. BYTE_STREAM_SPLIT transposes each array by byte position: all byte-0s together, all byte-1s together, creating 8 separate byte streams per coordinate column.
  3. Spatially clustered coordinates share high-order bytes, so streams 6–7 (sign + exponent) become highly repetitive.
  4. ZSTD compresses each byte stream efficiently.

The key insight is that IEEE 754 float64 values for nearby geographic coordinates differ mostly in their low-order mantissa bytes. Separating the bytes lets ZSTD apply different compression models to the stable high bytes and the noisier low bytes.
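The byte transpose in steps 1–3 can be sketched in a few lines of Python. The longitudes below are hypothetical nearby values; this illustrates the transform itself, not Parquet's actual implementation:

```python
import struct

coords = [-75.5094913, -75.5094921, -75.5094935, -75.5094948]

# Pack each float64 into its 8 little-endian bytes.
raw = [struct.pack("<d", c) for c in coords]

# Transpose by byte position: stream i holds byte i of every value.
streams = [bytes(v[i] for v in raw) for i in range(8)]

# For clustered coordinates the high-order stream (sign + exponent,
# byte 7 in little-endian order) is a constant run -- trivial for ZSTD.
assert streams[7] == bytes([streams[7][0]]) * len(coords)
```

ZSTD then compresses each of the 8 streams, so the constant high-order streams collapse to almost nothing while only the noisy low-mantissa streams carry real entropy.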

GeoSilo + ZSTD

This is a blob-per-row, lossy pipeline:

  1. Scale coordinates to integers (e.g., −75.509491 × 1e7 = −755094910).
  2. Store the first vertex per ring as absolute int32 (4 bytes per dimension).
  3. Store subsequent vertices as int16 deltas from the previous value (2 bytes per dimension).
  4. ~90% of deltas fit in int16; the remaining ~10% use a 6-byte escape (sentinel + int32).
  5. The fixed-width int16 deltas create highly repetitive byte patterns that ZSTD compresses further.

The data shrinks before the codec ever sees it — from 16 bytes per XY pair down to 4 bytes for ~90% of vertices.
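The steps above can be sketched as follows. This is a minimal illustration assuming the 1e7 scale and a hypothetical sentinel value; GeoSilo's actual wire format may differ in layout details:

```python
import struct

SCALE = 10_000_000
ESCAPE = -32768  # hypothetical sentinel; usable deltas are -32767..32767

def encode_ring(coords):
    """Encode one ring of (lon, lat) pairs: absolute int32 for the first
    vertex, then int16 deltas with a 6-byte escape for large jumps."""
    out = bytearray()
    prev = None
    for lon, lat in coords:
        q = (round(lon * SCALE), round(lat * SCALE))
        if prev is None:
            out += struct.pack("<ii", *q)               # absolute int32 x, y
        else:
            for d in (q[0] - prev[0], q[1] - prev[1]):
                if -32767 <= d <= 32767:
                    out += struct.pack("<h", d)         # 2-byte delta
                else:
                    out += struct.pack("<hi", ESCAPE, d)  # sentinel + int32
        prev = q
    return bytes(out)

ring = [(-75.509491, 40.001234), (-75.509502, 40.001239), (-75.509488, 40.001251)]
blob = encode_ring(ring)
# 8 bytes for the first vertex + 4 bytes for each of the two delta pairs
assert len(blob) == 8 + 2 * 4
```

Three vertices that would cost 48 bytes as raw float64 pairs fit in 16 bytes before ZSTD ever runs.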

Head-to-head comparison

| | GeoArrow + BSS + ZSTD | GeoSilo + ZSTD |
| --- | --- | --- |
| Compression vs WKB | ~0.5–0.6× | ~0.26–0.28× |
| Precision | Lossless float64 | Lossy (~1 cm at 1e7 scale) |
| Pre-compression size | Same as input (BSS is a transpose, not a reduction) | ~0.30× of input |
| Geometry type constraint | Homogeneous per column (all Polygon, all LineString, etc.) or fall back to WKB | Any type per blob, including GeometryCollection |
| Decode cost | Zero-copy — coordinates are directly addressable | ~3 GB/s single-threaded (int16 deltas → float64) |
| Random vertex access | O(1) via offset arrays | Must scan from ring start |
| Spatial locality requirement | Depends on row-group ordering | Exploits intra-geometry adjacency (always local) |
| Ecosystem | Native to Parquet/Arrow | Requires GeoSilo DuckDB extension |

The GeoArrow + BSS compression numbers come from the PARQUET-2414 benchmarks on Belgium OSM geographic data, which showed ~1.7–2.1× compression for float64 lat/lon with BSS + ZSTD vs plain ZSTD alone. The GeoSilo numbers come from benchmarking against US Census TIGER/Line 2025 data.

Why GeoSilo compresses ~2× better

The fundamental difference is where delta encoding happens.

BSS exploits inter-geometry byte similarity: coordinates in the same row group share high-order bytes because they’re in the same geographic region. But this signal depends entirely on how the data is ordered. Scattered global data with no spatial clustering sees minimal benefit.

GeoSilo exploits intra-geometry vertex adjacency: consecutive vertices in a polygon ring differ by tiny amounts by definition. This signal is structural, not dependent on row ordering. A polygon in Tokyo and a polygon in New York both have small vertex-to-vertex deltas — the absolute position doesn’t matter, only the relative movement.

This also explains why GeoSilo reduces data width before compression. Each XY coordinate pair goes from 16 bytes (two float64s) to 4 bytes (two int16 deltas) for the ~90% common case. BSS keeps the data at 8 bytes per coordinate and relies entirely on ZSTD to find patterns in the transposed byte streams.

Is lossy encoding actually a problem?

GeoSilo’s compression advantage comes from scaling float64 coordinates to integers — a lossy step. The word “lossy” sounds alarming, but in practice the precision loss is well below the noise floor of the source data.

What the numbers actually mean

At the default scale of 1e7, GeoSilo preserves coordinates to 7 decimal places in degrees. That’s a grid spacing of about 1.1 cm at the equator (0.0000001° × 111,320 m/°). For projected coordinate systems in meters, the default scale of 100 gives 1 cm precision.
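The grid-spacing arithmetic checks out directly:

```python
# Quantization step at scale 1e7, converted to meters at the equator.
SCALE = 1e7
METERS_PER_DEGREE = 111_320  # approximate length of one degree at the equator

step_deg = 1 / SCALE                   # 0.0000001 degrees
step_m = step_deg * METERS_PER_DEGREE  # ~0.0111 m, i.e. ~1.1 cm
assert abs(step_m - 0.011132) < 1e-9

# Worst-case rounding error is half a grid step: ~0.56 cm.
max_err_cm = step_m / 2 * 100
assert 0.55 < max_err_cm < 0.57
```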

To put that in context:

| Data source | Typical accuracy | GeoSilo precision |
| --- | --- | --- |
| Consumer GPS (phone) | 3–5 m | 1.1 cm |
| Survey-grade GPS (RTK) | 1–2 cm | 1.1 cm |
| US Census TIGER/Line | 2–5 m | 1.1 cm |
| OpenStreetMap | 1–10 m (varies wildly) | 1.1 cm |
| Satellite imagery (30 cm) | 30 cm | 1.1 cm |
| Cadastral/land survey | 1–10 cm | 1.1 cm |

In every case except the most precise land surveys, the source data has error bars that are 10–500× larger than GeoSilo’s quantization step. The extra decimal places in a float64 coordinate are not precision — they’re noise from coordinate transformations, reprojection artifacts, or simply the way floating-point arithmetic works.

Float64 is already lying to you

A common misconception is that float64 coordinates are “exact.” They aren’t. IEEE 754 float64 gives you about 15–16 significant decimal digits, but geographic coordinates don’t have 16 digits of meaningful precision. Consider:

  • The coordinate -75.5094913482761 has 16 digits, but the last 8–9 are below the accuracy of any terrestrial survey method.
  • Reprojecting a coordinate from one CRS to another introduces rounding at the float64 level — the output is not bit-identical to the “true” value.
  • Spatial operations like ST_Buffer, ST_Intersection, and ST_Simplify all introduce coordinate perturbations far larger than 1 cm.

Preserving all 52 mantissa bits of a float64 geographic coordinate is preserving the format’s representational capacity, not the data’s actual information content.
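To see how little of that capacity carries information, quantize the example coordinate from the list above to the 1e-7° grid and measure the roundtrip error:

```python
x = -75.5094913482761  # 16 digits, far beyond any survey method's accuracy
SCALE = 10_000_000

roundtripped = round(x * SCALE) / SCALE
error_deg = abs(roundtripped - x)

# The roundtrip error is bounded by half a grid step: under 5e-8 degrees,
# well below a centimeter on the ground.
assert error_deg < 5e-8
```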

When lossless genuinely matters

There are legitimate cases where bit-exact float64 roundtripping matters:

  • Computational geometry correctness — algorithms like Shewchuk’s robust predicates depend on exact floating-point arithmetic. If you’re feeding coordinates into a constrained Delaunay triangulation, you want the exact bits back.
  • Legal/regulatory requirements — some jurisdictions mandate that official boundary coordinates must not be modified in transit.
  • Hash-based deduplication — if you’re comparing geometries by hash to detect duplicates, any coordinate change breaks the comparison.
  • Differential updates — if downstream systems diff geometry to detect changes, quantization noise would trigger false positives.

For these cases, a lossless approach would achieve roughly 0.38–0.42× of WKB with ZSTD — about 40–50% larger than GeoSilo’s lossy encoding, but still better than GeoArrow + BSS (~0.5–0.6×). The technique would be similar: delta-encode the raw float64 bit patterns as int64, where adjacent vertices produce small deltas that compress well. Nobody has built this yet, but it’s a natural extension of the same idea.
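A sketch of what that hypothetical lossless variant could look like: reinterpret each float64 as an int64 bit pattern, delta-encode those integers, and restore the original bits exactly with cumulative sums. None of this exists in GeoSilo today; it just demonstrates that the roundtrip is bit-exact:

```python
import struct

def f64_bits_deltas(coords):
    """Reinterpret each float64 as int64 and delta-encode the bit patterns.
    Adjacent vertices share sign, exponent, and high mantissa bits, so the
    deltas are small integers that compress well -- and the original bits
    remain exactly recoverable."""
    bits = [struct.unpack("<q", struct.pack("<d", c))[0] for c in coords]
    return [bits[0]] + [b - a for a, b in zip(bits, bits[1:])]

coords = [-75.5094913, -75.5094921, -75.5094935]
deltas = f64_bits_deltas(coords)

# Exact roundtrip: cumulative sums restore the original bit patterns.
acc, restored = 0, []
for d in deltas:
    acc += d
    restored.append(struct.unpack("<d", struct.pack("<q", acc))[0])
assert restored == coords
```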

The practical tradeoff

For the vast majority of geospatial workloads — visualization, spatial joins, area/distance calculations, geocoding, routing — 1 cm precision is far more than sufficient, and the 2× compression improvement over lossless approaches directly translates to faster I/O, lower storage costs, and smaller network payloads. The “lossy” label is technically accurate but practically misleading: you’re discarding noise, not signal.

Encode/decode performance

Compression ratio is only half the story — if encoding or decoding is slow, the storage savings don’t help. Here’s what GeoSilo’s throughput looks like on US Census TIGER/Line 2025 data (single-threaded, Apple M-series):

| Table | Rows | WKB Size | Encode | Decode |
| --- | --- | --- | --- | --- |
| block_group | 242,748 | 209 MB | 75 ms (2,789 MB/s) | 71 ms (2,946 MB/s) |
| zcta5 | 33,791 | 180 MB | 49 ms (3,673 MB/s) | 47 ms (3,830 MB/s) |
| tract | 85,529 | 125 MB | 41 ms (3,039 MB/s) | 39 ms (3,195 MB/s) |
| county | 3,235 | 25 MB | 7 ms (3,514 MB/s) | 7 ms (3,514 MB/s) |

At ~3 GB/s, GeoSilo encode/decode is not the bottleneck in any realistic pipeline — disk I/O, network transfer, or the spatial computation itself will dominate.

End-to-end spatial operation overhead

The more relevant benchmark is: how much does decode add to actual spatial operations?

| Table | Rows | Raw ST_Area | Decode + ST_Area | Overhead |
| --- | --- | --- | --- | --- |
| block_group | 242,748 | 18 ms | 83 ms | +361% |
| county | 3,235 | 2 ms | 8 ms | +300% |

The overhead looks large in percentage terms, but that’s because ST_Area on DuckDB’s native GEOMETRY format is extremely fast (18 ms for 242k complex polygons). In absolute terms, decoding 209 MB of geometry takes 71 ms — and for I/O-bound workloads, the 3–4× smaller data size more than compensates.

Implementation detail: skipping the WKB roundtrip

A key optimization: DuckDB’s internal GEOMETRY format is WKB. The Geometry::ToBinary() function is literally a no-op pointer reinterpret:

void Geometry::ToBinary(Vector &source, Vector &result, idx_t count) {
    // We are currently using WKB internally, so just copy as-is!
    result.Reinterpret(source);
}

This means GeoSilo can read GEOMETRY bytes directly as WKB on encode, and write WKB bytes directly to the GEOMETRY result on decode — no intermediate serialization step. Combined with raw pointer writes to a pre-sized buffer (eliminating per-coordinate vector::resize overhead), this brings throughput to ~3 GB/s.

How TWKB compares

TWKB (Tiny Well-Known Binary) is the closest prior art to GeoSilo. Both delta-encode scaled integer coordinates, but they make different bets about the compression pipeline:

| | TWKB | GeoSilo |
| --- | --- | --- |
| Delta encoding | ZigZag + varint (variable-length, like protobuf) | Fixed int16 with 6-byte escape |
| Smallest delta | 1 byte | 2 bytes |
| Large delta | Grows gradually (2, 3, 4… bytes) | Jumps to 6 bytes |
| Delta chain | Continuous across rings/parts in Multi* types | Resets per ring |
| Ring closure | Implicit (first point not repeated) | Explicit (preserves WKB convention) |
| Optional metadata | Bounding boxes, sizes, ID lists | None |
| Arrow integration | None | Registers as extension type queryfarm.geosilo |

TWKB minimizes standalone byte size — variable-length encoding means small deltas take fewer bytes. GeoSilo minimizes compressed byte size — fixed-width int16 deltas create more repetitive patterns for ZSTD even though the raw blobs are slightly larger.
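TWKB's variable-length deltas follow the standard ZigZag + varint scheme, which shows why its smallest delta is 1 byte while large ones grow gradually:

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small:
    0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)

def varint(u: int) -> bytes:
    """7 payload bits per byte; the high bit marks continuation."""
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

# A typical small delta costs 1 byte in TWKB vs GeoSilo's fixed 2 bytes;
# a large delta grows to 3 bytes rather than jumping to 6.
assert len(varint(zigzag(-55))) == 1
assert len(varint(zigzag(40_000))) == 3
```

The cost of that byte thrift is irregular alignment: a decoder can't know where delta N starts without reading deltas 0..N-1, and the variable-width stream gives ZSTD fewer repeating patterns to latch onto.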

GeoSilo’s per-ring delta reset also means you can decode any ring independently, which TWKB can’t do since its delta chain is continuous across ring boundaries within a geometry.

When each approach wins

GeoArrow + BSS is the better choice when:

  • Lossless float64 precision is required — surveying, scientific, or regulatory use cases where ~1 cm precision loss is unacceptable.
  • Columnar analytics over coordinates — computing bounding boxes across millions of geometries, aggregating coordinate values, or filtering by coordinate range without decoding entire geometries.
  • Zero-copy access — workloads that benefit from directly addressable coordinates with no decode step.
  • Standard ecosystem — you want any GeoParquet reader to access the data without a special extension.

GeoSilo is the better choice when:

  • Storage or bandwidth is the bottleneck — ~2× better compression makes a real difference at scale.
  • Mixed geometry types — a single column can hold Points, Polygons, and GeometryCollections.
  • Arrow IPC transport — GeoSilo registers as an Arrow extension type, so DuckDB auto-encodes on export and auto-decodes on import. You get compact wire format without changing query code.
  • Row-level spatial operations — if you’re running ST_Area, ST_Intersects, or spatial joins, you’re decoding one geometry at a time anyway. The decode cost of int16 deltas is negligible compared to the spatial computation.

TWKB is the better choice when:

  • No external compression layer — serving individual geometries over HTTP (WFS responses, tile endpoints) where you can’t apply ZSTD to the transport.
  • PostGIS ecosystem — TWKB has native support in PostGIS via ST_AsTWKB.

A theoretical best of both worlds

You could combine GeoSilo’s delta-integer encoding with GeoArrow’s columnar layout — flatten all deltas into contiguous int16 arrays, then apply BSS + ZSTD on those. Research from cbloom confirms that delta + byte-deinterleave beats either approach alone. But nobody has built this yet, and it would require a new Arrow/Parquet encoding that understands geometry structure.
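A rough sketch of that combination on synthetic data, with zlib from the Python standard library standing in for ZSTD (the coordinates and delta distribution are invented for illustration):

```python
import struct
import zlib

# Hypothetical quantized X coordinates (scale 1e7) with small, noisy steps.
xs, v = [], -755094910
for i in range(1000):
    xs.append(v)
    v += 30 + ((i * 2654435761) >> 16) % 90  # deterministic pseudo-random delta

# Step 1: delta-encode into int16 (GeoSilo's trick).
deltas = [b - a for a, b in zip(xs, xs[1:])]
body = struct.pack(f"<{len(deltas)}h", *deltas)

# Step 2: byte-deinterleave the int16 stream (BSS's trick):
# all low bytes together, then all high bytes (all zero here).
split = body[0::2] + body[1::2]

# Grouping the constant high bytes into one long run helps the codec.
assert len(zlib.compress(split, 9)) <= len(zlib.compress(body, 9))
```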

For now, the practical choice is between lossless columnar convenience (GeoArrow + BSS) and maximum compression with DuckDB integration (GeoSilo). Both are significant improvements over raw WKB.


GeoSilo is open source and available as a DuckDB community extension:

INSTALL geosilo FROM community;
LOAD geosilo;

Source: github.com/Query-farm/geosilo

#GeoSilo #GeoArrow #DuckDB #Parquet #Compression #Geospatial #Arrow #Query.Farm
