GeoSilo vs GeoArrow + Byte Stream Split: Two Approaches to Geometry Compression
A detailed comparison of GeoSilo's delta-integer encoding against GeoArrow's columnar float64 layout with Parquet BYTE_STREAM_SPLIT, covering compression ratios, precision tradeoffs, and when each approach wins.
Geometry data in DuckDB and GeoParquet is typically stored as WKB — raw IEEE 754 float64 coordinate pairs. This is simple and lossless, but wastes space because adjacent vertices in a polygon share nearly identical coordinate values. Two very different strategies have emerged for fixing this: GeoArrow with Parquet’s BYTE_STREAM_SPLIT encoding, and GeoSilo, a DuckDB extension I built at Query.Farm.
This post also covers how both relate to TWKB (Tiny Well-Known Binary), the prior art in compact geometry encoding.
How each approach works
GeoArrow + BYTE_STREAM_SPLIT + ZSTD
This is a columnar, lossless pipeline:
- GeoArrow flattens all vertices from all geometries into contiguous `double[]` arrays — one for X, one for Y (the “separated” layout mandated by GeoParquet).
- BYTE_STREAM_SPLIT transposes each array by byte position: all byte-0s together, all byte-1s together, creating 8 separate byte streams per coordinate column.
- Spatially clustered coordinates share high-order bytes, so streams 6–7 (sign + exponent) become highly repetitive.
- ZSTD compresses each byte stream efficiently.
The key insight is that IEEE 754 float64 values for nearby geographic coordinates differ mostly in their low-order mantissa bytes. Separating the bytes lets ZSTD apply different compression models to the stable high bytes and the noisier low bytes.
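The transpose is easy to sketch. The following Python snippet is illustrative only — real BYTE_STREAM_SPLIT lives inside the Parquet encoder, and zlib stands in for ZSTD here — but it shows the high-order streams collapsing to a handful of distinct bytes while the low mantissa streams stay noisy:

```python
import struct
import zlib

def byte_stream_split(values):
    """Transpose float64 values by byte position: stream k holds byte k of
    every value (little-endian, so streams 6-7 carry the sign/exponent)."""
    raw = [struct.pack("<d", v) for v in values]
    return [bytes(b[k] for b in raw) for k in range(8)]

# Longitudes ~1 m apart, as in a spatially clustered row group.
lons = [-75.509491 + i * 1e-5 for i in range(1000)]
streams = byte_stream_split(lons)

# Distinct byte values per stream: high streams are nearly constant.
print([len(set(s)) for s in streams])

# zlib (standing in for ZSTD) does better on the transposed layout.
plain = zlib.compress(b"".join(struct.pack("<d", v) for v in lons), 9)
split = zlib.compress(b"".join(streams), 9)
print(len(plain), len(split))
```

The point is not the exact ratio — that depends on the codec and the data — but that the transform costs nothing and hands the codec streams with very different statistics.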
GeoSilo + ZSTD
This is a blob-per-row, lossy pipeline:
- Scale coordinates to integers (e.g., −75.509491 × 1e7 = −755094910).
- Store the first vertex per ring as absolute int32 (4 bytes per dimension).
- Store subsequent vertices as int16 deltas from the previous value (2 bytes per dimension).
- ~90% of deltas fit in int16; the remaining ~10% use a 6-byte escape (sentinel + int32).
- The fixed-width int16 deltas create highly repetitive byte patterns that ZSTD compresses further.
The data shrinks before the codec ever sees it — from 16 bytes per XY pair down to 4 bytes for ~90% of vertices.
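The steps above can be sketched in a few lines of Python. This is an illustrative model of the scheme, not GeoSilo's actual wire format — in particular the escape sentinel value and byte order are assumptions:

```python
import struct

SCALE = 10_000_000   # 1e7: ~1.1 cm grid at the equator
ESCAPE = -32768      # assumed sentinel marking a delta too large for int16

def encode_ring(coords):
    """Delta-encode one ring: absolute int32 first vertex, int16 deltas after,
    with a 6-byte escape (sentinel + int32) for the rare oversized delta."""
    out = bytearray()
    prev = None
    for x, y in coords:
        ix, iy = round(x * SCALE), round(y * SCALE)
        if prev is None:
            out += struct.pack("<ii", ix, iy)
        else:
            for d in (ix - prev[0], iy - prev[1]):
                if -32767 <= d <= 32767:
                    out += struct.pack("<h", d)
                else:
                    out += struct.pack("<hi", ESCAPE, d)
        prev = (ix, iy)
    return bytes(out)

def decode_ring(buf, n_vertices):
    """Inverse transform: accumulate deltas and divide back out the scale."""
    ix, iy = struct.unpack_from("<ii", buf, 0)
    offset = 8
    points = [(ix / SCALE, iy / SCALE)]
    for _ in range(n_vertices - 1):
        step = []
        for _ in range(2):
            (d,) = struct.unpack_from("<h", buf, offset)
            offset += 2
            if d == ESCAPE:
                (d,) = struct.unpack_from("<i", buf, offset)
                offset += 4
            step.append(d)
        ix += step[0]
        iy += step[1]
        points.append((ix / SCALE, iy / SCALE))
    return points

ring = [(-75.509491, 39.741234), (-75.509300, 39.741500), (-75.509100, 39.741100)]
blob = encode_ring(ring)
print(len(blob), "bytes vs", 16 * len(ring), "for raw float64 pairs")  # 16 vs 48
```

Even before ZSTD sees it, the three-vertex ring shrinks 3×, and every delta is a fixed-width little-endian int16 — exactly the kind of aligned, repetitive pattern general-purpose codecs exploit.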
Head-to-head comparison
| | GeoArrow + BSS + ZSTD | GeoSilo + ZSTD |
|---|---|---|
| Compression vs WKB | ~0.5–0.6× | ~0.26–0.28× |
| Precision | Lossless float64 | Lossy (~1 cm at 1e7 scale) |
| Pre-compression size | Same as input (BSS is a transpose, not a reduction) | ~0.30× of input |
| Geometry type constraint | Homogeneous per column (all Polygon, all LineString, etc.) or fall back to WKB | Any type per blob, including GeometryCollection |
| Decode cost | Zero-copy — coordinates are directly addressable | ~3 GB/s single-threaded (int16 deltas → float64) |
| Random vertex access | O(1) via offset arrays | Must scan from ring start |
| Spatial locality requirement | Depends on row-group ordering | Exploits intra-geometry adjacency (always local) |
| Ecosystem | Native to Parquet/Arrow | Requires GeoSilo DuckDB extension |
The GeoArrow + BSS compression numbers come from the PARQUET-2414 benchmarks on Belgium OSM geographic data, which showed ~1.7–2.1× compression for float64 lat/lon with BSS + ZSTD vs plain ZSTD alone. The GeoSilo numbers come from benchmarking against US Census TIGER/Line 2025 data.
Why GeoSilo compresses ~2× better
The fundamental difference is where delta encoding happens.
BSS exploits inter-geometry byte similarity: coordinates in the same row group share high-order bytes because they’re in the same geographic region. But this signal depends entirely on how the data is ordered. Scattered global data with no spatial clustering sees minimal benefit.
GeoSilo exploits intra-geometry vertex adjacency: consecutive vertices in a polygon ring differ by tiny amounts by definition. This signal is structural, not dependent on row ordering. A polygon in Tokyo and a polygon in New York both have small vertex-to-vertex deltas — the absolute position doesn’t matter, only the relative movement.
This also explains why GeoSilo reduces data width before compression. Each XY coordinate pair goes from 16 bytes (two float64s) to 4 bytes (two int16 deltas) for the ~90% common case. BSS keeps the data at 8 bytes per coordinate and relies entirely on ZSTD to find patterns in the transposed byte streams.
Is lossy encoding actually a problem?
GeoSilo’s compression advantage comes from scaling float64 coordinates to integers — a lossy step. The word “lossy” sounds alarming, but in practice the precision loss is well below the noise floor of the source data.
What the numbers actually mean
At the default scale of 1e7, GeoSilo preserves coordinates to 7 decimal places in degrees. That’s a grid spacing of about 1.1 cm at the equator (0.0000001° × 111,320 m/°). For projected coordinate systems in meters, the default scale of 100 gives 1 cm precision.
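The arithmetic is worth spelling out — a back-of-the-envelope check using the common ~111.32 km-per-degree-of-latitude figure:

```python
deg_step = 1e-7                 # grid spacing at the default 1e7 scale
meters_per_degree = 111_320     # one degree of latitude, approximately
grid_m = deg_step * meters_per_degree
half_step_m = grid_m / 2        # worst-case rounding error is half a grid step
print(f"grid: {grid_m * 100:.2f} cm, max error: {half_step_m * 100:.3f} cm")
```

So the quantization grid is ~1.11 cm, and the maximum error introduced by rounding any single coordinate is about half that.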
To put that in context:
| Data source | Typical accuracy | GeoSilo precision |
|---|---|---|
| Consumer GPS (phone) | 3–5 m | 1.1 cm |
| Survey-grade GPS (RTK) | 1–2 cm | 1.1 cm |
| US Census TIGER/Line | 2–5 m | 1.1 cm |
| OpenStreetMap | 1–10 m (varies wildly) | 1.1 cm |
| Satellite imagery (30 cm) | 30 cm | 1.1 cm |
| Cadastral/land survey | 1–10 cm | 1.1 cm |
In every case except the most precise land surveys, the source data has error bars that are 10–500× larger than GeoSilo’s quantization step. The extra decimal places in a float64 coordinate are not precision — they’re noise from coordinate transformations, reprojection artifacts, or simply the way floating-point arithmetic works.
Float64 is already lying to you
A common misconception is that float64 coordinates are “exact.” They aren’t. IEEE 754 float64 gives you about 15–16 significant decimal digits, but geographic coordinates don’t have 16 digits of meaningful precision. Consider:
- The coordinate `-75.5094913482761` has 16 digits, but the last 8–9 are below the accuracy of any terrestrial survey method.
- Reprojecting a coordinate from one CRS to another introduces rounding at the float64 level — the output is not bit-identical to the “true” value.
- Spatial operations like `ST_Buffer`, `ST_Intersection`, and `ST_Simplify` all introduce coordinate perturbations far larger than 1 cm.
Preserving all 52 mantissa bits of a float64 geographic coordinate is preserving the format’s representational capacity, not the data’s actual information content.
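One way to see this: Python can report the spacing between adjacent representable float64 values (`math.ulp`, available since Python 3.9). At longitude magnitudes around 75°, that spacing corresponds to nanometers on the ground:

```python
import math

ulp_deg = math.ulp(75.509491)       # gap to the next representable float64
ulp_mm = ulp_deg * 111_320 * 1000   # convert degrees to millimeters
print(f"{ulp_deg:.3e} degrees, about {ulp_mm:.2e} mm on the ground")
```

No survey instrument on Earth resolves position at that scale — the trailing bits are representational headroom, not measurement.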
When lossless genuinely matters
There are legitimate cases where bit-exact float64 roundtripping matters:
- Computational geometry correctness — algorithms like Shewchuk’s robust predicates depend on exact floating-point arithmetic. If you’re feeding coordinates into a constrained Delaunay triangulation, you want the exact bits back.
- Legal/regulatory requirements — some jurisdictions mandate that official boundary coordinates must not be modified in transit.
- Hash-based deduplication — if you’re comparing geometries by hash to detect duplicates, any coordinate change breaks the comparison.
- Differential updates — if downstream systems diff geometry to detect changes, quantization noise would trigger false positives.
For these cases, a lossless approach would achieve roughly 0.38–0.42× of WKB with ZSTD — about 40–50% larger than GeoSilo’s lossy encoding, but still better than GeoArrow + BSS (~0.5–0.6×). The technique would be similar: delta-encode the raw float64 bit patterns as int64, where adjacent vertices produce small deltas that compress well. Nobody has built this yet, but it’s a natural extension of the same idea.
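A minimal sketch of that idea — hypothetical, since as noted no such encoder exists yet: reinterpret each float64's bit pattern as an int64 and delta-encode those. Adjacent vertices yield deltas that are tiny relative to the 64-bit range, and the roundtrip is bit-exact:

```python
import struct

def f64_to_bits(x: float) -> int:
    """Reinterpret a float64's IEEE 754 bit pattern as a signed int64."""
    return struct.unpack("<q", struct.pack("<d", x))[0]

def bits_to_f64(n: int) -> float:
    """Inverse: reinterpret a signed int64 back into a float64."""
    return struct.unpack("<d", struct.pack("<q", n))[0]

# Two adjacent polygon vertices ~20 m apart:
a, b = -75.509491, -75.509300
delta = f64_to_bits(b) - f64_to_bits(a)
print(abs(delta), "vs", 2**63)  # the delta uses ~34 bits, not 63

# Bit-exact roundtrip — no quantization anywhere:
assert bits_to_f64(f64_to_bits(a)) == a
```

The leading bytes of every delta would be zero (or 0xFF), which is the repetitive structure a downstream codec feeds on.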
The practical tradeoff
For the vast majority of geospatial workloads — visualization, spatial joins, area/distance calculations, geocoding, routing — 1 cm precision is far more than sufficient, and the 2× compression improvement over lossless approaches directly translates to faster I/O, lower storage costs, and smaller network payloads. The “lossy” label is technically accurate but practically misleading: you’re discarding noise, not signal.
Encode/decode performance
Compression ratio is only half the story — if encoding or decoding is slow, the storage savings don’t help. Here’s what GeoSilo’s throughput looks like on US Census TIGER/Line 2025 data (single-threaded, Apple M-series):
| Table | Rows | WKB Size | Encode | Decode |
|---|---|---|---|---|
| block_group | 242,748 | 209 MB | 75 ms (2,789 MB/s) | 71 ms (2,946 MB/s) |
| zcta5 | 33,791 | 180 MB | 49 ms (3,673 MB/s) | 47 ms (3,830 MB/s) |
| tract | 85,529 | 125 MB | 41 ms (3,039 MB/s) | 39 ms (3,195 MB/s) |
| county | 3,235 | 25 MB | 7 ms (3,514 MB/s) | 7 ms (3,514 MB/s) |
At ~3 GB/s, GeoSilo encode/decode is not the bottleneck in any realistic pipeline — disk I/O, network transfer, or the spatial computation itself will dominate.
End-to-end spatial operation overhead
The more relevant benchmark is: how much does decode add to actual spatial operations?
| Table | Rows | Raw ST_Area | Decode + ST_Area | Overhead |
|---|---|---|---|---|
| block_group | 242,748 | 18 ms | 83 ms | +361% |
| county | 3,235 | 2 ms | 8 ms | +300% |
The overhead looks large in percentage terms, but that’s because `ST_Area` on DuckDB’s native GEOMETRY format is extremely fast (18 ms for 242k complex polygons). In absolute terms, decoding 209 MB of geometry takes 71 ms — and for I/O-bound workloads, the 3–4× smaller data size more than compensates.
Implementation detail: skipping the WKB roundtrip
A key optimization: DuckDB’s internal GEOMETRY format is WKB. The `Geometry::ToBinary()` function is literally a no-op pointer reinterpret:

```cpp
void Geometry::ToBinary(Vector &source, Vector &result, idx_t count) {
	// We are currently using WKB internally, so just copy as-is!
	result.Reinterpret(source);
}
```
This means GeoSilo can read GEOMETRY bytes directly as WKB on encode, and write WKB bytes directly to the GEOMETRY result on decode — no intermediate serialization step. Combined with raw pointer writes to a pre-sized buffer (eliminating per-coordinate vector::resize overhead), this brings throughput to ~3 GB/s.
How TWKB compares
TWKB (Tiny Well-Known Binary) is the closest prior art to GeoSilo. Both delta-encode scaled integer coordinates, but they make different bets about the compression pipeline:
| | TWKB | GeoSilo |
|---|---|---|
| Delta encoding | ZigZag + varint (variable-length, like protobuf) | Fixed int16 with 6-byte escape |
| Smallest delta | 1 byte | 2 bytes |
| Large delta | Grows gradually (2, 3, 4… bytes) | Jumps to 6 bytes |
| Delta chain | Continuous across rings/parts in Multi* types | Resets per ring |
| Ring closure | Implicit (first point not repeated) | Explicit (preserves WKB convention) |
| Optional metadata | Bounding boxes, sizes, ID lists | None |
| Arrow integration | None | Registers as extension type `queryfarm.geosilo` |
TWKB minimizes standalone byte size — variable-length encoding means small deltas take fewer bytes. GeoSilo minimizes compressed byte size — fixed-width int16 deltas create more repetitive patterns for ZSTD even though the raw blobs are slightly larger.
GeoSilo’s per-ring delta reset also means you can decode any ring independently, which TWKB can’t do since its delta chain is continuous across ring boundaries within a geometry.
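TWKB's variable-length scheme is easy to sketch. This is illustrative code — the zigzag and varint routines follow the protobuf conventions that TWKB borrows:

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes get small codes:
    0, -1, 1, -2 → 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

# A tiny delta costs 1 byte in TWKB but always 2 as a fixed int16:
print(len(varint(zigzag(-3))))    # 1
# A few-thousand-unit delta costs 2 bytes either way:
print(len(varint(zigzag(1910))))  # 2
```

Variable-length codes win on raw bytes for small deltas, but they destroy the fixed alignment that makes GeoSilo's blobs so repetitive after ZSTD.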
When each approach wins
GeoArrow + BSS is the better choice when:
- Lossless float64 precision is required — surveying, scientific, or regulatory use cases where ~1 cm precision loss is unacceptable.
- Columnar analytics over coordinates — computing bounding boxes across millions of geometries, aggregating coordinate values, or filtering by coordinate range without decoding entire geometries.
- Zero-copy access — workloads that benefit from directly addressable coordinates with no decode step.
- Standard ecosystem — you want any GeoParquet reader to access the data without a special extension.
GeoSilo is the better choice when:
- Storage or bandwidth is the bottleneck — ~2× better compression makes a real difference at scale.
- Mixed geometry types — a single column can hold Points, Polygons, and GeometryCollections.
- Arrow IPC transport — GeoSilo registers as an Arrow extension type, so DuckDB auto-encodes on export and auto-decodes on import. You get compact wire format without changing query code.
- Row-level spatial operations — if you’re running `ST_Area`, `ST_Intersects`, or spatial joins, you’re decoding one geometry at a time anyway. The decode cost of int16 deltas is negligible compared to the spatial computation.
TWKB is the better choice when:
- No external compression layer — serving individual geometries over HTTP (WFS responses, tile endpoints) where you can’t apply ZSTD to the transport.
- PostGIS ecosystem — TWKB has native support in PostGIS via `ST_AsTWKB`.
A theoretical best of both worlds
You could combine GeoSilo’s delta-integer encoding with GeoArrow’s columnar layout — flatten all deltas into contiguous int16 arrays, then apply BSS + ZSTD on those. Research from cbloom confirms that delta + byte-deinterleave beats either approach alone. But nobody has built this yet, and it would require a new Arrow/Parquet encoding that understands geometry structure.
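A rough sketch of what that hybrid transform would look like — synthetic deltas, zlib standing in for ZSTD, an illustration of the transform rather than a proposed format:

```python
import struct
import zlib
import random

random.seed(7)
# Synthetic int16 vertex deltas, clustered near zero like real ring deltas.
deltas = [random.randint(-300, 300) for _ in range(10_000)]
raw = b"".join(struct.pack("<h", d) for d in deltas)

# Byte-deinterleave into two streams: all low bytes, then all high bytes —
# a 2-stream BYTE_STREAM_SPLIT applied after delta encoding.
low, high = raw[0::2], raw[1::2]

# High bytes collapse to a handful of values (0x00, 0x01, 0xFE, 0xFF).
print("distinct high bytes:", len(set(high)))
print("interleaved:", len(zlib.compress(raw, 9)),
      "split:", len(zlib.compress(low + high, 9)))
```

The high-byte stream carries almost no entropy once deltas are small, which is exactly the effect the cbloom results describe for delta + byte-deinterleave.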
For now, the practical choice is between lossless columnar convenience (GeoArrow + BSS) and maximum compression with DuckDB integration (GeoSilo). Both are significant improvements over raw WKB.
GeoSilo is open source and available as a DuckDB community extension:
```sql
INSTALL geosilo FROM community;
LOAD geosilo;
```
Source: github.com/Query-farm/geosilo