Releasing vgi-rpc: An RPC Framework Built on Apache Arrow

I’ve been building data services at Query.Farm for a while now, and one thing that kept bothering me was how much ceremony is involved in defining an RPC interface. With gRPC you need .proto files, a code generation step, and you’re locked into HTTP/2. With JSON-over-HTTP you get simplicity but give up type safety and pay a real serialization cost — especially when you’re moving columnar data around.

I wanted something where I could just write a Python Protocol class with type annotations and have the framework figure out the rest. That’s what vgi-rpc is.

The Basic Idea

You define your service as a Protocol class. vgi-rpc derives Apache Arrow schemas directly from the type annotations — no .proto files, no codegen step, no schema registry.

from typing import Protocol
from vgi_rpc import serve_pipe

class Calculator(Protocol):
    def add(self, a: float, b: float) -> float: ...
    def multiply(self, a: float, b: float) -> float: ...

class CalculatorImpl:
    def add(self, a: float, b: float) -> float:
        return a + b
    def multiply(self, a: float, b: float) -> float:
        return a * b

with serve_pipe(Calculator, CalculatorImpl()) as proxy:
    print(proxy.add(a=2.0, b=3.0))       # 5.0
    print(proxy.multiply(a=4.0, b=5.0))  # 20.0

The proxy object is fully typed, so your IDE gives you autocompletion and type checking as if you were calling local methods. That was important to me — I didn’t want to lose the developer experience just because calls are crossing a process boundary.
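
To make "derive the schema from the annotations" a little more concrete, here is a rough, stdlib-only sketch of the general technique. It is not vgi-rpc's implementation: the `ARROW_TYPE_NAMES` table and the `describe_method` and `describe_protocol` helpers are hypothetical names invented for this illustration, and the type names are plain strings rather than real Arrow types.

```python
from typing import Protocol, get_type_hints

# Hypothetical lookup table from Python annotations to Arrow type names.
ARROW_TYPE_NAMES = {float: "float64", int: "int64", str: "string", bool: "bool"}


class Calculator(Protocol):
    def add(self, a: float, b: float) -> float: ...
    def multiply(self, a: float, b: float) -> float: ...


def describe_method(func) -> dict:
    """Split a method's annotations into parameter and result type names."""
    hints = get_type_hints(func)
    result = hints.pop("return", None)
    return {
        "params": {name: ARROW_TYPE_NAMES[tp] for name, tp in hints.items()},
        "result": ARROW_TYPE_NAMES.get(result),
    }


def describe_protocol(protocol: type) -> dict:
    """Describe every public method declared directly on a Protocol class."""
    return {
        name: describe_method(member)
        for name, member in vars(protocol).items()
        if callable(member) and not name.startswith("_")
    }


schemas = describe_protocol(Calculator)
print(schemas["add"])
```

The point of the sketch is only that the annotations already carry everything a schema needs, so there is nothing left for a separate interface file to say.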

Why Arrow?

I’ve been working with Apache Arrow for years at this point, and it’s the natural choice for a wire format if you care about data-heavy workloads. Arrow IPC is a well-specified, language-agnostic binary format with zero-copy read support. It’s what powers data interchange in DuckDB, Polars, pandas, and dozens of other tools.

Using Arrow as the wire format means:

No JSON parsing tax — Arrow batches can be memory-mapped directly.
Columnar layout — batch workloads benefit from cache-friendly access patterns when processing many rows.
Cross-language interop — the wire format is documented and self-contained. You could build a client in Rust, Go, or any language with Arrow support.
Ecosystem compatibility — the batches you send and receive work natively with PyArrow, pandas, Polars, and DuckDB.

At Query.Farm, we’re moving structured data between services constantly. Arrow eliminates an entire class of serialization problems.
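
The "no parsing tax" point can be shown without Arrow at all. The sketch below is a stdlib-only stand-in, not Arrow IPC: it compares a JSON payload, which must be parsed value by value, with one contiguous buffer of 8-byte floats that a reader can reinterpret in place. The variable names are made up for the example.

```python
import json
from array import array

values = [1.5, 2.5, 3.5, 4.5]

# Text path: every number is formatted on write and re-parsed on read.
json_payload = json.dumps({"value": values}).encode()
parsed = json.loads(json_payload)["value"]

# Binary columnar path: one contiguous buffer of native 8-byte floats.
column_payload = array("d", values).tobytes()

# memoryview.cast reinterprets the existing bytes as doubles; nothing is
# parsed and the underlying buffer is not copied.
column_view = memoryview(column_payload).cast("d")

print(list(column_view) == parsed, len(column_payload))
```

Arrow adds schemas, nested types, and validity bitmaps on top of this idea, but the reason reads are cheap is the same: the bytes on the wire are already in the layout the reader wants.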

Transport Agnostic

One of the design decisions I’m happiest with is that the same service definition works unchanged across five different transports:

Pipe — in-process, runs in a background thread. I use this constantly for testing.
Subprocess — communicates over stdin/stdout. Useful for isolating code or running in a different Python environment.
Unix domain sockets — low-latency local IPC between processes.
Shared memory — zero-copy batch transfer using a bump-pointer allocator. Only pointer metadata crosses the pipe; the actual batch data stays in shared memory.
HTTP — Falcon WSGI on the server, httpx on the client. Deploy as a standard web service.
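
For readers who have not met the term, a bump-pointer allocator is about the simplest allocator there is: it keeps one offset and only ever moves it forward. The sketch below is a hypothetical illustration of that idea, not vgi-rpc's allocator. The `BumpAllocator` class is invented for the example, and a `bytearray` stands in for the shared segment so the code runs anywhere.

```python
class BumpAllocator:
    """Hypothetical bump-pointer allocator over a preallocated buffer."""

    def __init__(self, buffer: memoryview, alignment: int = 8) -> None:
        self.buffer = buffer
        self.alignment = alignment
        self.offset = 0  # next free byte; only ever moves forward

    def write(self, payload: bytes) -> tuple[int, int]:
        """Copy payload into the buffer and return its (offset, length) pointer."""
        start = -(-self.offset // self.alignment) * self.alignment  # round up
        end = start + len(payload)
        if end > len(self.buffer):
            raise MemoryError("buffer exhausted")
        self.buffer[start:end] = payload
        self.offset = end
        return start, len(payload)


# A bytearray stands in for the shared memory segment in this sketch.
segment = memoryview(bytearray(64))
allocator = BumpAllocator(segment)

first = allocator.write(b"batch-one")     # 9 bytes at offset 0
second = allocator.write(b"batch-two!!")  # starts at the next 8-byte boundary

# A reader needs only the small pointer tuple to locate the data.
offset, length = second
print(first, second, bytes(segment[offset:offset + length]))
```

Only the `(offset, length)` tuples would need to travel between processes, which is the "pointer metadata" idea from the list above.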

Switching transports is a one-line change. The service implementation doesn’t know or care how it’s being called, which makes it easy to develop locally with pipes and deploy over HTTP or Unix sockets without touching the service code.
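
The shape of that separation can be sketched in a few lines of plain Python. Everything below is hypothetical and exists only for illustration: `DirectTransport`, `ThreadTransport`, and `make_transport` are not vgi-rpc names, and real transports do far more than this. What the sketch shows is that the service class never changes, only the line that picks the transport.

```python
import queue
import threading


class CalculatorImpl:
    def add(self, a: float, b: float) -> float:
        return a + b


class DirectTransport:
    """Hypothetical transport: call the implementation in the caller's thread."""

    def __init__(self, service) -> None:
        self.service = service

    def call(self, method: str, **kwargs):
        return getattr(self.service, method)(**kwargs)


class ThreadTransport:
    """Hypothetical transport: hand each request to a worker thread via queues."""

    def __init__(self, service) -> None:
        self.requests: queue.Queue = queue.Queue()
        self.responses: queue.Queue = queue.Queue()
        worker = threading.Thread(target=self._serve, args=(service,), daemon=True)
        worker.start()

    def _serve(self, service) -> None:
        while True:
            method, kwargs = self.requests.get()
            self.responses.put(getattr(service, method)(**kwargs))

    def call(self, method: str, **kwargs):
        self.requests.put((method, kwargs))
        return self.responses.get()


def make_transport(kind: str, service):
    # The one line that changes when switching transports.
    return {"direct": DirectTransport, "thread": ThreadTransport}[kind](service)


results = [
    make_transport(kind, CalculatorImpl()).call("add", a=2.0, b=3.0)
    for kind in ("direct", "thread")
]
print(results)
```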

Streaming

Beyond standard unary request/response, vgi-rpc supports two streaming patterns that I found essential for the kinds of data services I build:

Producer streams let the server push multiple batches to the client. The client just iterates until the stream completes:

from dataclasses import dataclass
from vgi_rpc import ProducerState

@dataclass
class CountdownState(ProducerState):
    n: int

    def produce(self, out, ctx):
        if self.n <= 0:
            out.finish()
            return
        out.emit_pydict({"value": [self.n]})
        self.n -= 1

# Client side
for batch in proxy.countdown(n=5):
    print(batch.batch.to_pydict())
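
To see how a produce-until-finish loop of this shape can be driven, here is a self-contained sketch that runs without vgi-rpc installed. `OutputCollector` and `drive` are hypothetical stand-ins written for this example. They mimic only the two calls used above (`emit_pydict` and `finish`) and say nothing about how the real framework schedules or serializes batches.

```python
from dataclasses import dataclass, field


@dataclass
class OutputCollector:
    """Hypothetical stand-in for the `out` argument passed to produce()."""
    batches: list = field(default_factory=list)
    finished: bool = False

    def emit_pydict(self, data: dict) -> None:
        self.batches.append(data)

    def finish(self) -> None:
        self.finished = True


@dataclass
class CountdownState:
    n: int

    def produce(self, out, ctx) -> None:
        if self.n <= 0:
            out.finish()
            return
        out.emit_pydict({"value": [self.n]})
        self.n -= 1


def drive(state):
    """Call produce() repeatedly, yielding each batch until finish() is seen."""
    while True:
        out = OutputCollector()
        state.produce(out, ctx=None)
        yield from out.batches
        if out.finished:
            return


values = [batch["value"][0] for batch in drive(CountdownState(n=3))]
print(values)
```

All of the stream's state lives in the dataclass, so each `produce` call is a small step that can be resumed later.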

Exchange streams are bidirectional but in lockstep — the client sends data, the server responds, one round at a time. Only one side is active at any given moment, which eliminates the need for buffering or locking. I use this pattern for incremental data processing where backpressure matters.

Structured Types

Python dataclasses map directly to Arrow structs. Enums become Arrow dictionary-encoded values. Lists, dicts, and optionals all work the way you’d expect:

from dataclasses import dataclass
from enum import Enum
from vgi_rpc import ArrowSerializableDataclass

class Priority(Enum):
    LOW = "low"
    HIGH = "high"

@dataclass(frozen=True)
class Task(ArrowSerializableDataclass):
    name: str
    priority: Priority
    tags: list[str]

If the automatic type mapping doesn’t do what you want, Annotated[T, ArrowType(...)] lets you override the Arrow type explicitly. In practice I rarely need this, but it’s there when the defaults aren’t right.
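
The mechanics of a default mapping plus an `Annotated` override can be sketched with the standard library alone. This is an illustration of the technique, not vgi-rpc's mapping rules. `ArrowTypeName` is a hypothetical marker standing in for the real `ArrowType(...)`, the `retries` field is added just to show an override, and the type names are plain strings.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class ArrowTypeName:
    """Hypothetical override marker; stands in for the real ArrowType(...)."""
    name: str


class Priority(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Task:
    name: str
    priority: Priority
    tags: list[str]
    retries: Annotated[int, ArrowTypeName("int8")]  # explicit override


SCALARS = {str: "string", int: "int64", float: "float64", bool: "bool"}


def arrow_name(tp) -> str:
    """Map one annotation to an Arrow-style type name."""
    if get_origin(tp) is Annotated:
        base, *extras = get_args(tp)
        for extra in extras:
            if isinstance(extra, ArrowTypeName):
                return extra.name  # the override wins over the default
        return arrow_name(base)
    if get_origin(tp) is list:
        (item,) = get_args(tp)
        return f"list<{arrow_name(item)}>"
    if isinstance(tp, type) and issubclass(tp, Enum):
        return "dictionary"
    return SCALARS[tp]


def struct_fields(cls) -> dict:
    """Collect an Arrow-style type name for every dataclass field."""
    hints = get_type_hints(cls, include_extras=True)
    return {f.name: arrow_name(hints[f.name]) for f in fields(cls)}


print(struct_fields(Task))
```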

The Rest of It

There’s a lot more in vgi-rpc that I built because I actually needed it in production:

Authentication — pluggable AuthContext middleware for HTTP. JWT, API keys, whatever your stack needs.
Runtime introspection — an optional __describe__ method lets clients discover available methods and schemas without having the Protocol class. Useful for tooling and debugging.
CLI tools — vgi-rpc describe and vgi-rpc call let you poke at services from the command line without writing any code.
Large batch externalization — batches that exceed a configurable size get automatically uploaded to S3 or GCS and replaced with lightweight pointer batches. The client fetches them transparently with parallel range requests.
Per-call statistics — CallStatistics tracks input/output batches, rows, and bytes. Essential for usage accounting.
OpenTelemetry — optional tracing and metrics with context propagation.
Wire protocol debugging — structured logging at the vgi_rpc.wire hierarchy shows exactly what’s flowing over the wire when things go wrong.
Error propagation — remote exceptions arrive as RpcError with the original type, message, and traceback. A failed call doesn’t poison the connection.
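
The error propagation item is worth a small illustration. The sketch below shows the general technique of capturing an exception on the server, shipping it as plain data, and re-raising it on the client. It is not the library's code: this `RpcError` class, `handle_request`, and `call` are hypothetical and written only for this example.

```python
import traceback


class RpcError(Exception):
    """Hypothetical client-side error carrying the remote failure details."""

    def __init__(self, error_type: str, message: str, remote_traceback: str) -> None:
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message
        self.remote_traceback = remote_traceback


def handle_request(service, method: str, kwargs: dict) -> dict:
    """Server side: turn any exception into a plain, serializable response."""
    try:
        return {"ok": True, "value": getattr(service, method)(**kwargs)}
    except Exception as exc:
        return {
            "ok": False,
            "type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }


def call(service, method: str, **kwargs):
    """Client side: re-raise a failed response as RpcError."""
    response = handle_request(service, method, kwargs)
    if not response["ok"]:
        raise RpcError(response["type"], response["message"], response["traceback"])
    return response["value"]


class Divider:
    def divide(self, a: float, b: float) -> float:
        return a / b


service = Divider()
try:
    call(service, "divide", a=1.0, b=0.0)
except RpcError as err:
    failure = err

# The same service object still answers the next call.
recovered = call(service, "divide", a=6.0, b=3.0)
print(failure.error_type, recovered)
```

Since the failure travels as an ordinary response, the request loop itself never sees an exception, which is why one bad call leaves the connection usable.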

None of these features were designed in advance. They all came from running real services and hitting real problems.
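
Of the features listed above, wire protocol debugging is the easiest to try, because it goes through Python's standard `logging` module. A minimal sketch follows. It assumes the wire messages are emitted at `DEBUG` level, and the child logger name `vgi_rpc.wire.example` is hypothetical, used only to show that the level is inherited.

```python
import logging

# Send log records to stderr with timestamps; adjust the format to taste.
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")

# Loggers below "vgi_rpc.wire" inherit this level unless they set their own.
wire_logger = logging.getLogger("vgi_rpc.wire")
wire_logger.setLevel(logging.DEBUG)

# Hypothetical child name, used only to demonstrate the inheritance.
child = logging.getLogger("vgi_rpc.wire.example")
print(child.getEffectiveLevel() == logging.DEBUG)
```

Setting the level on the parent logger rather than the root keeps the rest of the application's logging at its usual verbosity.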

Get It

vgi-rpc requires Python 3.13+ and is licensed under Apache 2.0.

pip install vgi-rpc

Optional extras for specific transports and features:

pip install "vgi-rpc[http]"     # HTTP transport
pip install "vgi-rpc[s3]"       # S3 external storage
pip install "vgi-rpc[gcs]"      # GCS external storage
pip install "vgi-rpc[cli]"      # CLI tools
pip install "vgi-rpc[otel]"     # OpenTelemetry instrumentation

Documentation: vgi-rpc.query.farm

Source code:

Python: github.com/Query-farm/vgi-rpc-python
TypeScript: github.com/Query-farm/vgi-rpc-typescript
C++: github.com/Query-farm/vgi-rpc-cpp
Go: github.com/Query-farm/vgi-rpc-go

One More Thing

vgi-rpc is part of a larger story. There’s more to VGI that I’m not quite ready to talk about yet, but it’s coming soon. Stay tuned.