DuckDB

Enhancing DuckDB with Unix Pipe Integration: Introducing the shellfs Extension

Introducing shellfs, a DuckDB extension that enables seamless integration with Unix pipes for both input and output, allowing command-line programs to be used directly within DuckDB queries.

I created a new DuckDB extension called “shellfs” that enables seamless integration with Unix pipes for both input and output. This feature allows users to harness the power of command-line programs and scripts directly within DuckDB, enhancing its flexibility and efficiency in processing complex data tasks.

You can install or build the extension from GitHub.

The creation of this extension was part of my original posting about how DuckDB is Strategically Important.

Key Features of the shellfs Extension

Input and Output Piping

Input Pipe: By appending a pipe character | to a filename, DuckDB treats the filename as a series of commands to execute, capturing their output.

Output Pipe: Prefixing a filename with | directs DuckDB to treat it as an output pipe, enabling direct command-line processing of query results.

Practical Applications

While the provided examples are straightforward and quite simple, the potential uses are vast. With this extension you can now use external programs to generate CSV, JSON, or other formats without creating temporary files. This makes it easier to integrate DuckDB into workflows that handle complexities beyond DuckDB’s native format capabilities of (CSV, JSON, and Parquet).

Examples

Filter a sequence for numbers containing 2

SELECT * FROM read_csv('seq 1 100 | grep 2 |');

Chaining commands to get the first two multiples of 7.

SELECT * FROM read_csv('seq 7 7 700 | head -2 |');

Requesting data via curl

SELECT * FROM read_csv('curl -s https://example.com/data.csv |');

Generating CSV via Python

Creating a CSV output in test-csv.py.

import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerow(["name", "value"])
for i in range(100):
    writer.writerow([f"item_{i}", i * 7])

Run that program and determine the number of distinct values it produces:

SELECT count(DISTINCT value) FROM read_csv('python3 test-csv.py |');

Writing Output to a Pipe

Filter And Save Numbers Containing 6

COPY (SELECT * FROM range(100) WHERE range::VARCHAR LIKE '%6%')
TO '| cat > sixes.csv' (FORMAT CSV);

Copy Result to Clipboard on Mac OS X

COPY (SELECT 'hello' AS greeting) TO '| pbcopy' (FORMAT CSV);

Encrypt Output with OpenSSL

COPY (SELECT 'secret data' AS data) TO '| openssl enc -aes-256-cbc -out encrypted.bin' (FORMAT CSV);

Configuration: Handling SIGPIPE Signals

The shellfs extension introduces a new configuration option, ignore_sigpipe, which when set to true, prevents DuckDB from exiting on a SIGPIPE signal. This is particularly useful when writing to a pipe that stops reading input. Enable this option with:

SET ignore_sigpipe = true;

Example of generating a SIGPIPE signal on output:

COPY (SELECT * FROM range(1000000)) TO '| head -1 > /dev/null' (FORMAT CSV);

By leveraging the shellfs extension, DuckDB users can now integrate Unix command-line prowess directly into their data workflows, unlocking new levels of data processing and manipulation efficiency.

#DuckDB #Extensions #Unix #Shell