Quarto ojs DuckDB with a parquet file stored in a S3 storage

Author

Romain Lesur

Context

We assume that we are in the following context: a parquet file stored in MinIO and publicly available.
Since this file is publicly available, we do not want to attach it in the Quarto notebook.

The penguins dataset is available at https://minio.lab.sspcloud.fr/f7sggu/diffusion/penguins.parquet.

DuckDB configuration

DuckDB can be configured to read from a S3 compatible storage, see https://duckdb.org/docs/sql/configuration.html.

Here, since the parquet file is publicly available, we only need to override the s3_endpoint variable:

configuredClient = {
  const client = await DuckDBClient.of();
  await client.sql`
    SET s3_endpoint='minio.lab.sspcloud.fr'
  `;
  return client;
}

Read a parquet file

Since the S3 endpoint is properly configured, the address of the parquet file is now:

url = 's3://f7sggu/diffusion/penguins.parquet'

As explained in the DuckDB help page on parquet, we can create a view from a parquet file:

db = {
  await configuredClient.query(`
    CREATE VIEW penguins AS 
    SELECT * FROM read_parquet('${url}');
  `);
  return configuredClient
}

Request the parquet file

Now, we can request the parquet file:

res = db.sql`
  SELECT * FROM penguins LIMIT 10
`
viewof table = Inputs.table(res)

Configuration informations

This notebook was rendered with Quarto 1.3.49

Source

See https://github.com/RLesur/quarto-ojs-parquet-s3.