Preparing a Dataset

This document walks through the full process of preparing a new non-synthetic dataset for benchmarking. The high-level steps are:

1. Load source data into S3.
2. Create a config and sample the data at each size you need.
3. Convert the sampled parquet data to CSV.

Step 1: Load Source Data into S3

Upload your source data to S3 as partitioned parquet files. Each table goes in its own subdirectory under the source/parquet/ path; the file names within a table's subdirectory do not matter:

s3://paradedb-ci-benchmark/datasets/{dataset-name}/source/parquet/
├── {table_a}/
│   ├── part-0001.parquet
│   ├── part-0002.parquet
│   └── ...
├── {table_b}/
│   └── ...
└── {table_c}/
    └── ...

For example, for the Stack Overflow dataset:

s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/
├── stackoverflow_posts/
├── comments/
└── users/

Step 2: Create a Config and Sample

Writing the Config

Create a TOML config file at datasets/{dataset-name}/config.toml that describes your table relationships. The config specifies which table is the root (the one that gets sampled directly) and how child tables relate to it via joins.

root_table = "root_table_name"
sampling_seed = 723               # Fixed seed for deterministic results

[[tables]]
name = "root_table_name"

[[tables]]
name = "child_table"
parent = "root_table_name"
parent_join_col = "id"            # Column in the parent table
join_col = "parent_id"            # Corresponding column in the child table

Fields:

root_table – The primary table. The --rows argument controls how many rows are sampled from this table.
sampling_seed – Seed for deterministic, reproducible sampling.
[[tables]] – One entry per table. The root table has no parent. Child tables specify parent, parent_join_col, and join_col to define the relationship.
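The field rules above can be checked mechanically before running the tool. The following is a minimal sketch, not part of the sampling tool: it assumes the config has already been parsed into a dictionary (for example with Python's tomllib), and validate_config is a hypothetical helper name.

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a parsed sampling config.

    Hypothetical helper: it only encodes the rules described in this
    document, not the checks the real tool performs.
    """
    problems = []
    tables = config.get("tables", [])
    names = {t.get("name") for t in tables}
    root = config.get("root_table")

    # The root table must be declared as one of the [[tables]] entries.
    if root not in names:
        problems.append(f"root_table {root!r} has no [[tables]] entry")

    for t in tables:
        name = t.get("name")
        if name == root:
            # The root table is sampled directly and has no parent.
            if "parent" in t:
                problems.append(f"root table {name!r} must not have a parent")
            continue
        # Every child table needs all three relationship fields.
        for key in ("parent", "parent_join_col", "join_col"):
            if key not in t:
                problems.append(f"table {name!r} is missing {key!r}")
        parent = t.get("parent")
        if parent is not None and parent not in names:
            problems.append(f"table {name!r} references unknown parent {parent!r}")
    return problems


# The example config from this document, as a parsed dictionary.
config = {
    "root_table": "root_table_name",
    "sampling_seed": 723,
    "tables": [
        {"name": "root_table_name"},
        {"name": "child_table", "parent": "root_table_name",
         "parent_join_col": "id", "join_col": "parent_id"},
    ],
}
print(validate_config(config))  # prints []
```

An empty list means the config satisfies the rules listed here; the tool itself may enforce additional constraints.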

Child tables are not sampled independently. Instead, they are filtered via an inner join with their parent table, so only rows that reference a sampled parent row are kept. This preserves referential integrity across the dataset.
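To make the filtering concrete, here is a small in-memory sketch of the idea. It is an illustration only, not the tool's implementation: rows are plain dictionaries, filter_child_rows is a hypothetical name, and the parent join column is assumed to hold unique values (so the inner join never duplicates child rows).

```python
def filter_child_rows(parent_rows, child_rows, parent_join_col, join_col):
    """Keep only child rows whose join_col matches a sampled parent row."""
    # Collect the keys of the parent rows that survived sampling.
    parent_keys = {row[parent_join_col] for row in parent_rows}
    # A child row is kept only if it references one of those keys.
    return [row for row in child_rows if row[join_col] in parent_keys]


# Hypothetical data: two sampled parent rows and four child rows.
sampled_parents = [{"id": 1}, {"id": 3}]
children = [
    {"id": 10, "parent_id": 1},
    {"id": 11, "parent_id": 2},  # parent 2 was not sampled, so this row is dropped
    {"id": 12, "parent_id": 3},
    {"id": 13, "parent_id": 3},
]
kept = filter_child_rows(sampled_parents, children, "id", "parent_id")
print([row["id"] for row in kept])  # prints [10, 12, 13]
```

Because every kept child row points at a sampled parent, no dangling references are left in the output.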

Tables can form a hierarchy: a child table can itself be the parent of another table. Tables are processed in topological order, so every parent is processed before its children.
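A processing order of this kind can be derived from the parent fields in the config. The sketch below uses Python's standard graphlib module; the table names and the processing_order helper are hypothetical, and the real tool may compute its order differently.

```python
from graphlib import TopologicalSorter


def processing_order(tables):
    """Return table names ordered so that each parent precedes its children."""
    # Map every table to the set of tables it depends on (its parent, if any).
    graph = {
        t["name"]: ({t["parent"]} if "parent" in t else set())
        for t in tables
    }
    # static_order() yields dependencies first and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


# A three-level hierarchy, deliberately listed out of order.
tables = [
    {"name": "grandchild", "parent": "child"},
    {"name": "child", "parent": "root"},
    {"name": "root"},
]
print(processing_order(tables))  # prints ['root', 'child', 'grandchild']
```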

See datasets/stackoverflow/config.toml for a real example.

Running the Sampling Tool

Run the sample command once for each dataset size you need. The --rows argument sets the target row count for the root table. Output goes to the sampled/{size}/parquet/ path.

# Sample to 10k rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 10000

# Sample to 100k rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 100000

# Sample to 1m rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 1000000

Notes:

The output path must be empty (no pre-existing data).
For targets of 100k rows or fewer, the tool uses reservoir sampling and returns exactly the requested number of rows. For larger targets, it uses system sampling, so the resulting row count is approximate (within roughly 3-5% of the target).
Use --dry-run to validate inputs and see planned row counts without writing anything.
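For readers unfamiliar with reservoir sampling, the sketch below shows the classic single-pass variant (Algorithm R) with a fixed seed. It illustrates why the row count is exact and why a seed makes the result reproducible; it is not the tool's implementation, and reservoir_sample is a hypothetical name.

```python
import random


def reservoir_sample(rows, k, seed):
    """Pick exactly k rows uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


sample_a = reservoir_sample(range(1_000), k=10, seed=723)
sample_b = reservoir_sample(range(1_000), k=10, seed=723)
print(len(sample_a), sample_a == sample_b)  # prints 10 True
```

If the input has fewer than k rows, this sketch simply returns all of them.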

Step 3: Convert Sampled Data to CSV

Run the convert command for each sampled size to produce CSV versions. The --tables flag takes a comma-separated list of all tables to convert.

# Convert 10k sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/csv/ \
  --tables stackoverflow_posts,comments,users

# Convert 100k sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/csv/ \
  --tables stackoverflow_posts,comments,users

# Convert 1m sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/csv/ \
  --tables stackoverflow_posts,comments,users

You can also convert the full source data:

cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/source/csv/ \
  --tables stackoverflow_posts,comments,users

Notes:

The output path must be empty.
Row counts are verified after conversion to ensure no data is lost.
Use --dry-run to validate without writing.
AWS credentials must be accessible via the standard credential chain (env vars, ~/.aws/credentials, or instance metadata).

Final S3 Layout

After completing all three steps, and optionally converting the full source data (which produces source/csv/), your dataset will look like this:

s3://paradedb-ci-benchmark/datasets/{dataset-name}/
├── source/
│   ├── parquet/
│   │   ├── {table_a}/
│   │   ├── {table_b}/
│   │   └── {table_c}/
│   └── csv/
│       ├── {table_a}/
│       ├── {table_b}/
│       └── {table_c}/
└── sampled/
    ├── 10k/
    │   ├── parquet/
    │   │   ├── {table_a}/
    │   │   ├── {table_b}/
    │   │   └── {table_c}/
    │   └── csv/
    │       ├── {table_a}/
    │       ├── {table_b}/
    │       └── {table_c}/
    ├── 100k/
    │   └── ...
    └── 1m/
        └── ...
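Since every path in this layout follows the same pattern, the sample and convert invocations can be generated rather than typed by hand. The helper below is a hypothetical convenience sketch, not part of the tool: it only assembles the argument lists shown earlier in this document, using the same bucket, flags, and config location.

```python
BUCKET_PREFIX = "s3://paradedb-ci-benchmark/datasets"


def source_prefix(dataset, fmt):
    """S3 prefix for the full source data in the given format."""
    return f"{BUCKET_PREFIX}/{dataset}/source/{fmt}/"


def sampled_prefix(dataset, size, fmt):
    """S3 prefix for one sampled size (e.g. '10k') in the given format."""
    return f"{BUCKET_PREFIX}/{dataset}/sampled/{size}/{fmt}/"


def sample_args(dataset, size, rows):
    """Argument list for the sample command at one size."""
    return [
        "cargo", "run", "--release", "--", "sample",
        "--input", source_prefix(dataset, "parquet"),
        "--output", sampled_prefix(dataset, size, "parquet"),
        "--config", f"./datasets/{dataset}/config.toml",
        "--rows", str(rows),
    ]


def convert_args(dataset, size, tables):
    """Argument list for the convert command at one sampled size."""
    return [
        "cargo", "run", "--release", "--", "convert",
        "--input", sampled_prefix(dataset, size, "parquet"),
        "--output", sampled_prefix(dataset, size, "csv"),
        "--tables", ",".join(tables),
    ]


sizes = {"10k": 10_000, "100k": 100_000, "1m": 1_000_000}
tables = ["stackoverflow_posts", "comments", "users"]
for size, rows in sizes.items():
    print(" ".join(sample_args("stackoverflow", size, rows)))
    print(" ".join(convert_args("stackoverflow", size, tables)))
```

Each list could be passed to subprocess.run to execute the command; printing first makes it easy to review the paths before anything is written.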
