# Preparing a Dataset
This document walks through the full process of preparing a new non-synthetic dataset for benchmarking. The high-level steps are:
- Load source data into S3
- Create a config and sample the data at each size you need
- Convert the sampled parquet data to CSV
## Step 1: Load Source Data into S3
Upload your source data as partitioned parquet files to S3. Each table should be in its own subdirectory under a `source/parquet/` path (the part file names don't matter):
```
s3://paradedb-ci-benchmark/datasets/{dataset-name}/source/parquet/
├── {table_a}/
│   ├── part-0001.parquet
│   ├── part-0002.parquet
│   └── ...
├── {table_b}/
│   └── ...
└── {table_c}/
    └── ...
```
For example, for the Stack Overflow dataset:
```
s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/
├── stackoverflow_posts/
├── comments/
└── users/
```
## Step 2: Create a Config and Sample
### Writing the Config
Create a TOML config file at `datasets/{dataset-name}/config.toml` that describes your table relationships. The config specifies which table is the root (the one that gets sampled directly) and how child tables relate to it via joins.
```toml
root_table = "root_table_name"
sampling_seed = 723  # Fixed seed for deterministic results

[[tables]]
name = "root_table_name"

[[tables]]
name = "child_table"
parent = "root_table_name"
parent_join_col = "id"    # Column in the parent table
join_col = "parent_id"    # Corresponding column in the child table
```
Fields:

- `root_table` – The primary table. The `--rows` argument controls how many rows are sampled from this table.
- `sampling_seed` – Seed for deterministic, reproducible sampling.
- `[[tables]]` – One entry per table. The root table has no `parent`. Child tables specify `parent`, `parent_join_col`, and `join_col` to define the relationship.
Child tables are not sampled independently. Instead, they are filtered via an inner join with their parent table, so only rows that reference a sampled parent row are kept. This preserves referential integrity across the dataset.
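The join-based filtering can be illustrated with a small Python sketch (not the tool's actual code; the table and column names here are invented for the example):

```python
# Root rows kept by sampling (pretend these survived the --rows cut).
sampled_posts = [{"id": 1}, {"id": 3}]

comments = [
    {"comment_id": 10, "parent_id": 1},
    {"comment_id": 11, "parent_id": 2},  # references an unsampled post
    {"comment_id": 12, "parent_id": 3},
]

# Inner-join semantics: keep only child rows whose join column matches a
# sampled parent row, so every foreign key in the output still resolves.
kept_parent_ids = {row["id"] for row in sampled_posts}
sampled_comments = [c for c in comments if c["parent_id"] in kept_parent_ids]
```

Comment 11 is dropped because its parent post was not sampled; referential integrity is preserved by construction.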
Tables can form a multi-level hierarchy (a child table can itself be the parent of another table). Tables are processed in topological order, so each parent is fully processed before its children are filtered against it.
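The processing order follows directly from the `parent` declarations. A minimal sketch of deriving it (illustrative only, with invented table names; not the tool's implementation):

```python
# Each entry mirrors a [[tables]] block: the root has no parent.
tables = [
    {"name": "users", "parent": None},
    {"name": "posts", "parent": "users"},
    {"name": "comments", "parent": "posts"},
]

def topo_order(tables):
    """Return table names so that every parent precedes its children."""
    ordered, placed = [], set()
    remaining = list(tables)
    while remaining:
        for t in remaining:
            # A table is ready once its parent (if any) has been placed.
            if t["parent"] is None or t["parent"] in placed:
                ordered.append(t["name"])
                placed.add(t["name"])
                remaining.remove(t)
                break
        else:
            raise ValueError("cycle in parent declarations")
    return ordered
```

Regardless of the order the `[[tables]]` entries appear in the config, the root comes out first and each child after its parent.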
See `datasets/stackoverflow/config.toml` for a real example.
### Running the Sampling Tool
Run the `sample` command once for each dataset size you need. The `--rows` argument sets the target row count for the root table. Output goes to the `sampled/{size}/parquet/` path.
```shell
# Sample to 10k rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 10000

# Sample to 100k rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 100000

# Sample to 1m rows
cargo run --release -- sample \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/parquet/ \
  --config ./datasets/stackoverflow/config.toml \
  --rows 1000000
```
Notes:
- The output path must be empty (no pre-existing data).
- For small targets (<= 100k rows), sampling is exact, using reservoir sampling. For larger targets, it uses system sampling and the result will be approximate (within ~3-5%).
- Use `--dry-run` to validate inputs and see planned row counts without writing anything.
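For the exact small-target case, the classic reservoir-sampling technique ("Algorithm R") can be sketched as follows. This is only an illustration of the technique, not the tool's actual implementation:

```python
import random

def reservoir_sample(rows, k, seed):
    """Exact k-row sample in a single pass over the input.

    Deterministic for a fixed seed, which is why the config pins a
    sampling_seed for reproducible results.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace a reservoir entry with probability k / (i + 1),
            # which keeps every row equally likely to be in the sample.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

With the same seed, repeated runs over the same input produce the identical sample.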
## Step 3: Convert Sampled Data to CSV
Run the `convert` command for each sampled size to produce CSV versions. The `--tables` flag takes a comma-separated list of all tables to convert.
```shell
# Convert 10k sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/10k/csv/ \
  --tables stackoverflow_posts,comments,users

# Convert 100k sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/100k/csv/ \
  --tables stackoverflow_posts,comments,users

# Convert 1m sampled data
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/sampled/1m/csv/ \
  --tables stackoverflow_posts,comments,users
```
You can also convert the full source data:
```shell
cargo run --release -- convert \
  --input s3://paradedb-ci-benchmark/datasets/stackoverflow/source/parquet/ \
  --output s3://paradedb-ci-benchmark/datasets/stackoverflow/source/csv/ \
  --tables stackoverflow_posts,comments,users
```
Notes:
- The output path must be empty.
- Row counts are verified after conversion to ensure no data is lost.
- Use `--dry-run` to validate without writing.
- AWS credentials must be accessible via the standard credential chain (environment variables, `~/.aws/credentials`, or instance metadata).
## Final S3 Layout
After completing all three steps, your dataset will look like this:
```
s3://paradedb-ci-benchmark/datasets/{dataset-name}/
├── source/
│   ├── parquet/
│   │   ├── {table_a}/
│   │   ├── {table_b}/
│   │   └── {table_c}/
│   └── csv/
│       ├── {table_a}/
│       ├── {table_b}/
│       └── {table_c}/
└── sampled/
    ├── 10k/
    │   ├── parquet/
    │   │   ├── {table_a}/
    │   │   ├── {table_b}/
    │   │   └── {table_c}/
    │   └── csv/
    │       ├── {table_a}/
    │       ├── {table_b}/
    │       └── {table_c}/
    ├── 100k/
    │   └── ...
    └── 1m/
        └── ...
```