weighted_statistics 1.0.0

Status: Stable
Abstract: High-performance weighted statistics functions for sparse data
Description: A PostgreSQL extension providing 7 weighted statistical functions (mean, variance, standard deviation, and 3 quantile methods) optimized for sparse data, where sum(weights) < 1.0 implies implicit zeros. Offers up to 14x performance improvement over PL/pgSQL implementations through optimized C code. Based on the weighted quantile methods of Akinshin (2023).
Released By: schmidni
License: PostgreSQL

Weighted Statistics PostgreSQL Extension

High-performance PostgreSQL extension providing weighted statistical functions optimized for sparse data (automatic handling when sum(weights) < 1.0).

Quick Start

git clone https://github.com/your-repo/weighted_statistics
cd weighted_statistics
make clean && make && sudo make install
psql -c "CREATE EXTENSION weighted_statistics;"

Available Functions:

  • weighted_mean(values[], weights[]) - Weighted mean
  • weighted_variance(values[], weights[], ddof) - Weighted variance (ddof: 0=population, 1=sample)
  • weighted_std(values[], weights[], ddof) - Weighted standard deviation
  • weighted_quantile(values[], weights[], quantiles[]) - Empirical CDF quantiles
  • wquantile(values[], weights[], quantiles[]) - Type 7 (Hyndman-Fan) quantiles
  • whdquantile(values[], weights[], quantiles[]) - Harrell-Davis quantiles
  • weighted_median(values[], weights[]) - 50th percentile shortcut for empirical CDF

-- Basic usage examples
SELECT weighted_mean(ARRAY[1.0, 2.0, 3.0], ARRAY[0.2, 0.3, 0.5]);
-- Result: 2.3

SELECT weighted_quantile(ARRAY[10.0, 20.0, 30.0], ARRAY[0.3, 0.4, 0.3], ARRAY[0.25, 0.5, 0.75]);
-- Result: {15.0, 20.0, 25.0}

Key Features

  • Up to 14x faster than PL/pgSQL implementations
  • Sparse data handling: Automatically adds implicit zeros when sum(weights) < 1.0
  • Mathematically validated: 100% accuracy against Python reference implementations

Requirements: PostgreSQL 12+, development headers (postgresql-server-dev-*), C compiler

Usage Examples

-- multiple statistics
WITH example_data AS (
    SELECT array_agg(val) AS vals, array_agg(weight) AS weights
    FROM risk_events
)
SELECT 
    weighted_mean(vals, weights) AS expected_vals,
    weighted_std(vals, weights, 1) AS vals_volatility,
    weighted_quantile(vals, weights, ARRAY[0.05, 0.95]) AS vals_bounds
FROM example_data;

-- Sparse data handling
-- These are equivalent:
SELECT weighted_mean(ARRAY[10, 20, 0, 0], ARRAY[0.2, 0.3, 0.25, 0.25]);
SELECT weighted_mean(ARRAY[10, 20], ARRAY[0.2, 0.3]); -- Auto-adds 0.5 weight of zeros
-- Both return: 8.0
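The implicit-zero rule behind this equivalence can be sketched in a few lines of Python (an illustrative sketch, not the extension's C code): when the weights sum to less than 1.0, the remaining weight mass is assigned to a zero value, which contributes nothing to the weighted sum.

```python
def sparse_weighted_mean(values, weights):
    """Illustrative sketch of the implicit-zero rule (not the extension's C code)."""
    total = sum(weights)
    if total < 1.0:
        # Pad with an implicit zero carrying the remaining weight mass.
        values = list(values) + [0.0]
        weights = list(weights) + [1.0 - total]
    # Weights now sum to 1.0, so the mean is a plain dot product.
    return sum(v * w for v, w in zip(values, weights))

# The two SQL calls above correspond to these equivalent inputs:
explicit = sparse_weighted_mean([10, 20, 0, 0], [0.2, 0.3, 0.25, 0.25])
implicit = sparse_weighted_mean([10, 20], [0.2, 0.3])
assert explicit == implicit == 8.0
```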

Function Comparison

| Function          | Method                   | Best For                            |
| ----------------- | ------------------------ | ----------------------------------- |
| weighted_quantile | Empirical CDF            | General use, fast computation       |
| wquantile         | Type 7 (R/NumPy default) | Standard statistical analysis       |
| whdquantile       | Harrell-Davis            | Smooth estimates, light-tailed data |
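For intuition on the Type 7 rule that wquantile generalizes: in the unweighted case it interpolates linearly at the fractional zero-based rank h = p·(n−1), which is the default in both R and NumPy. A quick unweighted sanity check against Python's standard library (statistics.quantiles with method='inclusive' implements the same Type 7 rule):

```python
from statistics import quantiles

x = [10.0, 20.0, 30.0, 40.0]

def type7(data, p):
    """Unweighted Type 7 quantile: linear interpolation at rank h = p * (n - 1)."""
    s = sorted(data)
    h = p * (len(s) - 1)
    lo = int(h)
    if lo + 1 == len(s):          # p == 1.0: no upper neighbor to interpolate toward
        return s[lo]
    return s[lo] + (h - lo) * (s[lo + 1] - s[lo])

# statistics.quantiles(method='inclusive') follows the same Type 7 rule.
assert [type7(x, p) for p in (0.25, 0.5, 0.75)] == quantiles(x, n=4, method='inclusive')
```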

Testing & Validation

Multi-tier testing ensures both mathematical correctness and behavioral consistency:

# Mathematical validation (against Python reference)
cd reference && python validate_against_reference.py --database test_db

# Regression testing 
make installcheck
# OR alternative: ./test/run_tests.sh

# Performance benchmarking
./benchmark/run_benchmark.sh

Validation Results: 100% accuracy against Python reference implementations following Akinshin (2023) methods.

Performance

Benchmarked results (5-iteration averages) show significant performance advantages:

  • Up to 14x faster than optimized PL/pgSQL (varies by function and array size)
  • Sub-millisecond to millisecond execution: from 0.06ms (quantiles, 1K elements) up to 18.77ms (PL/pgSQL mean baseline, 100K elements)
  • Quantiles show the biggest gains: consistently 12-14x faster than PL/pgSQL across all sizes
  • Mean functions scale differently: parity at 1K elements, 4x faster at 100K elements
  • Quantile method costs: empirical CDF is fastest, Type 7 is ~1.7x slower, Harrell-Davis is 25-33x slower

For more details, see the benchmark/README.md.

Development & Testing

# Build
make clean && make && sudo make install

# Test mathematical correctness
python reference/validate_against_reference.py

# Run regression tests
make installcheck

# Benchmark performance  
./benchmark/run_benchmark.sh

Troubleshooting: Install dev headers with sudo apt-get install postgresql-server-dev-$(pg_config --version | grep -oP '\d+' | head -1)

Disclaimer

The C implementation and PostgreSQL extension setup were generated by AI assistance, based on the self-written Python reference implementations in the reference/ directory. These Python implementations follow the weighted statistics methods described in Akinshin (2023).

All functions have been rigorously tested against the Python reference implementations to ensure mathematical correctness. The validation harness at reference/validate_against_reference.py compares results with high precision tolerance (typically 1e-10) across diverse test cases.
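The flavor of that comparison can be sketched as follows (a simplified illustration; the function and variable names here are hypothetical, not the harness's actual API):

```python
import math

def reference_weighted_mean(values, weights):
    """Illustrative Python reference: weighted mean as dot product over total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# A value as the extension would return it (stubbed here),
# e.g. SELECT weighted_mean(ARRAY[1.0, 2.0, 3.0], ARRAY[0.2, 0.3, 0.5]);
extension_result = 2.3

# The harness-style check: agree with the reference within a tight tolerance.
expected = reference_weighted_mean([1.0, 2.0, 3.0], [0.2, 0.3, 0.5])
assert math.isclose(extension_result, expected, abs_tol=1e-10)
```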


Reference: Akinshin, A. (2023). Weighted quantile estimators. arXiv preprint.

Production-ready PostgreSQL extension providing 7 mathematically validated weighted statistics functions with high performance and automatic sparse data handling.