weighted_statistics 1.0.0

Status: Stable
Abstract: High-performance weighted statistics functions for sparse data
Description: A PostgreSQL extension providing 7 weighted statistical functions (mean, variance, standard deviation, and 3 quantile methods) optimized for sparse data, where sum(weights) < 1.0 implies implicit zeros. Offers up to 14x performance improvement over PL/pgSQL implementations through optimized C code. Based on the weighted quantile methods of Akinshin (2023).
Released By: schmidni
License: PostgreSQL

Weighted Statistics PostgreSQL Extension

High-performance PostgreSQL extension providing weighted statistical functions optimized for sparse data (automatic handling when sum(weights) < 1.0).

Quick Start

git clone https://github.com/your-repo/weighted_statistics
cd weighted_statistics
make clean && make && sudo make install
psql -c "CREATE EXTENSION weighted_statistics;"

Available Functions:

  • weighted_mean(values[], weights[]) - Weighted mean
  • weighted_variance(values[], weights[], ddof) - Weighted variance (ddof: 0=population, 1=sample)
  • weighted_std(values[], weights[], ddof) - Weighted standard deviation
  • weighted_quantile(values[], weights[], quantiles[]) - Empirical CDF quantiles
  • wquantile(values[], weights[], quantiles[]) - Type 7 (Hyndman-Fan) quantiles
  • whdquantile(values[], weights[], quantiles[]) - Harrell-Davis quantiles
  • weighted_median(values[], weights[]) - 50th percentile shortcut for empirical CDF

-- Basic usage examples
SELECT weighted_mean(ARRAY[1.0, 2.0, 3.0], ARRAY[0.2, 0.3, 0.5]);
-- Result: 2.3

SELECT weighted_quantile(ARRAY[10.0, 20.0, 30.0], ARRAY[0.3, 0.4, 0.3], ARRAY[0.25, 0.5, 0.75]);
-- Result: {15.0, 20.0, 25.0}

Key Features

  • Up to 14x faster than PL/pgSQL implementations
  • Sparse data handling: Automatically adds implicit zeros when sum(weights) < 1.0
  • Mathematically validated: 100% accuracy against Python reference implementations

Requirements: PostgreSQL 12+, development headers (postgresql-server-dev-*), C compiler

Usage Examples

-- multiple statistics
WITH example_data AS (
    SELECT array_agg(val) AS vals, array_agg(weight) AS weights
    FROM risk_events
)
SELECT 
    weighted_mean(vals, weights) AS expected_vals,
    weighted_std(vals, weights, 1) AS vals_volatility,
    weighted_quantile(vals, weights, ARRAY[0.05, 0.95]) AS vals_bounds
FROM example_data;

-- Sparse data handling
-- These are equivalent:
SELECT weighted_mean(ARRAY[10, 20, 0, 0], ARRAY[0.2, 0.3, 0.25, 0.25]);
SELECT weighted_mean(ARRAY[10, 20], ARRAY[0.2, 0.3]); -- Auto-adds 0.5 weight of zeros
-- Both return: 8.0
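The implicit-zero rule behind this equivalence can be sketched in a few lines of Python (an illustrative sketch, not the extension's C code): when the weights sum to less than 1.0, the remaining weight mass is assigned to a zero value, which contributes nothing to the weighted sum.

```python
def sparse_weighted_mean(values, weights):
    """Illustrative sketch of the implicit-zero rule (not the extension's C code)."""
    total = sum(weights)
    if total < 1.0:
        # Pad with an implicit zero carrying the remaining weight mass.
        values = list(values) + [0.0]
        weights = list(weights) + [1.0 - total]
    # Weights now sum to 1.0, so the mean is a plain dot product.
    return sum(v * w for v, w in zip(values, weights))

# The two SQL calls above correspond to these equivalent inputs:
explicit = sparse_weighted_mean([10, 20, 0, 0], [0.2, 0.3, 0.25, 0.25])
implicit = sparse_weighted_mean([10, 20], [0.2, 0.3])
assert explicit == implicit == 8.0
```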

Function Comparison

| Function          | Method                   | Best For                            |
| ----------------- | ------------------------ | ----------------------------------- |
| weighted_quantile | Empirical CDF            | General use, fast computation       |
| wquantile         | Type 7 (R/NumPy default) | Standard statistical analysis       |
| whdquantile       | Harrell-Davis            | Smooth estimates, light-tailed data |
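For intuition on the Type 7 rule that wquantile generalizes: in the unweighted case it interpolates linearly at the fractional zero-based rank h = p·(n−1), which is the default in both R and NumPy. A quick unweighted sanity check against Python's standard library (statistics.quantiles with method='inclusive' implements the same Type 7 rule):

```python
from statistics import quantiles

x = [10.0, 20.0, 30.0, 40.0]

def type7(data, p):
    """Unweighted Type 7 quantile: linear interpolation at rank h = p * (n - 1)."""
    s = sorted(data)
    h = p * (len(s) - 1)
    lo = int(h)
    if lo + 1 == len(s):          # p == 1.0: no upper neighbor to interpolate toward
        return s[lo]
    return s[lo] + (h - lo) * (s[lo + 1] - s[lo])

# statistics.quantiles(method='inclusive') follows the same Type 7 rule.
assert [type7(x, p) for p in (0.25, 0.5, 0.75)] == quantiles(x, n=4, method='inclusive')
```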

Testing & Validation

Multi-tier testing ensures both mathematical correctness and behavioral consistency:

# Mathematical validation (against Python reference)
cd reference && python validate_against_reference.py --database test_db

# Regression testing 
make installcheck
# OR alternative: ./test/run_tests.sh

# Performance benchmarking
./benchmark/run_benchmark.sh

Validation Results: 100% accuracy against Python reference implementations following Akinshin (2023) methods.

Performance

Benchmarked results (5-iteration averages) show significant performance advantages:

  • Up to 14x faster than optimized PL/pgSQL (varies by function and array size)
  • Sub-millisecond to millisecond execution: from 0.06ms (quantiles, 1K elements) up to 18.77ms (PL/pgSQL mean baseline, 100K elements)
  • Quantiles show the biggest gains: consistently 12-14x faster than PL/pgSQL across all sizes
  • Mean functions scale differently: parity at 1K elements, 4x faster at 100K elements
  • Quantile method costs: empirical CDF is fastest, Type 7 is ~1.7x slower, Harrell-Davis is 25-33x slower

For more details, see the benchmark/README.md.

Development & Testing

# Build
make clean && make && sudo make install

# Test mathematical correctness
python reference/validate_against_reference.py

# Run regression tests
make installcheck

# Benchmark performance  
./benchmark/run_benchmark.sh

Troubleshooting: Install dev headers with sudo apt-get install postgresql-server-dev-$(pg_config --version | grep -oP '\d+' | head -1)

Disclaimer

The C implementation and PostgreSQL extension setup were generated by AI assistance, based on the self-written Python reference implementations in the reference/ directory. These Python implementations follow the weighted statistics methods described in Akinshin (2023).

All functions have been rigorously tested against the Python reference implementations to ensure mathematical correctness. The validation harness at reference/validate_against_reference.py compares results with high precision tolerance (typically 1e-10) across diverse test cases.
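The flavor of that comparison can be sketched as follows (a simplified illustration; the function and variable names here are hypothetical, not the harness's actual API):

```python
import math

def reference_weighted_mean(values, weights):
    """Illustrative Python reference: weighted mean as dot product over total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# A value as the extension would return it (stubbed here),
# e.g. SELECT weighted_mean(ARRAY[1.0, 2.0, 3.0], ARRAY[0.2, 0.3, 0.5]);
extension_result = 2.3

# The harness-style check: agree with the reference within a tight tolerance.
expected = reference_weighted_mean([1.0, 2.0, 3.0], [0.2, 0.3, 0.5])
assert math.isclose(extension_result, expected, abs_tol=1e-10)
```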


Reference: Akinshin, A. (2023). Weighted quantile estimators. arXiv preprint.

Production-ready PostgreSQL extension providing 7 mathematically validated weighted statistics functions with high performance and automatic sparse data handling.