Extensions
- weighted_statistics 1.0.0
- High-performance weighted statistics functions for sparse data
Documentation
- CHANGELOG
- Changelog
- README
- Performance Benchmarks
- requirements
- requirements
README
Contents
Weighted Statistics PostgreSQL Extension
High-performance PostgreSQL extension providing weighted statistical functions optimized for sparse data (automatic handling when sum(weights) < 1.0
).
Quick Start
git clone https://github.com/your-repo/weighted_statistics
cd weighted_statistics
make clean && make && sudo make install
psql -c "CREATE EXTENSION weighted_statistics;"
Available Functions:
weighted_mean(values[], weights[])
- Weighted meanweighted_variance(values[], weights[], ddof)
- Weighted variance (ddof: 0=population, 1=sample)weighted_std(values[], weights[], ddof)
- Weighted standard deviationweighted_quantile(values[], weights[], quantiles[])
- Empirical CDF quantileswquantile(values[], weights[], quantiles[])
- Type 7 (Hyndman-Fan) quantileswhdquantile(values[], weights[], quantiles[])
- Harrell-Davis quantilesweighted_median(values[], weights[])
- 50th percentile shortcut for empirical CDF
-- Basic usage examples
SELECT weighted_mean(ARRAY[1.0, 2.0, 3.0], ARRAY[0.2, 0.3, 0.5]);
-- Result: 2.3
SELECT weighted_quantile(ARRAY[10.0, 20.0, 30.0], ARRAY[0.3, 0.4, 0.3], ARRAY[0.25, 0.5, 0.75]);
-- Result: {15.0, 20.0, 25.0}
Key Features
- up to 14x faster than PL/pgSQL implementations
- Sparse data handling: Automatically adds implicit zeros when
sum(weights) < 1.0
- Mathematically validated: 100% accuracy against Python reference implementations
Requirements: PostgreSQL 12+, development headers (postgresql-server-dev-*
), C compiler
Usage Examples
-- multiple statistics
WITH example_data AS (
SELECT array_agg(val) AS vals, array_agg(weight) AS weights
FROM risk_events
)
SELECT
weighted_mean(vals, weights) AS expected_vals,
weighted_std(vals, weights, 1) AS vals_volatility,
weighted_quantile(vals, weights, ARRAY[0.05, 0.95]) AS vals_bounds
FROM example_data;
-- Sparse data handling
-- These are equivalent:
SELECT weighted_mean(ARRAY[10, 20, 0, 0], ARRAY[0.2, 0.3, 0.25, 0.25]);
SELECT weighted_mean(ARRAY[10, 20], ARRAY[0.2, 0.3]); -- Auto-adds 0.5 weight of zeros
-- Both return: 8.0
Function Comparison
| Function | Method | Best For |
| —————–– | ———————— | ———————————– |
| weighted_quantile
| Empirical CDF | General use, fast computation |
| wquantile
| Type 7 (R/NumPy default) | Standard statistical analysis |
| whdquantile
| Harrell-Davis | Smooth estimates, light-tailed data |
Testing & Validation
Multi-tier testing ensures both mathematical correctness and behavioral consistency:
# Mathematical validation (against Python reference)
cd reference && python validate_against_reference.py --database test_db
# Regression testing
make installcheck
# OR alternative: ./test/run_tests.sh
# Performance benchmarking
./benchmark/run_benchmark.sh
Validation Results: 100% accuracy against Python reference implementations following Akinshin (2023) methods.
Performance
Benchmarked results (5-iteration averages) show significant performance advantages:
- up to 14x faster than optimized PL/pgSQL (varies by function and array size)
- Sub-millisecond to millisecond execution: 0.06ms (1K quantiles) to 18.77ms (100K mean PL/pgSQL)
- Quantiles show biggest gains: Consistent 12-14x faster than PL/pgSQL across all sizes
- Mean functions scale differently: Equal at 1K elements, 4x faster at 100K elements
- Efficient quantile methods: Empirical CDF fastest, Type 7 ~1.7x slower, Harrell-Davis 25-33x slower
For more details, see the benchmark/README.md.
Development & Testing
# Build
make clean && make && sudo make install
# Test mathematical correctness
python reference/validate_against_reference.py
# Run regression tests
make installcheck
# Benchmark performance
./benchmark/run_benchmark.sh
Troubleshooting: Install dev headers with sudo apt-get install postgresql-server-dev-$(pg_config --version | grep -oP '\d+')
Disclaimer
The C implementation and PostgreSQL extension setup were generated by AI assistance, based on the self-written Python reference implementations in the reference/
directory. These Python implementations follow the weighted statistics methods described in Akinshin (2023).
All functions have been rigorously tested against the Python reference implementations to ensure mathematical correctness. The validation harness at reference/validate_against_reference.py
compares results with high precision tolerance (typically 1e-10) across diverse test cases.
Reference: Akinshin, A. (2023). Weighted quantile estimators. arXiv preprint.
Production-ready PostgreSQL extension providing 7 mathematically validated weighted statistics functions with high performance and automatic sparse data handling.