Important notice

This is a pre-release version of the extension and is not intended for general use yet. It may be unstable and documentation is limited. If you have any questions, please contact us at hello@open-diffix.org.

User Guide

This document provides detailed information about the configuration, behavior and recommended usage of Diffix for PostgreSQL.

What is Diffix?

Diffix is a bundled set of mechanisms for anonymizing structured data. It was jointly developed by Aircloak GmbH and the Max Planck Institute for Software Systems. Diffix exploits mechanisms that have been in use by national statistics offices for decades: aggregation, generalization, noise, suppression, and swapping. It automatically applies these mechanisms as needed on a query-by-query basis to minimize noise while ensuring strong anonymity.

What is Diffix for PostgreSQL?

Diffix for PostgreSQL is an implementation of the Diffix mechanism for PostgreSQL databases. It is provided as an extension module for PostgreSQL version 13 or higher.

Configuring the extension

System behaviour can be configured using a combination of custom variables and security labels.

Labeling database objects

The module acts as a security provider and allows the marking of database objects with various anonymization labels, which associate additional metadata, needed for anonymization, with existing objects. For more details about security labels, see the official documentation page.

Only superusers can set anonymization labels. To remove an anonymization label from an object, set it to NULL.

Execute SELECT * FROM diffix.show_labels(); to display the current labels in use by the extension.

Tables can be labeled as public or personal. Direct access is allowed to public data even for restricted users. Unlabeled tables can not be queried (unless treat_unmarked_tables_as_public is set to true).

Anonymization ID (AID) columns for a personal table have to be marked with the anonymization label aid. A personal table can have one or more AID columns.

In order to label a table as personal and AID columns, use the diffix.mark_personal(namespace, table_name, aid_columns...) procedure, for example:

SQL CALL diffix.mark_personal('public', 'my_table', 'id', 'last_name');

In order to label a table as public, use the diffix.mark_public(namespace, table_name) procedure.

Regular users can be marked with the anoymization labels direct, anonymized_trusted or anonymized_untrusted. The value of the custom variable pg_diffix.default_access_level determines the access level for unlabeled regular users.

SQL SECURITY LABEL FOR pg_diffix ON ROLE analyst IS 'anonymized_trusted';

Restricted features and extensions

At access levels other than direct, various data and features built into PostgreSQL are restricted. Among others:

  1. Issue utility statements like COPY and ALTER TABLE, beside a few allowlisted ones, are not allowed.
  2. Some of the data in pg_catalog tables like pg_user_functions is not accessible.
  3. Selected subset of less frequently used PostgreSQL query features like EXISTS or NULLIF are disabled.
  4. Inheritance involving a personal table is not allowed.
  5. Some of the output of EXPLAIN for queries involving a personal table is censored.

NOTE If any of the currently blocked features is necessary for your use case, open an issue and let us know.

Row level security (RLS) can be enabled and used on personal tables. It is advised that the active policies are vetted from the point of view of anonymity.

It is also strongly advised to vet any other extensions which are enabled alongside pg_diffix, as well as any user-defined functions and aggregate functions.

System settings

The module exposes a number of custom variables under the pg_diffix prefix. Superusers can change these variables at runtime for their own session, while regular users only have read access to them (with few notable exceptions). To use different values for all future sessions, they have to be set in the configuration file.

Execute SELECT * FROM diffix.show_settings(); to display the current settings of the extension. If the result is empty, make sure pg_diffix is loaded.

Data access settings

pg_diffix.default_access_level - Determines the access level for unlabeled users. Default value is direct.

pg_diffix.session_access_level - Sets the access level for the current session. It can never be higher than the access level for the current user. Can be changed by all users. Defaults to maximum access level allowed.

pg_diffix.treat_unmarked_tables_as_public - Controls whether unmarked tables are readable and treated as public data. Default value is false.

Noise settings

pg_diffix.noise_layer_sd - Standard deviation for each noise layer added to aggregates. Default value is 1.0.

Low count filter settings

pg_diffix.low_count_min_threshold - The lower bound for the number of distinct AID values that must be present in a bucket for it to pass the low count filter. Default value is 2.

pg_diffix.low_count_mean_gap - The number of standard deviations between the lower bound and the mean of the low count filter threshold. Default value is 2.0.

pg_diffix.low_count_layer_sd - The standard deviation for each noise layer used when calculating the low count filter threshold. Default value is 1.0.

pg_diffix.compute_suppress_bin - If True, some rows in the returned result might belong to the suppress bin, combining data of all the suppressed (rejected by the low count filter) protected entities. For an aggregating query, the suppress bin will be returned in the first row, while for a non-aggregating query - in multiple initial rows. In both cases, these rows have all column values NULL (* for text-typed columns, customizable via pg_diffix.text_label_for_suppress_bin). Note that the suppress bin may itself be suppressed and not returned at all.

pg_diffix.text_label_for_suppress_bin - The value to use for the text-typed grouping labels in the suppress bin row. Default value is *.

Aggregation settings

pg_diffix.outlier_count_min - Default value is 1. Must not be greater than outlier_count_max.

pg_diffix.outlier_count_max - Default value is 2. Must not be smaller than outlier_count_min.

pg_diffix.top_count_min - Default value is 4. Must not be greater than top_count_max.

pg_diffix.top_count_max - Default value is 6. Must not be smaller than top_count_min.

NOTE The outlier interval (outlier_count_min, outlier_count_max) must not be wider than the (top_count_min, top_count_max) interval.