Privacy Protection Assessment

Evaluate the privacy protection level of processed data by simulating three privacy attack scenarios. The evaluation uses Anonymeter, a Python library developed by Statice that implements the anonymization evaluation criteria proposed by the EU's Article 29 Data Protection Working Party (WP29) in its 2014 Opinion on Anonymisation Techniques, and that was positively evaluated by the French Data Protection Authority (CNIL) in 2023.

Usage Examples


Singling Out Risk

Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark://adult-income_ori
      control: benchmark://adult-income_control
    schema:
      ori: benchmark://adult-income_schema
      control: benchmark://adult-income_schema
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark://adult-income_syn
    schema: benchmark://adult-income_schema
Evaluator:
  singling_out_risk:
    method: anonymeter-singlingout
    n_attacks: 400          # Number of attacks (default: 2,000)
    n_cols: 3               # Columns per query (default: 3)
    max_attempts: 4000      # Maximum attempts (default: 500,000)
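
If you run PETsARD programmatically, any configuration in this section can be saved as a YAML file and executed through PETsARD's Executor. A minimal sketch, assuming the configuration above is saved as singling-out.yaml (check the PETsARD documentation for the current interface):

# Run the singling-out evaluation defined in the YAML above.
from petsard import Executor

executor = Executor(config='singling-out.yaml')
executor.run()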

Linkability Risk

Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark://adult-income_ori
      control: benchmark://adult-income_control
    schema:
      ori: benchmark://adult-income_schema
      control: benchmark://adult-income_schema
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark://adult-income_syn
    schema: benchmark://adult-income_schema
Evaluator:
  linkability_risk:
    method: anonymeter-linkability
    max_n_attacks: true      # Use control dataset size (default: true)
    n_neighbors: 1           # Nearest neighbors (default: 1)
    aux_cols:                # Auxiliary columns (default: None)
      -                      # First list: Public data columns
        - workclass
        - education
        - occupation
        - race
        - gender
      -                      # Second list: Private data columns
        - age
        - marital-status
        - relationship
        - native-country
        - income

Inference Risk

The Splitter and Synthesizer blocks are configured as in the previous examples; only the Evaluator block is shown here:

Evaluator:
  inference_risk:
    method: anonymeter-inference
    max_n_attacks: true      # Use control dataset size (default: true)
    secret: income           # Sensitive column to infer (required)

Main Parameters

Singling Out Risk Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| method | string | Required | - | Fixed value: anonymeter-singlingout |
| n_attacks | integer | Optional | 2,000 | Number of attack executions. Recommendation: standardize to 2,000 |
| n_cols | integer | Optional | 3 | Number of columns used per query. Recommendation: use the 3-column multivariate mode |
| max_attempts | integer | Optional | 500,000 | Maximum attempts to find successful attacks. Recommendation: reduce only when execution time is too long |
ℹ️

Note on Computational Efficiency: Anonymeter's singling-out attack samples candidate queries with replacement and has no check for whether the data can actually yield the requested number of distinct attacks. If it cannot, the evaluator keeps trying until max_attempts is exhausted, which can add a significant computational burden.

NICS Recommended Guidelines:

  • n_attacks: Between 100 and n_rows/100
  • max_attempts: Between 1,000 and n_rows/10
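
For concreteness, the following sketch (not part of PETsARD or Anonymeter) derives these guideline ranges from a dataset's row count; the row count used here is purely hypothetical:

def nics_recommended_bounds(n_rows: int) -> dict:
    # Guideline ranges described above:
    #   n_attacks:    between 100 and n_rows / 100
    #   max_attempts: between 1,000 and n_rows / 10
    return {
        "n_attacks": (100, n_rows // 100),
        "max_attempts": (1_000, n_rows // 10),
    }

print(nics_recommended_bounds(50_000))
# {'n_attacks': (100, 500), 'max_attempts': (1000, 5000)}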

Linkability Risk Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| method | string | Required | - | Fixed value: anonymeter-linkability |
| n_attacks | integer | Optional | None | Number of attack executions. Can be omitted when max_n_attacks=true; ignored when max_n_attacks is true |
| max_n_attacks | boolean | Optional | true | Whether to automatically adjust n_attacks to the control dataset size. When false: uses the configured n_attacks value (which must then be specified). When true (default): ignores the n_attacks setting and uses the control dataset size instead |
| aux_cols | array | Optional | None | Auxiliary information columns. Format: two non-overlapping lists, simulating data held by different entities. Selection guideline: divide column names into two lists based on your understanding of systems, functions, and business logic; not all variables need to be included, but key variables should be covered. The split is necessarily somewhat subjective and is meant to assess linkability risk under realistic scenarios |
| n_neighbors | integer | Optional | 1 | Number of nearest neighbors to consider. Recommendation: set to 1 for the strictest evaluation. Because linkability is a difficult attack mode, if the closest match fails, less similar records pose no immediate additional risk |

Inference Risk Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| method | string | Required | - | Fixed value: anonymeter-inference |
| n_attacks | integer | Optional | None | Number of attack executions. Can be omitted when max_n_attacks=true; ignored when max_n_attacks is true |
| max_n_attacks | boolean | Optional | true | Whether to automatically adjust n_attacks to the control dataset size. When false: uses the configured n_attacks value (which must then be specified). When true (default): ignores the n_attacks setting and uses the control dataset size instead |
| secret | string | Required | - | Name of the sensitive column to infer. Recommendation: use the target modeling column (Y) or the most sensitive column |
| aux_cols | array | Optional | All columns except secret | List of columns used for inference |

Others

ℹ️

Missing Value Handling

For Linkability and Inference attacks, PETsARD automatically handles missing values:

  • Categorical columns: Missing values are filled with the string “missing”
  • Numerical columns: Columns are converted to float64 and missing values are filled with -999999

This ensures compatibility with anonymeter’s evaluation functions, which require consistent data types for numba JIT compilation.
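
A minimal sketch of this preprocessing on a pandas DataFrame is shown below; it is illustrative only, since PETsARD applies the equivalent handling internally:

import pandas as pd

def fill_missing_for_anonymeter(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative version of the handling described above,
    # not PETsARD's internal implementation.
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical columns: cast to float64, fill missing values with -999999
            df[col] = df[col].astype("float64").fillna(-999999.0)
        else:
            # Categorical columns: fill missing values with the string "missing"
            df[col] = df[col].astype("object").fillna("missing")
    return df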

⚠️

Common Warning and Solutions

If you encounter warnings like:

Reached maximum number of attempts 4000 when generating singling out queries.
Returning 1 instead of the requested 400.
Attack `multivariate` could generate only 1 singling out queries out of the requested 400.

What this means: The data has too few unique combinations to generate enough distinct attack queries. This typically occurs when:

  • The dataset is too small
  • Columns have low cardinality (few unique values)
  • High correlation between columns limits unique combinations

Solutions:

  1. Reduce n_attacks: Set to a smaller value (e.g., 100-500) for small datasets
  2. Increase max_attempts: Allow more attempts to find unique queries (but this increases computation time)
  3. Adjust n_cols: Try using fewer columns per query (e.g., 2 instead of 3)
  4. Accept the limitation: If the warning persists, it indicates the data inherently has limited attack surface, which may actually suggest better privacy protection

Assessment Framework

Anonymeter evaluates privacy risks from three perspectives:

Singling Out Risk

Assesses the possibility of identifying specific individuals within the data. For example: “finding an individual with unique characteristics X, Y, and Z.”

Linkability Risk

Evaluates the possibility of linking records belonging to the same individual across different datasets. For example: “determining that records A and B belong to the same person.”

For handling mixed data types, this assessment uses Gower’s Distance:

  • Numerical variables: Normalized absolute difference
  • Categorical variables: Distance of 1 if unequal
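
As an illustration, a minimal sketch of Gower's distance between two records, assuming numerical columns are normalized by their observed range (this is not Anonymeter's internal implementation; the column names and range below are hypothetical):

import numpy as np
import pandas as pd

def gower_distance(rec_a: pd.Series, rec_b: pd.Series, num_ranges: dict) -> float:
    # num_ranges maps each numerical column to its observed (max - min) range;
    # all other columns are treated as categorical.
    parts = []
    for col in rec_a.index:
        if col in num_ranges:
            # Numerical: normalized absolute difference
            parts.append(abs(rec_a[col] - rec_b[col]) / num_ranges[col])
        else:
            # Categorical: 0 if equal, 1 otherwise
            parts.append(0.0 if rec_a[col] == rec_b[col] else 1.0)
    return float(np.mean(parts))

a = pd.Series({"age": 39, "education": "Bachelors"})
b = pd.Series({"age": 50, "education": "Masters"})
print(gower_distance(a, b, num_ranges={"age": 73}))  # (11/73 + 1) / 2 ≈ 0.575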

Inference Risk

Measures the possibility of inferring attributes from known characteristics. For example: “determining characteristic Z for individuals with characteristics X and Y.”

Evaluation Metrics

| Metric | Description | Range | Recommended Standard |
|--------|-------------|-------|----------------------|
| risk | Privacy risk score. Calculation: (main attack rate - control attack rate) / (1 - control attack rate) | 0-1 | < 0.09¹ |
| attack_rate | Main privacy attack rate (success rate of inferring training data using synthetic data) | 0-1 | - |
| baseline_rate | Baseline privacy attack rate (success rate baseline for random guessing) | 0-1 | - |
| control_rate | Control privacy attack rate (success rate of inferring control data using synthetic data) | 0-1 | - |

Risk Calculation Details

Privacy Risk Score Formula

The privacy risk score quantifies the additional risk introduced by synthetic data:

$$ \text{Privacy Risk} = \frac{\text{Attack Rate}_{\text{Main}} - \text{Attack Rate}_{\text{Control}}}{1 - \text{Attack Rate}_{\text{Control}}} $$

This formula measures:

  • Numerator: Additional risk introduced by synthetic data (relative to control group)
  • Denominator: Maximum effect of ideal attack (relative to control group)

Scores range from 0-1, with higher values indicating greater privacy risk.
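
As a worked example with hypothetical rates (not taken from any real run):

def privacy_risk(attack_rate_main: float, attack_rate_control: float) -> float:
    # Additional risk of the main attack, relative to the maximum possible
    # improvement over the control attack (formula above).
    return (attack_rate_main - attack_rate_control) / (1 - attack_rate_control)

print(round(privacy_risk(0.30, 0.10), 3))  # (0.30 - 0.10) / 0.90 = 0.222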

Attack Success Rate Calculation

The attack success rate is estimated with the Wilson score adjustment, which gives a more reliable estimate when the number of attacks is small:

$$ \text{Attack Rate} = \frac{N_{\text{Success}} + \frac{Z^2}{2}}{N_{\text{Total}} + Z^2} $$

Where:

  • N_Success: Number of successful attacks
  • N_Total: Total number of attacks
  • Z: Z-score for 95% confidence level (1.96)
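
A minimal sketch of this estimate with hypothetical counts (Anonymeter computes this internally):

def wilson_attack_rate(n_success: int, n_total: int, z: float = 1.96) -> float:
    # Wilson score point estimate of the attack success rate (formula above).
    return (n_success + z**2 / 2) / (n_total + z**2)

print(round(wilson_attack_rate(120, 400), 3))  # ≈ 0.302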

Three Types of Attack Rates

  1. Main Attack Rate: Success rate of using synthetic data to infer original training data

  2. Baseline Attack Rate: Success rate of random guessing

    • If the main attack rate is at or below the baseline rate, the result should not be relied upon (a minimal check is sketched after this list)
    • Possible causes: insufficient attack attempts, limited auxiliary information, or data characteristic issues
  3. Control Attack Rate: Success rate of using synthetic data to infer control group data (holdout set)
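
Putting these together, a small sketch of the validity check described above (all numbers hypothetical):

attack_rate, baseline_rate = 0.302, 0.148  # hypothetical values

# A result is only meaningful when the main attack outperforms random guessing.
if attack_rate <= baseline_rate:
    print("Result not reliable: main attack rate is at or below the baseline.")
else:
    print("Result usable: compute the privacy risk as shown above.")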

References

  1. Personal Data Protection Commission Singapore. (2023). Proposed guide on synthetic data generation. https://www.pdpc.gov.sg/-/media/files/pdpc/pdf-files/other-guides/proposed-guide-on-synthetic-data-generation.pdf

  2. Article 29 Working Party. (2014). Opinion 05/2014 on Anonymisation Techniques (WP216). https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf

  3. Anonymeter GitHub Repository. https://github.com/statice/anonymeter

  4. French Data Protection Authority (CNIL). https://www.cnil.fr/en/home

Notes

⚠️
The risk calculated by Anonymeter is only one way of evaluating each attack mode, and a score of 0.0 does not mean zero risk. To guard against potential "Harvest Now, Decrypt Later" (HNDL) risks, interpret results with caution.
  • If the main attack rate is at or below the baseline attack rate, the evaluation result should not be relied upon
  • Possible causes: insufficient attacks, inadequate auxiliary information, or data characteristic issues
  • Combine synthetic data release with other protection measures rather than relying on this evaluation alone