Glossary

Technical terms and definitions used in PETsARD documentation (alphabetically ordered).

A

Adapter: In PETsARD, refers to standardized execution wrappers that provide consistent execution interfaces for all modules, a core design pattern of the PETsARD architecture.
Adult Income Dataset: UCI Machine Learning Repository census income dataset containing 48,842 records and 15 fields, one of PETsARD’s standard benchmark datasets.
Anonymeter: Privacy risk assessment tool developed by Statice and recognized by the French data protection authority (CNIL) as compliant with EU anonymization standards. Evaluates singling out, linkability, and inference risks.

B

Balanced Accuracy: Classification accuracy metric that accounts for class imbalance.
Benchmark Dataset: Standard datasets used for testing and validation, such as Adult Income Dataset.
Bimodal Distribution: Probability distribution with two peaks, indicating data has two main concentration areas, presenting challenges for statistical modeling.
Boundary Adherence: Whether numeric or date fields remain within the upper and lower bounds of the original data.
Brier Score: Metric evaluating the accuracy of probabilistic predictions as the mean squared difference between predicted probabilities and actual outcomes; lower is better.
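
As an illustration of the Balanced Accuracy and Brier Score entries above, here is a minimal standard-library sketch. The function names and sample values are hypothetical and are not part of the PETsARD API; PETsARD's own evaluation code may compute these metrics differently.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class carries equal weight."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical imbalanced example: one positive, three negatives.
y_true = [1, 0, 0, 0]
y_pred = [1, 0, 0, 1]
y_prob = [0.9, 0.1, 0.2, 0.6]

print(balanced_accuracy(y_true, y_pred))
print(brier_score(y_true, y_prob))
```

Plain accuracy on this example is 0.75, while balanced accuracy averages the recall of each class separately, which is why it is preferred for imbalanced data.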

C

Cardinality: The number of distinct values in a categorical field. High cardinality indicates many distinct category values; low cardinality indicates few.
Classification: Machine learning task type for predicting category labels.
Clustering: Unsupervised learning task in machine learning for grouping data.
CNIL: Commission Nationale de l’Informatique et des Libertés. French National Commission on Informatics and Liberty, the French data protection authority that recognized Anonymeter as compliant with EU anonymization standards in 2023.
Cohen’s Kappa: Metric measuring classification agreement, accounting for chance agreement.
Confidence Interval: In statistics, the range within which the true value of a parameter is estimated to lie, representing the degree of uncertainty in the estimation.
Config: In PETsARD, refers to the system component managing experiment settings, responsible for processing YAML configuration files and coordinating module execution.
Constrainer: In PETsARD, refers to the system module that performs constraint checking and enforcement.
Constraints: In PETsARD, refers to the rule system ensuring synthetic data complies with business specifications, including column constraints, column combination constraints, and missing value group constraints.
Contingency Table: Cross-tabulation showing relationships between two categorical variables.
Control Data: In PETsARD evaluation, refers to data not used for synthesis, retained as an independent test set.
Control Group: Baseline dataset used for comparison in privacy risk assessment.
Copula Function: Function describing the dependence structure among variables in a multivariate distribution, separately from their marginal distributions. A Gaussian Copula takes this dependence structure from a multivariate Gaussian distribution.
CopulaGAN: Hybrid synthesis method provided by SDV, combining Copula statistical methods with GAN deep learning technology, balancing quality and efficiency.
Correlation Coefficient: Statistic measuring the strength of linear relationship between two variables.
Cross-Validation: Technique for assessing model generalization ability by splitting data into multiple folds for training and validation.
CSV: Comma-Separated Values. File format for comma-delimited data, one of the primary data input/output formats supported by PETsARD.
CTGAN: Conditional Tabular GAN. Tabular data synthesis method using generative adversarial networks, specialized for mixed-type data.
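
The Cohen's Kappa entry above can be made concrete with a short sketch. This is a standard-library illustration with a hypothetical function name, not PETsARD code. It compares observed agreement with the agreement expected by chance from the label frequencies.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both sequences pick the same label
    # if each labelled independently with its own label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    # Undefined (division by zero) when both sequences use a single identical label.
    return (observed - expected) / (1 - expected)


print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

A value of 1 means perfect agreement, and 0 means agreement no better than chance.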

D

DataCebo: Development company of SDV (Synthetic Data Vault), proposing innovative technologies such as Uniform Encoding.
Default Evaluation: Default evaluation methods provided by the PETsARD system, including basic metrics for privacy, fidelity, and utility dimensions.
Default Synthesis: Default synthetic data generation method provided by the PETsARD system, using built-in Gaussian Copula implementation (petsard-gaussian_copula).
Denormalization: Database processing technique merging multiple related tables into a single wide table, used to simplify multi-table data synthesis.
Describer: In PETsARD, refers to the system module that analyzes and describes statistical properties of data, generating data overview reports.
Differential Privacy: Mathematically defined privacy protection method ensuring individual information is not leaked.
Direct Identifier: Fields that can directly identify individual identity, such as ID numbers, names, etc.
Discretization: Preprocessing technique converting continuous values to categorical data, such as K-bins discretization.
Docker: Containerization technology. PETsARD provides Docker images for rapid deployment.
Docker Compose: Tool for defining and running multi-container Docker applications, used to coordinate PETsARD services.
Domain Transfer: In PETsARD MLUtility, evaluating deployment performance of models trained on synthetic data when applied to real data.
Dual Model Control: Experimental design training models using both original and synthetic data, comparing performance on control data.

E

Eigenvalue Decomposition: Mathematical operation decomposing a matrix into eigenvectors and eigenvalues. Ledoit-Wolf regularization avoids this operation to improve efficiency.
Evaluator: In PETsARD, refers to the system module executing privacy, fidelity, and utility evaluations, integrating third-party tools like Anonymeter and SDMetrics.
Excel: Microsoft Excel spreadsheet format (.xlsx, .xls), one of the data input formats supported by PETsARD. Requires the openpyxl package.
Executor: In PETsARD, refers to the main execution interface of the pipeline, the core control component coordinating execution flow of all modules.
Experiment Repetition: Mechanism in PETsARD for executing the same experiment multiple times to ensure result reliability.
Experiment Tuple: In PETsARD, an identification pair composed of module name and experiment name, formatted as (module_name, experiment_name).
External Synthesis: Synthetic data generated using tools or methods outside of PETsARD. PETsARD can load these externally generated synthetic data and evaluate quality using built-in assessment framework to help compare different synthesis methods.

F

F1 Score: Harmonic mean of precision and recall, comprehensive evaluation metric for classification tasks.
FD Rule: Freedman-Diaconis Rule. Histogram bin width selection rule with formula 2 × IQR / n^(1/3), suitable for large sample data.
Fidelity: Degree of similarity in statistical distribution between synthetic and original data.
FNR: False Negative Rate. Proportion of actual positives predicted as negative.
FPR: False Positive Rate. Proportion of actual negatives predicted as positive.
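
The FD Rule formula above (bin width = 2 × IQR / n^(1/3)) can be sketched as follows. The function name is hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`; other quantile conventions give slightly different widths, and PETsARD's own implementation may differ.

```python
import statistics


def fd_bin_width(values):
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return 2 * iqr / len(values) ** (1 / 3)


sample = list(range(1, 9))  # hypothetical data: 1, 2, ..., 8
width = fd_bin_width(sample)
print(width)
```

Because the width shrinks only with the cube root of the sample size, the rule yields progressively finer bins as the dataset grows.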

G

GAN: Generative Adversarial Network. A deep learning architecture composed of a generator and discriminator, used for generating high-quality synthetic data.
Gaussian Copula: Built-in default synthesis method used by PETsARD, preserving correlation structure between data using Gaussian distribution and Copula functions. Accelerated using Numba JIT compilation technology.
GHCR: GitHub Container Registry. Container image storage service provided by GitHub, hosting PETsARD Docker images.
Granularity: In PETsARD reports, refers to levels of evaluation results, including global, columnwise, pairwise, details, tree, etc.

H

HDF5: Hierarchical Data Format 5. File format for storing large-scale numerical datasets in a hierarchical structure.
Histogram: Chart grouping numerical data into intervals and displaying frequency, used for distribution comparison in fidelity assessment.
HMA: Hierarchical Modeling Algorithm. Multi-table synthesis method provided by SDV that recursively models parent-child relationships between tables, with limitations on dataset scale and complexity.
Hyperparameters: External settings controlling the model training process, such as learning rate, batch size.

I

Inference Risk: Degree of risk that sensitive information can be inferred, assessing whether other attributes can be deduced from known information.
Isolation Forest: Decision tree-based anomaly detection algorithm for identifying outliers in data, one of the outlier handling methods supported by PETsARD.

J

Jaccard Score: Metric measuring set similarity, used for classification task evaluation.
Jensen-Shannon Divergence: Symmetric metric measuring differences between two probability distributions.
JIT: Just-In-Time Compilation. Technique that compiles code at run time rather than ahead of time. Numba uses JIT to compile Python code into machine code for improved execution performance.
JSON: JavaScript Object Notation. Lightweight data interchange format, one of the configuration and data formats supported by PETsARD.
Jupyter Lab: Interactive development environment supporting Notebook, code editing, and data visualization, optionally included in PETsARD Docker images.
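
To illustrate the Jensen-Shannon Divergence entry above, here is a minimal sketch for two discrete distributions given as probability lists. The function names are hypothetical. Base-2 logarithms are an assumption made here so that the result falls between 0 and 1.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Average KL divergence of p and q from their midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(js_divergence(p, q))
print(js_divergence(q, p))  # same value: the measure is symmetric
```

Unlike the plain KL divergence, this quantity stays finite even when one distribution assigns zero probability to a category the other uses.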

K

K-S Test: Kolmogorov-Smirnov Test. Statistical test comparing differences between two empirical distributions.

L

Label Encoding: Encoding method converting category values to consecutive integers, suitable for ordinal categorical variables.
Ledoit-Wolf Regularization: Shrinkage estimation method for covariance matrix with formula Σ_reg = (1-λ)Σ + λI, used to improve correlation estimation for small sample or high-dimensional data.
Linkability Risk: Degree of risk that records from different data sources can be linked, assessing whether the same individual can be linked across different datasets.
Loader: In PETsARD, refers to the system module responsible for reading and loading data, supporting various file formats and benchmark datasets.
Log Loss: Logarithmic loss. Metric measuring the accuracy of a classification model's probability predictions; lower is better.
Log Transformation: Preprocessing technique using logarithmic functions to transform skewed distribution data.
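
The Ledoit-Wolf formula above, Σ_reg = (1-λ)Σ + λI, can be applied directly once λ is known. The sketch below takes λ as a fixed input, which is an assumption for illustration: the actual Ledoit-Wolf method estimates the shrinkage intensity from the data. The function name and sample matrix are hypothetical.

```python
def shrink_covariance(sigma, lam):
    """Apply Sigma_reg = (1 - lam) * Sigma + lam * I to a square matrix."""
    n = len(sigma)
    return [
        [(1 - lam) * sigma[i][j] + (lam if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical correlation matrix with a strong off-diagonal entry.
sigma = [[1.0, 0.9],
         [0.9, 1.0]]
print(shrink_covariance(sigma, lam=0.2))
```

Shrinking pulls the off-diagonal entries toward zero while keeping the unit diagonal, which makes the estimate better conditioned when samples are few or dimensions are many.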

M

MAE: Mean Absolute Error. Evaluation metric for regression tasks, the average absolute difference between predicted and actual values.
MCC: Matthews Correlation Coefficient. Classification evaluation metric comprehensively considering all elements of the confusion matrix.
Metadata: Information describing data characteristics, including field types, distributions, constraints, etc.
Metadater: In PETsARD, refers to the core system component responsible for managing and maintaining data schema information, uniformly handling metadata requirements for all modules.
Missing Value Handling: Techniques for handling missing data, including deletion, mean imputation, mode imputation, median imputation, etc.
MLUtility: Module in PETsARD evaluating the machine learning utility of synthetic data. V1 (deprecated) evaluated multiple models simultaneously; V2 uses XGBoost (classification/regression) and K-means (clustering).
Model Parameters: Internal configuration of machine learning models, such as neural network weights.
mpUCCs: Maximal Partial Unique Column Combinations. Theoretical foundation for advanced singling out risk assessment.
MSE: Mean Squared Error. Evaluation metric for regression tasks, the average squared difference between predicted and actual values.
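
The regression metrics MAE and MSE above, together with RMSE (the square root of MSE), can be sketched in a few lines. The function names and sample values are hypothetical and only illustrate the definitions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Squaring the errors makes MSE and RMSE more sensitive to large individual errors than MAE.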

N

NaN Groups: Not a Number Groups. Constraint rules for handling missing values, including operations like deletion, filling, or copying.
Naming Strategy: Setting in PETsARD Reporter controlling output filename format, including traditional and compact modes.
Non-parametric Estimation: Statistical method that does not assume data follows a specific probability distribution, more flexible but with higher computational cost.
NPV: Negative Predictive Value. Proportion of actual negatives among predicted negatives.
Numba: Python JIT compiler that compiles numerical computation code into machine code for significantly improved execution speed. PETsARD’s Gaussian Copula implementation uses Numba for acceleration.
NumPy: Python’s core numerical computation library providing high-performance multi-dimensional array operations, one of PETsARD’s fundamental dependencies.

O

One-Hot Encoding: Encoding method converting each category value to independent binary features, suitable for unordered categorical variables.
OpenDocument: Open document format (.ods, .odf, .odt), a data input format supported by PETsARD. Requires the openpyxl package.
openpyxl: Python library for reading and writing Excel and OpenDocument format files, a necessary dependency for PETsARD to support these formats.
Original Data: In PETsARD, refers to the dataset used for training synthetic models, which may be real data or processed data.
Outlier: Extreme values deviating from normal data distribution.
Outlier Handling: Techniques for identifying and handling anomalous values in data, including Z-score, IQR, LOF methods, etc.

P

Pandas: Python data analysis library providing data structures like DataFrame, one of PETsARD’s core dependencies.
Parametric Statistics: Statistical method assuming data follows a specific probability distribution (such as normal distribution), as opposed to non-parametric statistics.
Parquet: Columnar binary file format suitable for efficient access of large datasets.
PETsARD: Privacy-Enhanced Technology for Synthetic Assessment Reporting and Decision. Open-source synthetic data evaluation framework developed by the National Institute of Cyber Security.
Postprocessing: In PETsARD, refers to restoration processing steps after synthetic data generation, converting preprocessed data back to original format.
PPV: Positive Predictive Value. Also known as precision, proportion of actual positives among predicted positives.
PR AUC: Precision-Recall Area Under the Curve. Evaluation metric suitable for imbalanced datasets.
Precision: Proportion of actual positives among predicted positives.
Preprocessing: In PETsARD, refers to preparation processing steps before data synthesis, including missing value handling, outlier handling, encoding, scaling, etc.
Primary Key: Field uniquely identifying each record in a data table.
Privacy Protection: Degree of preventing individual information leakage after data processing.
Python: Programming language used by PETsARD, providing rich data science and machine learning ecosystem.
PyTorch: Deep learning framework used by PETsARD for GPU-accelerated large-scale matrix operations and deep learning model training.

Q

Quasi-identifier: QID. Fields that are not direct identifiers but may identify individuals when combined.

R

R² Score: Coefficient of determination, metric measuring variance explained by regression models.
Recall: Proportion correctly predicted among actual positives, also known as sensitivity.
Regression: Machine learning task type for predicting continuous values.
Regularization: Technique in statistics and machine learning for reducing model complexity and improving generalization ability. Ledoit-Wolf regularization is used for covariance matrix estimation.
Reporter: In PETsARD, refers to the system module generating and storing experiment result reports, supporting multiple output formats.
RMSE: Root Mean Squared Error. Evaluation metric for regression tasks, the square root of MSE.
ROC AUC: Receiver Operating Characteristic Area Under the Curve. Comprehensive performance metric for classification models.

S

Scaling: Preprocessing techniques adjusting numerical ranges, including standardization, min-max scaling, time-anchored scaling, etc.
Schema: Metadata defining data structure, including field names, data types, constraints, and relationships. In PETsARD, used to track structural changes of data throughout the processing pipeline.
Scikit-learn: Abbreviated as sklearn. Python machine learning library providing classification, regression, clustering algorithms, used by PETsARD for machine learning utility evaluation.
SDMetrics: Evaluation tool in the SDV ecosystem for assessing synthetic data quality, fidelity, and diagnostic reports.
SDV: Synthetic Data Vault. Third-party synthetic data package, optionally supported by PETsARD (requires separate installation: pip install 'sdv>=1.26.0,<2', for reference only).
Sensitivity: Also known as recall, proportion correctly predicted among actual positives.
Silhouette Coefficient: Metric evaluating clustering quality, ranging from -1 to 1.
Singling Out Risk: Degree of risk that individual records can be uniquely identified, assessing whether specific individuals can be identified from data.
SMOTE: Synthetic Minority Over-sampling Technique for handling imbalanced data.
SMOTE-ENN: Imbalanced data processing method combining SMOTE with Edited Nearest Neighbors, oversampling followed by boundary sample cleaning.
SMOTE-Tomek: Imbalanced data processing method combining SMOTE with Tomek Links, oversampling followed by Tomek link removal.
Specificity: Proportion correctly predicted as negative among actual negatives.
Splitter: In PETsARD, refers to the system module splitting data into training and validation sets, supporting multiple splits required for privacy assessment.
SQL: Structured Query Language. Language for database operations and data processing.
Statice: Development company of Anonymeter, focusing on privacy protection and synthetic data technology.
Sturges’ Rule: Histogram bin number selection rule with formula log₂(n) + 1, suitable for small sample data.
Synthetic Data: Artificial data generated through machine learning models, preserving statistical properties of original data without containing real individual information.
Synthesizer: In PETsARD, refers to the core system module generating synthetic data, integrating various synthesis algorithms from SDV, custom implementations, etc.
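
The Sturges' Rule formula above, log₂(n) + 1, gives a number of bins rather than a bin width. In the sketch below the logarithm is rounded up so the result is a whole number; that rounding is an assumption made here, and the function name is hypothetical.

```python
import math


def sturges_bins(n):
    """Number of histogram bins for n samples: ceil(log2(n)) + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(math.log2(n)) + 1


for n in (100, 1024):
    print(n, sturges_bins(n))
```

The bin count grows only logarithmically with the sample size, which is why the rule is described as suitable for small samples.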

T

Threshold: Decision boundary value used to convert continuous predictions to category labels.
Time Anchoring: Method for handling multi-timepoint data, setting the most important time field as anchor and converting other timepoints to relative time differences.
Total Variation Distance: TVD. Statistic measuring differences between two probability distributions.
TSV: Tab-Separated Values. Tab-delimited file format, one of the data input formats supported by PETsARD.
TVAE: Tabular Variational Autoencoder. Tabular data synthesis method using variational autoencoders, focusing on data distribution characteristics.
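
The Time Anchoring entry above can be illustrated on a single record. This is a sketch under stated assumptions, not PETsARD's implementation: the field names are hypothetical, the anchor is chosen by the caller, and the relative differences are expressed in days.

```python
from datetime import date


def anchor_times(record, anchor_field, other_fields):
    """Keep the anchor date and replace other dates with day offsets from it."""
    anchor = record[anchor_field]
    anchored = dict(record)  # copy so the input record is left unchanged
    for field in other_fields:
        anchored[field] = (record[field] - anchor).days
    return anchored


# Hypothetical record with three time points.
record = {
    "admitted": date(2024, 1, 10),
    "discharged": date(2024, 1, 15),
    "follow_up": date(2024, 2, 9),
}
result = anchor_times(record, "admitted", ["discharged", "follow_up"])
print(result)
```

After anchoring, only one absolute date remains per record, and the other time points become small numeric offsets that preserve their order relative to the anchor.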

U

UCI: University of California, Irvine. Its machine learning repository provides multiple standard datasets, including Adult Income Dataset.
Uniform Encoding: Categorical variable processing method proposed by DataCebo, mapping discrete category values to the continuous [0,1] interval while preserving statistical properties of the category distribution.
UTF-8: Unicode Transformation Format - 8-bit. Default character encoding used by PETsARD, supporting multilingual text processing.
Utility: Performance capability of synthetic data in machine learning tasks.
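
One way to picture the Uniform Encoding entry above: give each category a sub-interval of [0,1] whose width equals its relative frequency, encode a value as a random number inside its interval, and decode by interval lookup. The sketch below follows that general idea only. All names are hypothetical, and it is not the SDV or PETsARD implementation.

```python
import random
from collections import Counter


def fit_intervals(values):
    """Assign each category a sub-interval of [0, 1] sized by its frequency."""
    counts = Counter(values)
    intervals, lower = {}, 0.0
    for category in sorted(counts):
        upper = lower + counts[category] / len(values)
        intervals[category] = (lower, upper)
        lower = upper
    return intervals


def encode(value, intervals, rng):
    """Draw a number uniformly from the category's interval."""
    lower, upper = intervals[value]
    return rng.uniform(lower, upper)


def decode(number, intervals):
    """Map a number in [0, 1] back to the category whose interval contains it."""
    for category, (lower, upper) in intervals.items():
        if lower <= number < upper:
            return category
    # Numbers at or beyond the last upper bound fall into the last category.
    return max(intervals, key=lambda c: intervals[c][1])


values = ["a", "a", "b", "c"]
intervals = fit_intervals(values)
rng = random.Random(0)  # seeded for reproducibility
encoded = [encode(v, intervals, rng) for v in values]
print(intervals)
print([decode(x, intervals) for x in encoded])
```

Because interval widths match category frequencies, the encoded column is approximately uniform on [0,1], which is the property the entry refers to as preserving the category distribution.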

V

VAE: Variational Autoencoder. A generative model that learns latent representations of data through an encoder and decoder. TVAE is based on this architecture.
Validity: Degree to which data accurately reflects fundamental characteristics and structure.

X

XGBoost: eXtreme Gradient Boosting. Gradient boosting decision tree algorithm used for classification and regression tasks in PETsARD MLUtility V2.

Y

YAML: YAML Ain’t Markup Language. Human-readable data serialization format used by PETsARD as the primary configuration file format.
