Glossary

Glossary

Technical terms and definitions used in PETsARD documentation (alphabetically ordered).

A

  • Adapter: In PETsARD, refers to standardized execution wrappers that provide consistent execution interfaces for all modules, a core design pattern of the PETsARD architecture.
  • Adult Income Dataset: UCI Machine Learning Repository census income dataset containing 48,842 records and 15 fields, one of PETsARD’s standard benchmark datasets.
  • Anonymeter: Privacy risk assessment tool whose developer Statice is recognized by the French Data Protection Authority (CNIL) as compliant with EU regulations, evaluating singling out, linkability, and inference risks.

B

  • Balanced Accuracy: Classification accuracy metric that accounts for class imbalance.
  • Benchmark Dataset: Standard datasets used for testing and validation, such as Adult Income Dataset.
  • Bimodal Distribution: Probability distribution with two peaks, indicating data has two main concentration areas, presenting challenges for statistical modeling.
  • Boundary Adherence: Whether numeric or date fields remain within the upper and lower bounds of the original data.
  • Brier Score: Metric evaluating the accuracy of probabilistic predictions, lower is better.

C

  • Cardinality: The number of distinct values in a categorical field. High cardinality indicates many category values, low cardinality indicates fewer.
  • Classification: Machine learning task type for predicting category labels.
  • Clustering: Unsupervised learning task in machine learning for grouping data.
  • CNIL: Commission Nationale de l’Informatique et des Libertés. French National Commission on Informatics and Liberty, the French data protection authority that recognized Anonymeter as compliant with EU anonymization standards in 2023.
  • Cohen’s Kappa: Metric measuring classification agreement, accounting for chance agreement.
  • Confidence Interval: In statistics, the range within which the true value of a parameter is estimated to lie, representing the degree of uncertainty in the estimation.
  • Config: In PETsARD, refers to the system component managing experiment settings, responsible for processing YAML configuration files and coordinating module execution.
  • Constrainer: In PETsARD, refers to the system module that performs constraint checking and enforcement.
  • Constraints: In PETsARD, refers to the rule system ensuring synthetic data complies with business specifications, including column constraints, column combination constraints, and missing value group constraints.
  • Contingency Table: Cross-tabulation showing relationships between two categorical variables.
  • Control Data: In PETsARD evaluation, refers to data not used for synthesis, retained as an independent test set.
  • Control Group: Baseline dataset used for comparison in privacy risk assessment.
  • Copula Function: Used to describe the dependence structure among variables in a multivariate distribution. Gaussian Copula uses Gaussian distribution as the copula function.
  • CopulaGAN: Hybrid synthesis method provided by SDV, combining Copula statistical methods with GAN deep learning technology, balancing quality and efficiency.
  • Correlation Coefficient: Statistic measuring the strength of linear relationship between two variables.
  • Cross-Validation: Technique for assessing model generalization ability by splitting data into multiple folds for training and validation.
  • CSV: Comma-Separated Values. File format for comma-delimited data, one of the primary data input/output formats supported by PETsARD.
  • CTGAN: Conditional Tabular GAN. Tabular data synthesis method using generative adversarial networks, specialized for mixed-type data.

D

  • Datacebo: Development company of SDV (Synthetic Data Vault), proposing innovative technologies such as Uniform Encoding.
  • Default Evaluation: Default evaluation methods provided by the PETsARD system, including basic metrics for privacy, fidelity, and utility dimensions.
  • Default Synthesis: Default synthetic data generation method provided by the PETsARD system, using SDV’s Gaussian Copula model.
  • Denormalization: Database processing technique merging multiple related tables into a single wide table, used to simplify multi-table data synthesis.
  • Describer: In PETsARD, refers to the system module that analyzes and describes statistical properties of data, generating data overview reports.
  • Differential Privacy: Mathematically defined privacy protection method ensuring individual information is not leaked.
  • Direct Identifier: Fields that can directly identify individual identity, such as ID numbers, names, etc.
  • Discretization: Preprocessing technique converting continuous values to categorical data, such as K-bins discretization.
  • Docker: Containerization technology, PETsARD provides Docker images for rapid deployment.
  • Docker Compose: Tool for defining and running multi-container Docker applications, used to coordinate PETsARD services.
  • Domain Transfer: In PETsARD MLUtility, evaluating deployment performance of models trained on synthetic data when applied to real data.
  • Dual Model Control: Experimental design training models using both original and synthetic data, comparing performance on control data.

E

  • Eigenvalue Decomposition: Mathematical operation decomposing a matrix into eigenvectors and eigenvalues. Ledoit-Wolf regularization avoids this operation to improve efficiency.
  • Evaluator: In PETsARD, refers to the system module executing privacy, fidelity, and utility evaluations, integrating third-party tools like Anonymeter and SDMetrics.
  • Excel: Microsoft Excel spreadsheet format (.xlsx, .xls), one of the data input formats supported by PETsARD, requires openpyxl package installation.
  • Executor: In PETsARD, refers to the main execution interface of the pipeline, the core control component coordinating execution flow of all modules.
  • Experiment Repetition: Mechanism in PETsARD for executing the same experiment multiple times to ensure result reliability.
  • Experiment Tuple: In PETsARD, an identification pair composed of module name and experiment name, formatted as (module_name, experiment_name).
  • External Synthesis: Synthetic data generated using tools or methods outside of PETsARD. PETsARD can load these externally generated synthetic data and evaluate quality using built-in assessment framework to help compare different synthesis methods.

F

  • F1 Score: Harmonic mean of precision and recall, comprehensive evaluation metric for classification tasks.
  • FD Rule: Freedman-Diaconis Rule. Histogram bin width selection rule with formula 2 × IQR / n^(1/3), suitable for large sample data.
  • Fidelity: Degree of similarity in statistical distribution between synthetic and original data.
  • FNR: False Negative Rate. Proportion of actual positives predicted as negative.
  • FPR: False Positive Rate. Proportion of actual negatives predicted as positive.

G

  • GAN: Generative Adversarial Network. A deep learning architecture composed of a generator and discriminator, used for generating high-quality synthetic data.
  • Gaussian Copula: Default synthesis method used by PETsARD, preserving correlation structure between data using Gaussian distribution and Copula functions.
  • GHCR: GitHub Container Registry. Container image storage service provided by GitHub, hosting PETsARD Docker images.
  • Granularity: In PETsARD reports, refers to levels of evaluation results, including global, columnwise, pairwise, details, tree, etc.

H

  • HDF5: Hierarchical Data Format 5. Hierarchical data format for storing large-scale numerical datasets.
  • Histogram: Chart grouping numerical data into intervals and displaying frequency, used for distribution comparison in fidelity assessment.
  • HMA: Hierarchical Modeling Algorithm. Hierarchical modeling algorithm provided by SDV, using recursive techniques to model parent-child relationships in multi-table datasets, but with scale and complexity limitations.
  • Hyperparameters: External settings controlling the model training process, such as learning rate, batch size.

I

  • Inference Risk: Degree of risk that sensitive information can be inferred, assessing whether other attributes can be deduced from known information.
  • Isolation Forest: Decision tree-based anomaly detection algorithm for identifying outliers in data, one of the outlier handling methods supported by PETsARD.

J

  • Jaccard Score: Metric measuring set similarity, used for classification task evaluation.
  • Jensen-Shannon Divergence: Symmetric metric measuring differences between two probability distributions.
  • JIT: Just-In-Time Compilation. Compilation technique where Numba uses JIT to compile Python code into machine code for improved execution performance.
  • JSON: JavaScript Object Notation. Lightweight data interchange format, one of the configuration and data formats supported by PETsARD.
  • Jupyter Lab: Interactive development environment supporting Notebook, code editing, and data visualization, optionally included in PETsARD Docker images.

K

  • K-S Test: Kolmogorov-Smirnov Test. Statistical test comparing differences between two empirical distributions.

L

  • Label Encoding: Encoding method converting category values to continuous integers, suitable for ordinal categorical variables.
  • Ledoit-Wolf Regularization: Shrinkage estimation method for covariance matrix with formula Σ_reg = (1-λ)Σ + λI, used to improve correlation estimation for small sample or high-dimensional data.
  • Linkability Risk: Degree of risk that records from different data sources can be linked, assessing whether the same individual can be linked across different datasets.
  • Loader: In PETsARD, refers to the system module responsible for reading and loading data, supporting various file formats and benchmark datasets.
  • Log Loss: Logarithmic loss, metric measuring accuracy of classification model probability predictions.
  • Log Transformation: Preprocessing technique using logarithmic functions to transform skewed distribution data.

M

  • MAE: Mean Absolute Error. Mean absolute error, evaluation metric for regression tasks.
  • MCC: Matthews Correlation Coefficient. Classification evaluation metric comprehensively considering all elements of the confusion matrix.
  • Metadata: Information describing data characteristics, including field types, distributions, constraints, etc.
  • Metadater: In PETsARD, refers to the core system component responsible for managing and maintaining data schema information, uniformly handling metadata requirements for all modules.
  • Missing Value Handling: Techniques for handling missing data, including deletion, mean imputation, mode imputation, median imputation, etc.
  • MLUtility: Module in PETsARD evaluating machine learning utility of synthetic data. V1 version evaluated multiple models simultaneously (deprecated), V2 version uses XGBoost (classification/regression) and K-means (clustering).
  • Model Parameters: Internal configuration of machine learning models, such as neural network weights.
  • mpUCCs: Maximal Partial Unique Column Combinations. Theoretical foundation for advanced singling out risk assessment.
  • MSE: Mean Squared Error. Mean squared error, evaluation metric for regression tasks.

N

  • NaN Groups: Not a Number Groups. Constraint rules for handling missing values, including operations like deletion, filling, or copying.
  • Naming Strategy: Setting in PETsARD Reporter controlling output filename format, including traditional and compact modes.
  • Non-parametric Estimation: Statistical method that does not assume data follows a specific probability distribution, more flexible but with higher computational cost.
  • NPV: Negative Predictive Value. Proportion of actual negatives among predicted negatives.
  • Numba: Python JIT compiler that compiles numerical computation code into machine code for significantly improved execution speed. PETsARD’s Gaussian Copula implementation uses Numba for acceleration.
  • NumPy: Python’s core numerical computation library providing high-performance multi-dimensional array operations, one of PETsARD’s fundamental dependencies.

O

  • One-Hot Encoding: Encoding method converting each category value to independent binary features, suitable for unordered categorical variables.
  • OpenDocument: Open document format (.ods, .odf, .odt), data input format supported by PETsARD, requires openpyxl package installation.
  • openpyxl: Python library for reading and writing Excel and OpenDocument format files, a necessary dependency for PETsARD to support these formats.
  • Original Data: In PETsARD, refers to the dataset used for training synthetic models, which may be real data or processed data.
  • Outlier: Extreme values deviating from normal data distribution.
  • Outlier Handling: Techniques for identifying and handling anomalous values in data, including Z-score, IQR, LOF methods, etc.

P

  • Pandas: Python data analysis library providing data structures like DataFrame, one of PETsARD’s core dependencies.
  • Parametric Statistics: Statistical method assuming data follows a specific probability distribution (such as normal distribution), as opposed to non-parametric statistics.
  • Parquet: Columnar binary file format suitable for efficient access of large datasets.
  • PETsARD: Privacy-Enhanced Technology for Synthetic Assessment Reporting and Decision. Open-source synthetic data evaluation framework developed by the National Institute of Cyber Security.
  • Postprocessing: In PETsARD, refers to restoration processing steps after synthetic data generation, converting preprocessed data back to original format.
  • PPV: Positive Predictive Value. Also known as precision, proportion of actual positives among predicted positives.
  • PR AUC: Precision-Recall Area Under the Curve. Area under Precision-Recall curve, evaluation metric suitable for imbalanced datasets.
  • Precision: Proportion of actual positives among predicted positives.
  • Preprocessing: In PETsARD, refers to preparation processing steps before data synthesis, including missing value handling, outlier handling, encoding, scaling, etc.
  • Primary Key: Field uniquely identifying each record in a data table.
  • Privacy Protection: Degree of preventing individual information leakage after data processing.
  • Python: Programming language used by PETsARD, providing rich data science and machine learning ecosystem.
  • PyTorch: Deep learning framework used by PETsARD for GPU-accelerated large-scale matrix operations and deep learning model training.

Q

  • Quasi-identifier: QID. Fields that are not direct identifiers but may identify individuals when combined.

R

  • R² Score: Coefficient of determination, metric measuring variance explained by regression models.
  • Recall: Proportion correctly predicted among actual positives, also known as sensitivity.
  • Regression: Machine learning task type for predicting continuous values.
  • Regularization: Technique in statistics and machine learning for reducing model complexity and improving generalization ability. Ledoit-Wolf regularization is used for covariance matrix estimation.
  • Reporter: In PETsARD, refers to the system module generating and storing experiment result reports, supporting multiple output formats.
  • RMSE: Root Mean Squared Error. Root mean squared error, evaluation metric for regression tasks.
  • ROC AUC: Receiver Operating Characteristic Area Under the Curve. Area under Receiver Operating Characteristic curve, comprehensive performance metric for classification models.

S

  • Scaling: Preprocessing techniques adjusting numerical ranges, including standardization, min-max scaling, time-anchored scaling, etc.
  • Schema: Metadata defining data structure, including field names, data types, constraints, and relationships. In PETsARD, used to track structural changes of data throughout the processing pipeline.
  • Scikit-learn: Abbreviated as sklearn. Python machine learning library providing classification, regression, clustering algorithms, used by PETsARD for machine learning utility evaluation.
  • SDMetrics: Evaluation tool in the SDV ecosystem for assessing synthetic data quality, fidelity, and diagnostic reports.
  • SDV: Synthetic Data Vault. Open-source synthetic data generation framework providing various synthesis algorithms.
  • Sensitivity: Also known as recall, proportion correctly predicted among actual positives.
  • Silhouette Coefficient: Metric evaluating clustering quality, ranging from -1 to 1.
  • Singling Out Risk: Degree of risk that individual records can be uniquely identified, assessing whether specific individuals can be identified from data.
  • SMOTE: Synthetic Minority Over-sampling Technique for handling imbalanced data.
  • SMOTE-ENN: Imbalanced data processing method combining SMOTE with Edited Nearest Neighbors, oversampling followed by boundary sample cleaning.
  • SMOTE-Tomek: Imbalanced data processing method combining SMOTE with Tomek Links, oversampling followed by Tomek link removal.
  • Specificity: Proportion correctly predicted as negative among actual negatives.
  • Splitter: In PETsARD, refers to the system module splitting data into training and validation sets, supporting multiple splits required for privacy assessment.
  • SQL: Structured Query Language. Language for database operations and data processing.
  • Statice: Development company of Anonymeter, focusing on privacy protection and synthetic data technology.
  • Sturges’ Rule: Histogram bin number selection rule with formula log₂(n) + 1, suitable for small sample data.
  • Synthetic Data: Artificial data generated through machine learning models, preserving statistical properties of original data without containing real individual information.
  • Synthesizer: In PETsARD, refers to the core system module generating synthetic data, integrating various synthesis algorithms from SDV, custom implementations, etc.

T

  • Threshold: Decision boundary value used to convert continuous predictions to category labels.
  • Time Anchoring: Method for handling multi-timepoint data, setting the most important time field as anchor and converting other timepoints to relative time differences.
  • Total Variation Distance: TVD. Statistic measuring differences between two probability distributions.
  • TSV: Tab-Separated Values. Tab-delimited file format, one of the data input formats supported by PETsARD.
  • TVAE: Tabular Variational Autoencoder. Tabular data synthesis method using variational autoencoders, focusing on data distribution characteristics.

U

  • UCI: University of California, Irvine. Its machine learning repository provides multiple standard datasets, including Adult Income Dataset.
  • Uniform Encoding: Categorical variable processing method proposed by Datacebo, mapping discrete category values to continuous [0,1] interval while preserving statistical properties of category distribution.
  • UTF-8: Unicode Transformation Format - 8-bit. Default character encoding used by PETsARD, supporting multilingual text processing.
  • Utility: Performance capability of synthetic data in machine learning tasks.

V

  • VAE: Variational Autoencoder. A generative model that learns latent representations of data through an encoder and decoder. TVAE is based on this architecture.
  • Validity: Degree to which data accurately reflects fundamental characteristics and structure.

X

  • XGBoost: eXtreme Gradient Boosting. Gradient boosting decision tree algorithm used for classification and regression tasks in PETsARD MLUtility V2.

Y

  • YAML: YAML Ain’t Markup Language. Human-readable data serialization format used by PETsARD as the primary configuration file format.