Multi-Table Relationships
Applicable when data is scattered across multiple related tables.
Before synthesis, integrate the related tables into a single wide table using database denormalization techniques rather than relying on still-immature multi-table synthesis techniques. Choose the granularity and aggregation methods according to the downstream task.
If your data is already integrated into a single table, you can skip to Business Logic Constraints.
Case Background
A policy-based financial institution holds rich data on corporate financing, including enterprise basic information, financing applications, and historical records of financial changes across multiple dimensions. The institution hopes to use synthetic data technology to promote innovative cooperation with fintech companies: third parties can develop risk prediction models on this data while data privacy is protected, helping the institution improve its risk management effectiveness.
Data Characteristics
- Complex table structure: Original data is scattered across multiple business system tables, involving enterprise basic data, application records, financial tracking, and other aspects
- Time-series data: Contains multiple key time points (such as application date, approval date, tracking time, etc.) with logical sequential relationships between these time points
Table Relationships and Business Significance
The data structure in this case reflects the complete business process of corporate financing, mainly containing three core data tables:
Enterprise Basic Information Table
- Contains enterprise ID, industry category, sub-industry, geographical location, and capital information
- Each record represents an independent enterprise entity
- This table serves as the master table, forming one-to-many relationships with other tables
Financing Application Record Table
- Records details of each financing application submitted by enterprises to financial institutions
- Includes application type, application date, approval date, application status, and amount information
- An enterprise may have multiple financing applications spanning several years
- Application results are divided into three statuses: approved, rejected, and withdrawn
Financial Tracking Record Table
- Records financial performance tracking data of enterprises after obtaining financing
- Includes profit ratio indicators, tracking time range, revenue indicators, and risk level assessment
- Each financing application may generate multiple tracking records representing financial conditions at different time points
These three data tables form a hierarchical structure: Enterprise Basic Information (1) → Financing Application Records (N) → Financial Tracking Records (N). In the actual business process, an enterprise first establishes its basic file, then submits financing applications, and each application triggers the financial tracking mechanism.
Sample Data Demonstration
Considering data privacy, the following uses simulated data to demonstrate data structure and business logic. Although this data is simulated, it retains the key characteristics and business constraints of the original data:
Enterprise Basic Information
| company_id | industry | sub_industry | city | district | established_date | capital |
|---|---|---|---|---|---|---|
| C000001 | Construction | Environmental Engineering | New Taipei | Banqiao | 2019-11-03 | 19899000 |
| C000002 | Construction | Building Engineering | Taipei | Neihu | 2017-01-02 | 17359000 |
| C000003 | Manufacturing | Metal Processing | Taipei | Neihu | 2012-05-29 | 5452000 |
Financing Application Records
| application_id | company_id | loan_type | apply_date | approval_date | status | amount_requested | amount_approved |
|---|---|---|---|---|---|---|---|
| A00000001 | C000001 | Factory Expansion | 2022-01-21 | 2022-03-19 | approved | 12848000 | 12432000.0 |
| A00000002 | C000001 | Working Capital | 2025-01-05 | 2025-02-11 | approved | 2076000 | 1516000.0 |
| A00000004 | C000002 | Working Capital | 2020-12-12 | NaN | rejected | 5533000 | NaN |
Financial Tracking Records
| application_id | profit_ratio_avg | tracking_months | last_tracking_date | avg_revenue | risk_level |
|---|---|---|---|---|---|
| A00000001 | 0.033225 | 3.0 | 2024-09-04 | 1.840486e+07 | high_risk |
| A00000002 | -0.002636 | 3.0 | 2027-07-31 | 1.926350e+07 | normal |
Why Choose Denormalization Over Multi-Table Synthesis?
Current open-source multi-table synthesis techniques (such as SDV’s HMA) can directly handle multi-table data but have obvious limitations:
- Scale and complexity limitations: Best suited for structures with no more than 5 tables and only one level of parent-child relationships
- Field type limitations: Primarily supports numerical fields; categorical fields require preprocessing
- Constraint handling limitations: Insufficient support for complex business logic constraints
- Quality issues: in practical testing, cross-table correlations are poorly preserved and associations between categorical variables are weak
In contrast, after adopting a denormalization strategy to integrate into a single wide table:
- Can use mature and stable single-table synthesis techniques
- Can fully define business logic constraints
- Synthesis quality is more controllable and predictable
- Clearly preserves business logic relationships
Step 1: Data Exploration and Analysis
Analyze Table Relationship Structure
Draw an entity-relationship (ER) diagram to understand how the tables relate, then verify those relationships against the data (a minimal pandas check is sketched after this list):
- Identify relationships between parent and child tables (one-to-many, many-to-many)
- Confirm primary key/foreign key correspondences
- Observe temporal dependencies in the data
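For example, the following is a minimal pandas sketch for verifying the assumed key relationships. It reuses the CSV file names from the integration code in Step 3; adjust paths and column names to your own environment.

```python
import pandas as pd

companies = pd.read_csv('companies.csv')
applications = pd.read_csv('applications.csv')
tracking = pd.read_csv('tracking.csv')

# Primary keys should be unique within their own tables
assert companies['company_id'].is_unique
assert applications['application_id'].is_unique

# Foreign keys should only reference existing parent records
orphan_apps = ~applications['company_id'].isin(companies['company_id'])
orphan_tracks = ~tracking['application_id'].isin(applications['application_id'])
print(f"Applications without a parent company: {orphan_apps.sum()}")
print(f"Tracking records without a parent application: {orphan_tracks.sum()}")

# Cardinality: how many child records per parent (confirms one-to-many)
print(applications['company_id'].value_counts().describe())
```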
Confirm Downstream Requirements
Discuss analysis objectives with data users (such as data scientists and business analysts) to decide what one row of the final table should represent (a row-count sketch follows this list):
- Enterprise-level analysis: One row per enterprise (one record represents one company)
- Application-level analysis: One row per application (one record represents one application)
- Time-point analysis: One row per specific time point (one record represents a temporal cross-section)
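Reusing the DataFrames from the previous sketch, a quick row count shows how many records each candidate granularity would produce, which helps frame that discussion (again, just a sketch):

```python
# Row count implied by each candidate granularity
print("Enterprise-level (one row per company):    ", companies['company_id'].nunique())
print("Application-level (one row per application):", len(applications))
print("Time-point level (one row per tracking row):", len(tracking))
```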
Step 2: Design Integration Strategy
Choose Appropriate Granularity
Continuing with the enterprise financing case above, if the downstream task focuses on overall risk assessment for each enterprise, the appropriate granularity is “one record per enterprise”. We will integrate the following information:
Enterprise Basic Information (directly included)
- Company ID (company_id)
- Industry category (industry), sub-industry (sub_industry)
- City (city), district (district)
- Capital (capital)
- Establishment date (established_date)

First Application Record (preserve temporal starting point)
- Application date (first_apply_date)
- Loan type (first_apply_loan_type)
- Requested amount (first_apply_amount_requested)
- Approved amount (first_apply_amount_approved)
- Application status (first_apply_status)

Latest Application Record (reflect current status)
- Application date (latest_apply_date)
- Loan type (latest_apply_loan_type)
- Requested amount (latest_apply_amount_requested)
- Approved amount (latest_apply_amount_approved)
- Application status (latest_apply_status)

Latest Financial Tracking (current financial health)
- Tracking date (latest_track_date)
- Average profit ratio (latest_track_profit_ratio_avg)
- Tracking months (latest_track_tracking_months)
- Average revenue (latest_track_avg_revenue)
- Risk level (latest_track_risk_level)
Handle One-to-Many Relationships
For one-to-many relationships, choose an appropriate handling method (a pandas aggregation sketch follows this list):
- Select specific records: such as latest, earliest, maximum, minimum values
- Statistical aggregation: such as calculating average, sum, count
- Expand fields: create independent fields for different time points or states
- Keep multiple records: if downstream tasks require it, mark with sequence numbers
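For example, the statistical-aggregation option could add per-company summary columns computed over all applications. The sketch below assumes the applications table is loaded as a DataFrame (as in the Step 3 code); the aggregate column names are illustrative and not part of the original data.

```python
# Summary statistics over all of a company's applications
# (the column names on the left are illustrative and can be renamed freely)
app_stats = applications.groupby('company_id').agg(
    application_count=('application_id', 'count'),
    total_amount_requested=('amount_requested', 'sum'),
    avg_amount_approved=('amount_approved', 'mean'),
    approval_rate=('status', lambda s: (s == 'approved').mean()),
).reset_index()

# app_stats can later be merged into the wide table alongside the
# first/latest application fields chosen above
```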
Step 3: Execute Denormalization
Method A: Using Python pandas
Suitable for moderate data volumes requiring flexible handling:

```python
import pandas as pd

# Read original data tables
companies = pd.read_csv('companies.csv')
applications = pd.read_csv('applications.csv')
tracking = pd.read_csv('tracking.csv')

# Mark first and latest applications for each company
# (application_id acts as a tie-breaker when two applications share a date)
applications['sort_tuple'] = list(zip(applications['apply_date'], applications['application_id']))

# Find earliest application for each company
min_tuples = applications.groupby('company_id')['sort_tuple'].transform('min')
applications['is_first_application'] = (applications['sort_tuple'] == min_tuples)

# Find latest application for each company
max_tuples = applications.groupby('company_id')['sort_tuple'].transform('max')
applications['is_latest_application'] = (applications['sort_tuple'] == max_tuples)

applications.drop(columns=['sort_tuple'], inplace=True, errors='ignore')

# Join tracking data with application data to get company_id
tracking_w_company = tracking.merge(
    applications[['company_id', 'application_id']],
    how='left',
    on='application_id'
)

# Mark latest financial tracking for each company
tracking_w_company['sort_tuple'] = list(zip(
    tracking_w_company['last_tracking_date'],
    tracking_w_company['application_id']
))
max_tuples = tracking_w_company.groupby('company_id')['sort_tuple'].transform('max')
tracking_w_company['is_latest_tracking'] = (tracking_w_company['sort_tuple'] == max_tuples)
tracking_w_company.drop(columns=['sort_tuple'], inplace=True, errors='ignore')

# Merge enterprise data with first and latest application records
denorm_data = companies.merge(
    applications[applications['is_first_application']].add_prefix('first_apply_'),
    how='left',
    left_on='company_id',
    right_on='first_apply_company_id'
).drop(columns=[
    'first_apply_company_id', 'first_apply_is_first_application', 'first_apply_is_latest_application'
]).merge(
    applications[applications['is_latest_application']].add_prefix('latest_apply_'),
    how='left',
    left_on='company_id',
    right_on='latest_apply_company_id'
).drop(columns=[
    'latest_apply_company_id', 'latest_apply_is_first_application', 'latest_apply_is_latest_application'
])

# Add the latest financial tracking record for each company
denorm_data = denorm_data.merge(
    tracking_w_company[tracking_w_company['is_latest_tracking']].add_prefix('latest_track_'),
    how='left',
    left_on='company_id',
    right_on='latest_track_company_id'
).drop(columns=['latest_track_company_id', 'latest_track_is_latest_tracking'])

# Save integrated wide table
denorm_data.to_csv('denormalized_data.csv', index=False)
```
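Before moving on, it is worth a quick sanity check that the denormalized table really is one row per enterprise; the following sketch assumes the column names used above.

```python
# Sanity check on the integrated wide table
assert denorm_data['company_id'].is_unique, "company_id duplicated after merging"

# Companies with no application or tracking history simply carry NaN in the
# prefixed columns; this is expected behaviour of the left joins above
print(denorm_data.filter(like='latest_track_').isna().mean())
```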
Method B: Using SQL
Suitable for large data volumes already in databases:
```sql
-- Execute denormalization directly in the database
WITH first_applications AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY apply_date) AS rn
    FROM applications
),
latest_applications AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY apply_date DESC) AS rn
    FROM applications
),
latest_tracking AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY application_id ORDER BY last_tracking_date DESC) AS rn
    FROM tracking
)
SELECT
    c.*,
    -- SQL cannot prefix a wildcard, so list the needed columns explicitly
    fa.apply_date         AS first_apply_date,
    fa.loan_type          AS first_apply_loan_type,
    fa.amount_requested   AS first_apply_amount_requested,
    fa.amount_approved    AS first_apply_amount_approved,
    fa.status             AS first_apply_status,
    la.apply_date         AS latest_apply_date,
    la.loan_type          AS latest_apply_loan_type,
    la.amount_requested   AS latest_apply_amount_requested,
    la.amount_approved    AS latest_apply_amount_approved,
    la.status             AS latest_apply_status,
    lt.last_tracking_date AS latest_track_date,
    lt.profit_ratio_avg   AS latest_track_profit_ratio_avg,
    lt.tracking_months    AS latest_track_tracking_months,
    lt.avg_revenue        AS latest_track_avg_revenue,
    lt.risk_level         AS latest_track_risk_level
FROM companies c
LEFT JOIN first_applications fa  ON c.company_id = fa.company_id AND fa.rn = 1
LEFT JOIN latest_applications la ON c.company_id = la.company_id AND la.rn = 1
LEFT JOIN latest_tracking lt     ON la.application_id = lt.application_id AND lt.rn = 1;
```
Using PETsARD After Integration
After integration is complete, use the standard PETsARD workflow:
```yaml
Loader:
  data:
    filepath: 'denormalized_data.csv'
    schema: 'denormalized_schema.yaml'
Preprocessor:
  default:
    method: 'default'
Synthesizer:
  default:
    method: 'default'
Postprocessor:
  default:
    method: 'default'
Reporter:
  output:
    method: 'save_data'
    source: 'Postprocessor'
```
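If you drive PETsARD from Python, running this workflow might look like the sketch below. It is based on the Executor interface shown in PETsARD's documentation; the config file name is an assumption here, so check it against the version you have installed.

```python
# Run the PETsARD workflow defined in the YAML above
# ('config.yaml' is an assumed file name for this example)
from petsard import Executor

executor = Executor(config='config.yaml')
executor.run()
```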
Important Notes
When performing data integration, pay special attention to:
- Confirm primary key relationships: avoid duplicated or dropped records when joining
- Handle time-series information properly: for example, use summary statistics to preserve important features
- Table merge order: this affects the final result; merge strongly related tables first
- Downstream task requirements: keep only the fields the downstream task needs, to reduce synthesis complexity
By denormalizing before synthesis, you clearly preserve business logic relationships, reduce data distortion during synthesis, and improve the usability and quality of the final synthetic data.