🔒 Privacy Decay Research

Does personal data get less sensitive as it ages? A real-data investigation across GDPR, HIPAA, and GeoLife.

📄 Read the Research Paper

Long-form academic writeup of all seven findings, the LOUO transfer benchmark, the Finding 6 flip at 100 users, and the Finding 7 metric-artifact reframing. PDF for download/print, HTML for in-browser reading, and a one-page executive summary for the time-pressed reader.

Paper: PDF Primary

16-page xelatex render with embedded figures, references, and reproducibility appendix. Best for download, print, and citation.

Open PDF

Paper: HTML

Self-contained HTML version (figures inlined as base64, MathML math, mobile-responsive). Reads cleanly on phone or desktop; no notebook viewer needed.

Open HTML

Executive Summary 1 page

One-page distillation: the problem, the three findings that matter, one figure (sigmoid vs linear vs exponential), and the implication for practitioners.

Open Summary

📊 Research Status

98 GeoLife Users (LOUO pool)
48,406 GeoLife GPS Rows
+0.855 Pooled Spatial R² (median)
7 Findings Documented
Headline (Findings 6 + 7): Pooled spatial transfer wins at scale (median R² ≈ +0.85 across 98 held-out users); per-user warm-start fine-tuning hurts the typical user. The "spatial outliers" flagged in Finding 6 turned out (Finding 7) to be a metric artifact: user 87 has Var(log_k) = 0, so the denominator of R² is zero and the score diverges regardless of absolute prediction error. The deployable fix is a metric guard (R² → MAE for low-variance users), not a separate model.
Real-data cron: Sunday 06:00 UTC. Runs the full Kaggle benchmark (GDPR + HIPAA + GeoLife) and re-executes the findings notebook. Last run: 2026-05-03 (numbers reproduced bit-identically).
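
The metric guard described in the headline can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `guarded_score`, the `var_floor` threshold, and the sample values are all assumptions made for the example.

```python
from statistics import fmean


def guarded_score(y_true, y_pred, var_floor=1e-8):
    """Return (metric_name, value): R² normally, MAE when the target variance is ~0.

    var_floor is a hypothetical cutoff; the page does not state the real one.
    """
    mean_true = fmean(y_true)
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    mae = fmean(abs(y - p) for y, p in zip(y_true, y_pred))
    # With (near-)zero variance the R² denominator vanishes, so report MAE instead.
    if ss_tot / len(y_true) < var_floor:
        return "mae", mae
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return "r2", 1.0 - ss_res / ss_tot


# Illustrative values only: a constant-target user and a user with real variance.
flat_user = guarded_score([2.0, 2.0, 2.0], [2.1, 1.9, 2.0])
varied_user = guarded_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

For the constant-target user the guard falls back to MAE, so small absolute errors are reported as small instead of producing an undefined or hugely negative R².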

📄 Source & Supplementary (2026-05-13)

Full Writeup (Markdown)

RESEARCH_FINDINGS.md: original markdown source for the 7 findings, including the Finding 6 flip (30 → 100 users) and the Finding 7 reframing (outliers are a metric artifact, not a spatial failure).

Read on GitHub

Narrative Notebook New

privacy_decay_findings.ipynb: 32-cell guided tour through all 7 findings with reproduced figures, plain-language Finding 7 explainer, and the spatial/temporal tension finale.

Open Notebook

Living Notebook

02_Real_Data_Findings.ipynb: runnable notebook with tables, plots, and the convex-hull / log_k-variance analysis. Auto-refreshed by the Sunday real-data cron.

Open Notebook

Per-User R² Data

JSON dumps with per-user results, outlier rosters, and bootstrap CIs for every analysis (LOUO, fine-tune, hull-membership).

Browse JSONs

🎨 Experiment Visualizations

Model Comparison Latest

Compare Random Forest, Gradient Boosting, Neural Network, and Ridge Regression performance.

View Chart

Custom Decay Curves

Visual comparison of 6 different privacy decay strategies over 2 years.

View Chart
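
For readers who want a feel for what a decay strategy looks like in code, here is a minimal sketch of three curve families this page names elsewhere (linear, exponential, sigmoid), evaluated over a two-year window. The function names and every parameter value (`horizon_days`, `half_life_days`, `midpoint_days`, `steepness`) are illustrative assumptions, not the six strategies plotted in the chart.

```python
import math


def linear_decay(age_days, horizon_days=730.0):
    # Sensitivity falls in a straight line and hits zero at the horizon.
    return max(0.0, 1.0 - age_days / horizon_days)


def exponential_decay(age_days, half_life_days=180.0):
    # Sensitivity halves every half_life_days.
    return 0.5 ** (age_days / half_life_days)


def sigmoid_decay(age_days, midpoint_days=365.0, steepness=0.02):
    # Sensitivity stays high, then drops around the midpoint.
    return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint_days)))


ages = [0, 180, 365, 730]  # days since collection
curves = {
    fn.__name__: [round(fn(age), 3) for age in ages]
    for fn in (linear_decay, exponential_decay, sigmoid_decay)
}
```

Each function maps a record's age to a sensitivity score in [0, 1], so the three shapes can be compared on the same axis.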

Feature Engineering Impact

Analysis showing how different feature sets affect model performance.

View Chart

Privacy Theories Comparison

Validation of different privacy protection theories (age-only, context-first, hybrid, data-type specific).

View Chart

MVP Comprehensive Analysis

Complete MVP analysis with decay curves, distributions, and privacy patterns by data type.

View Chart

ML Model Performance

Detailed ML model metrics including predictions vs actual, residuals, and feature importance.

View Chart

🤖 Ollama Model Benchmarks

Model Benchmark Results Baseline

Performance comparison of tinyllama, phi, llama3.2, and mistral models across privacy-specific tasks.

View Chart · JSON Report

Recommended Models

Default: tinyllama:1.1b (29 tok/s, 4.6s response)
Critical: mistral:7b (80.3 quality, best accuracy)

Browse All

🌙 Automated Test Results

Real-Data Cron Sundays 06:00 UTC

Weekly Kaggle benchmark (GDPR + HIPAA + GeoLife): runs LOUO transfer, per-user fine-tune ablation, sigmoid fits, and re-executes the findings notebook in place.

View Cron Log
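
The LOUO evaluation the cron runs can be pictured with the sketch below, assuming LOUO means leave-one-user-out: each user is held out in turn, a pooled model is fit on everyone else, and the held-out user is scored. The one-feature linear model, the helper names, and the toy rows are assumptions for illustration; the real benchmark's features and model are not specified on this page.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_user_out(rows):
    """rows: iterable of (user_id, x, y). Returns {held_out_user: MAE}."""
    rows = list(rows)
    results = {}
    for held_out in sorted({user for user, _, _ in rows}):
        train = [(x, y) for user, x, y in rows if user != held_out]
        test = [(x, y) for user, x, y in rows if user == held_out]
        # Fit on the pooled remaining users only; the held-out user is never seen.
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        results[held_out] = fmean(abs(y - (slope * x + intercept)) for x, y in test)
    return results


# Toy data (not project data): three users who all follow y = 2x + 1.
toy_rows = [(user, x, 2.0 * x + 1.0) for user in ("a", "b", "c") for x in (0.0, 1.0, 2.0)]
louo_mae = leave_one_user_out(toy_rows)
```

Because every toy user shares the same relationship, the pooled model transfers perfectly here; on real users the per-user scores would vary.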

Phase 3 Research JSONs

Per-analysis output: louo_transfer.json, per_user_finetune.json, convex_hull_outliers.json, etc. Each includes per-user dumps + outlier rosters + bootstrap CIs.

Browse JSONs
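
As a rough idea of how a bootstrap CI over per-user scores can be computed, the sketch below uses a percentile bootstrap of the median. The page does not say which bootstrap variant the JSONs use, so the method, the function name `bootstrap_median_ci`, its defaults, and the sample scores are all assumptions.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median of values."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(values)
    # Resample with replacement, recording the median of each resample.
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = round((1 - alpha / 2) * n_resamples) - 1
    return medians[lower_idx], medians[upper_idx]


# Made-up per-user scores for illustration; not results from the project.
example_scores = [0.71, 0.78, 0.80, 0.83, 0.85, 0.86, 0.88, 0.90, 0.93]
ci_low, ci_high = bootstrap_median_ci(example_scores)
```

Seeding the generator keeps the interval reproducible between runs, which matters if the numbers are expected to match across weekly re-executions.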

Original Nightly Tests

Earlier nightly Gradient Boosting tests on synthetic data (pre-real-data pivot). Kept for historical comparison.

Browse Directory

System Logs

Cron execution logs across all scheduled jobs (real-data, daily, weekly tests).

Browse Logs

📁 Data & Documentation

Datasets Real Data

GeoLife GPS (48,406 rows / 100 users), GDPR Fines (212 records 2018-2020), HIPAA Breaches (1,632 records 2009-2017), plus original synthetic baselines (5,000 samples).

Browse Datasets

All Visualizations

Browse complete directory of all experiment visualizations and charts.

Browse All

System Logs

Access system logs including cron execution logs and error reports.

Browse Logs

GitHub Repository

View source code, documentation, and full project history.

View on GitHub

🔗 Local Reproduction

The full Docker stack (Jupyter, TimescaleDB, Redis, Ollama) reproduces every figure in the paper from raw Kaggle datasets. Credentials are read from a local .env file; see .env.example in the repo.

Service ports (after docker compose up): Jupyter :8889 · Postgres/TimescaleDB :5433 · Redis :6380 · Ollama :11435
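
A quick way to confirm the stack came up is to probe the four ports listed above. The sketch below is a convenience check, not part of the repository: the helper name `port_open`, the service labels, and the assumption that the services listen on 127.0.0.1 are all illustrative.

```python
import socket

# Ports as listed above; the dictionary keys are just labels for this sketch.
SERVICE_PORTS = {"jupyter": 8889, "timescaledb": 5433, "redis": 6380, "ollama": 11435}


def port_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


status = {name: port_open(port) for name, port in SERVICE_PORTS.items()}
for name, is_up in status.items():
    print(f"{name:12s} :{SERVICE_PORTS[name]:<6d} {'up' if is_up else 'down'}")
```

An open port only shows that something is listening; it does not verify credentials or that the service behind it is healthy.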