

The enforcement of the GDPR in research urges our communities to reconsider privacy issues. The risk of re-identification is more than ever a central concern for European regulations, often translating into constraints on data sharing in science. Interestingly, genomic data have long been shared without questioning the intrinsic, potentially identifying nature of individual genomic data as a “genetic barcode” or “genetic fingerprint”. A strict application of the GDPR may thus impact research reproducibility, data-sharing efforts and ultimately data-driven diagnostics. We propose a solution to share individual HLA genotype data without compromising privacy: generating HLA “avatars” from real HLA genotype data. Starting from founder haplotypes estimated from HLA genotypes by an EM algorithm under Hardy-Weinberg Equilibrium (HWE) proportions, we use in silico genetic resampling of HLA haplotypes to generate the HLA genotypes of unidentifiable virtual individuals: “avatars”. These HLA genotypes must preserve the individual structure of the original dataset and keep global parameters unchanged, such as allele frequencies, genotype frequencies, and the sums of the top 10, 25 and 50 haplotype frequencies (respectively ~20%, 25-30% and ~35% in a population of European ancestry). However, because the “avatarization” process may mimic an evolutionary bottleneck, the total number of haplotypes is reduced in a statistically significant way that depends log-linearly on the sample size (p<10e-4). Haplotypes and alleles occurring fewer than 5 times in the original dataset are prone to over- and under-sampling, as anticipated by the criterion for the Gaussian approximation (n p (1-p) > 5). Avatarization can be improved by informed iterative resampling that corrects the natural sampling truncation of HLA haplotypes under HWE. Beyond genetics, this “digitally-assisted in silico procreation” is a promising data-driven way to facilitate data sharing. 
The resampling method can also accommodate clinical and demographic annotations by stratification, calling for a generalized framework to create avatars for data sharing and data governance in the post-GDPR era.
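
The resampling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that haplotype frequencies have already been estimated (for example by an EM algorithm) and simply draws two haplotypes independently per avatar, which corresponds to HWE proportions. The function names, the haplotype labels `H1`..`H3`, the toy frequencies, and the choice of what `n` counts in the n p (1-p) > 5 check are all hypothetical.

```python
import random
from collections import Counter


def sample_avatars(hap_freqs, n_avatars, seed=0):
    """Draw avatar genotypes as pairs of haplotypes under HWE.

    hap_freqs: dict mapping a haplotype label to its estimated frequency
               (assumed to come from a prior estimation step, not shown).
    Each avatar is two independent draws from the haplotype distribution.
    """
    rng = random.Random(seed)
    haps = list(hap_freqs)
    weights = [hap_freqs[h] for h in haps]
    avatars = []
    for _ in range(n_avatars):
        h1, h2 = rng.choices(haps, weights=weights, k=2)
        # Sort so that (a, b) and (b, a) denote the same unphased genotype.
        avatars.append(tuple(sorted((h1, h2))))
    return avatars


def haplotype_counts(genotypes):
    """Count how many times each haplotype occurs across all genotypes."""
    return Counter(h for genotype in genotypes for h in genotype)


def gaussian_approx_ok(n, p, threshold=5):
    """Rule-of-thumb check n * p * (1 - p) > threshold.

    What n counts (individuals or haplotype draws) is an assumption
    left to the caller.
    """
    return n * p * (1 - p) > threshold


# Toy, made-up frequencies for three hypothetical haplotypes.
toy_freqs = {"H1": 0.6, "H2": 0.3, "H3": 0.1}
avatars = sample_avatars(toy_freqs, n_avatars=1000, seed=42)
counts = haplotype_counts(avatars)
```

Comparing `counts` with the input frequencies shows how closely a given draw reproduces the haplotype distribution. Haplotypes for which `gaussian_approx_ok` returns `False` are the ones expected to be most affected by over- or under-sampling, or to be lost altogether in a single draw.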