# Privacy-preserving chi-squared test of independence for small samples

*BioData Mining*
**volume 14**, Article number: 6 (2021)

## Abstract

### Background

The importance of privacy protection in analyses of personal data, such as genome-wide association studies (GWAS), has grown in recent years. GWAS focuses on identifying single-nucleotide polymorphisms (SNPs) associated with certain diseases such as cancer and diabetes, and the chi-squared (*χ*^{2}) hypothesis test of independence can be utilized for this identification. However, recent studies have shown that publishing the results of *χ*^{2} tests of SNPs or personal data could lead to privacy violations. Several studies have proposed anonymization methods for *χ*^{2} testing with *ε*-differential privacy, which is the cryptographic community’s de facto privacy metric. However, existing methods either can only be applied to 2×2 or 2×3 contingency tables or have low accuracy for small numbers of samples. In many cases, such as the analysis of COVID-19 in its early propagation stage, it is difficult to collect a large number of highly sensitive samples.

### Results

We propose a novel anonymization method (RandChiDist), which anonymizes *χ*^{2} testing for small samples. We prove that RandChiDist satisfies differential privacy. We also experimentally evaluate it using synthetic datasets and two real genomic datasets. Among existing and baseline methods that can control the ratio of Type I errors, RandChiDist achieved the fewest Type II errors.

### Conclusions

We propose a new differentially private method, named RandChiDist, for anonymizing *χ*^{2} values for an *I*×*J* contingency table with a small number of samples. The experimental results show that RandChiDist outperforms existing methods for small numbers of samples.

## Introduction

Examining genes involves comparing several groups of genes [1, 2], with three or more groups possibly involved in several instances. Generally, statistical analyses such as the chi-squared (*χ*^{2}) test of independence are used to determine whether single-nucleotide polymorphisms (SNPs) can be considered significantly different. The findings from such analyses are frequently shared between researchers and government agencies to facilitate new discoveries.

A genome can contain sensitive information about an individual such as genetic disease factors and disease risk. Human genomes are 99.9% identical across individuals, with the remaining 0.1% difference producing people’s various characteristics. The variation among individuals at a single position in a genome is known as a SNP. A genome-wide association study (GWAS) is a method of analyzing the statistical relationship between SNPs and diseases by finding SNPs that are related to a specific disease. To accomplish this, *χ*^{2} testing has been used. Homer et al. [3] reported that an attacker may be able to statistically determine whether someone is a member of a group with a specific disease if the attacker is familiar with the potential victim’s SNPs and the aggregate allele frequencies within that specific disease group.

The underlying assumption that the attacker is familiar with the potential victim’s SNPs, which can be obtained from a very small blood sample, is realistic because of the increasing availability of cost-effective genotyping services [4, 5]. Furthermore, Wang et al. [6] suggested that the allele frequency of the group SNP values can be determined from standard statistical data such as p-values or *χ*^{2} values. Consequently, an anonymization procedure should always be applied to *χ*^{2} values when publishing SNP datasets [6–8].

Data sharing in genomic research is very important [9]. To avoid the leakage of private information described above, a privacy protection mechanism should be applied to GWAS results. Existing studies add a relatively large amount of noise to GWAS results to protect privacy. Our aim, in contrast, is to reduce the amount of noise while maintaining the same level of privacy protection. In other words, privacy-preserving *χ*^{2} testing can achieve the same level of privacy protection as existing studies while increasing the usefulness of GWAS results.

Recent GWAS analysis methods are not limited to the chi-squared test. For example, mixed linear model based methods have been used. However, the chi-squared test is still an important analysis method.

Although other methods for GWAS exist, many recent research papers employ the chi-squared test for GWAS, such as [10–13], which were published in 2019 or 2020. Furthermore, the chi-squared test is used in numerous papers on GWAS to analyze COVID-19 [14–18]. Thus, because the chi-squared test has been adopted in many cases, it is worth studying.

Other tests, such as the Kruskal-Wallis test and the Wilcoxon test, are also employed for GWAS [19, 20]. Couch et al. [21] proposed differentially private methods for these tests. Dealing with other tests in our research remains an issue to be addressed in future work.

The most influential privacy metric within the privacy community is *ε*-differential privacy [22], which has been intensively investigated [23–26]. Several researchers, such as Fienberg et al. [27], Uhlerop et al. [28], and Yu et al. [7], have suggested approaches to facilitate sharing of *χ*^{2} values while conforming with *ε*-differential privacy parameters. However, these proposed methods are currently only applicable to 2×2 or 2×3 contingency tables. In other words, it is currently not possible to analyze contingency tables larger than 2×3. However, the requirement to analyze SNPs based on an *I*×*J* contingency table is crucial. For example, previous studies have evaluated higher degrees of freedom within a contingency table [8, 29]. However, these methods have relatively poor accuracy, particularly in cases with small sample populations. This condition applies to many situations where the sample sizes being considered can range from dozens to several hundred samples [30–32].

Although we live in an era of big data where datasets with a large number of samples are becoming available in many domains, obtaining sensitive information is still difficult due to privacy regulations such as the General Data Protection Regulation (GDPR). Sensitive patient biomedical data cannot be shared without permission [33]. Moreover, there are many rare diseases, and obtaining information about patients with such diseases is very difficult [34, 35]. Further, it is difficult to collect a large number of samples when there is a need for rapid analysis of a new disease such as COVID-19. Some people might provide their sensitive information without any privatization scheme; however, more people would provide it if a privatization scheme were applied [36, 37]. Moreover, many studies [38–40] have considered contingency tables larger than 2×3. Therefore, private *χ*^{2} testing for large contingency tables with small samples is an important problem.

In this paper, we propose a new method, named RandChiDist, for anonymizing *χ*^{2} values for an *I*×*J* contingency table with a small number of samples, and we experimentally evaluate this method using synthetic and real datasets, including two genomic datasets. RandChiDist adds the minimized Laplace noise to the true *χ*^{2} value based on the contingency table and controls the ratio of Type I errors (i.e., false positives). The evaluation shows that RandChiDist can control the ratio of Type I errors strictly and can reduce Type II errors (i.e., false negatives) more than existing methods that can control the ratio of Type I errors. Several methods reduce Type II errors more than RandChiDist; however, these methods cannot control the ratio of Type I errors.

Several approaches exist for non-private *χ*^{2} testing, and RandChiDist can be used to calculate the global sensitivity of the *χ*^{2} value of the simplest *χ*^{2} testing and to add noise to the *χ*^{2} value based on the global sensitivity. Thus, the added noise is minimized according to the Laplace mechanism theorem [22].

The motivation of this paper is summarized as follows. The chi-squared test can be employed for various data analyses, such as the identification of SNPs associated with certain diseases; however, publishing the chi-squared value can lead to privacy leakage. Thus, we propose a privacy-preserving chi-squared testing algorithm for a small number of samples, motivated by the difficulty of collecting a large number of samples for a rare or new disease.

In our research, a dataset of fewer than about 1,000 samples is considered a small sample.

The rest of this paper is organized as follows: “Preliminaries” section introduces the *χ*^{2} hypothesis test and differential privacy. “Related work” section discusses related work. “Proposed method” section presents our proposed method and “Evaluation” section presents the results of our simulations. “Discussion” section discusses the evaluation results and the need for adaptation to a large contingency table and a small sample. “Conclusion” section concludes the paper.

## Preliminaries

### *χ*^{2} hypothesis test of independence

We consider a contingency table with *I* rows and *J* columns. Let [*i*,*j*] denote the *i*th row and *j*th column’s cell of the table. *O*_{i,j} represents the value of cell [*i*,*j*], and *E*_{i,j} represents the expected value of cell [*i*,*j*].

Let \(m_{i} \,=\,\! \sum _{j} O_{i,j}, s_{j} \,=\,\! \sum _{i}\! O_{i,j}\), and \(n \,=\,\! \sum _{i}\! m_{i} \,=\, \sum _{j} s_{j}\). Table 1 provides an example of a contingency table.

The *χ*^{2} value is calculated as

\(\chi^{2} = \sum_{i=1}^{I} \sum_{j=1}^{J} V_{i,j}, \quad V_{i,j} = \frac{(O_{i,j} - E_{i,j})^{2}}{E_{i,j}}, \quad E_{i,j} = \frac{m_{i} s_{j}}{n}. \qquad (1)\)

We determine the significance level *α* (i.e., the probability of a Type I error occurring) and the null hypothesis *H*_{0} in advance. We then calculate *χ*^{2} based on Eq. (1) and determine whether to reject *H*_{0} using the *χ*^{2} *distribution table*. Here, \(\chi _{v}^{2}\) represents a random variable following the *χ*^{2} distribution with *v* degrees of freedom. The *χ*^{2} distribution table presents the percentage point *x* satisfying \(P(\chi _{v}^{2} > x) = \alpha \) for several combinations of *v* and *α*.
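The non-private test described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual Pearson form of the statistic with expected counts *E*_{i,j} = *m*_{i}*s*_{j}/*n* and *v* = (*I*−1)(*J*−1) degrees of freedom. The function name, the example table, and the choice to skip cells whose expected count is zero are assumptions made for this sketch.

```python
def chi_squared(table):
    """Pearson chi-squared statistic of an I x J table given as a list of rows."""
    row_totals = [sum(row) for row in table]        # m_i
    col_totals = [sum(col) for col in zip(*table)]  # s_j
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E_{i,j}
            if expected > 0:  # assumption: empty rows/columns contribute nothing
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2 x 2 table, so v = (2 - 1) * (2 - 1) = 1 degree of freedom.
table = [[30, 20], [20, 30]]
stat = chi_squared(table)

# Tabulated percentage point of the chi-squared distribution for v = 1, alpha = 0.05.
CRITICAL_V1_ALPHA_005 = 3.841
reject_h0 = stat > CRITICAL_V1_ALPHA_005
```

For larger tables the critical value would be looked up for the corresponding *v* and *α* rather than hard-coded.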

### Privacy model

In recent years, *ε*-differential privacy [22] has been considered the de facto standard for privacy metrics [33, 41–43].

The privacy parameter *ε* reflects the privacy level, with a larger *ε* value indicating weaker privacy. We consider neighboring databases to be two databases that differ by at most one record. *ε*-differential privacy is defined as follows:

### Definition 1

(*ε*-differential privacy) Let *D* and *D*^{′} be neighboring databases. A randomized mechanism \(\mathcal {M}\) satisfies *ε*-differential privacy if, for any *D* and *D*^{′} and any subset of outputs \(Y \subset Range(\mathcal {M})\), it holds that

\(\Pr[\mathcal {M}(D) \in Y] \leq e^{\varepsilon} \Pr[\mathcal {M}(D^{\prime}) \in Y].\)

The Laplace mechanism, which adds noise generated using a Laplace distribution, can satisfy *ε*-differential privacy (Theorem 1) [22]. To explain this mechanism, we first outline the concept of global sensitivity.

### Definition 2

(Global sensitivity) Let *f* be a function \(f : \mathcal {D} \to \mathbb {R}^{d}\), where \(\mathcal {D}\) is a collection of databases. The global sensitivity *Δf* of *f* is the largest change in *f* over any pair of neighboring databases *D* and *D*^{′}:

\(\Delta f = \max_{D, D^{\prime}} \lVert f(D) - f(D^{\prime}) \rVert_{1}.\)

### Theorem 1

(Laplace Mechanism [22]) A randomized mechanism \(\mathcal {M}\) realizes *ε*-differential privacy if \(\mathcal {M}\) outputs *f*(*D*)+*Lap*(*Δf*/*ε*), where *Lap*(*v*) returns independent Laplace random variables with scale parameter *v*.
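As a rough illustration of the Laplace mechanism for a scalar query, the sketch below draws Laplace noise with scale equal to the sensitivity divided by *ε* and adds it to the true answer. The function names and the example values are hypothetical. Sampling the noise as the difference of two exponential variables is one standard way to obtain a Laplace variable and is a choice made here, not something prescribed by the text.

```python
import random


def laplace_noise(scale, rng):
    """One Laplace(0, scale) draw, as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Return value + Lap(sensitivity / epsilon)."""
    return value + laplace_noise(sensitivity / epsilon, rng)


# Fixed seed only so that the sketch is reproducible; a real release would
# not use a fixed seed.
rng = random.Random(0)
true_answer = 42.0  # hypothetical query answer with global sensitivity 1
noisy_answer = laplace_mechanism(true_answer, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller *ε* gives a larger scale and therefore noisier output, which matches the interpretation of *ε* as the privacy level.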

## Related work

In *χ*^{2} testing, a contingency table such as Table 2 is used. This contingency table can be represented as Table 3, and Tables 2 and 3 are equivalent. In research on privacy-preserving *χ*^{2} testing, databases such as those shown in Table 3 are considered. For example, Tables 3 and 4 are neighboring databases because the tables contain the same data with the exception of one record.
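The relationship between a record-level database and its contingency table, and the notion of neighboring databases, can be sketched as follows. The tables referred to above are not reproduced here, so the records, labels, and function names below are hypothetical. The sketch also assumes that neighbors have the same number of records and differ by replacing one record.

```python
from collections import Counter


def to_contingency(records, row_labels, col_labels):
    """Aggregate (row, column) records into an I x J table of counts."""
    counts = Counter(records)
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]


def are_neighbors(db1, db2):
    """True if the databases have equal size and differ in at most one record."""
    if len(db1) != len(db2):
        return False
    unmatched = Counter(db1) - Counter(db2)  # records of db1 with no partner in db2
    return sum(unmatched.values()) <= 1


ROWS = ["case", "control"]   # hypothetical row categories
COLS = ["AA", "Aa", "aa"]    # hypothetical column categories

db = [("case", "AA"), ("case", "Aa"), ("control", "aa"), ("control", "AA")]
# Same database except that the second record changes its column category.
db_neighbor = [("case", "AA"), ("case", "aa"), ("control", "aa"), ("control", "AA")]

table = to_contingency(db, ROWS, COLS)
table_neighbor = to_contingency(db_neighbor, ROWS, COLS)
```

Changing one record within a row moves a single count from one cell to another, so the two contingency tables differ in exactly two cells while the row totals stay the same.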

Yu et al. [7] demonstrated that the global sensitivity of the *χ*^{2} value of 2×3 contingency tables can be calculated as

if *m*_{1} and *m*_{2} are known (i.e., published).

Fienberg et al. [27] and Uhlerop et al. [28] demonstrated that if *m*_{1}=*m*_{2}, the global sensitivity of the *χ*^{2} value can be calculated as

The global sensitivities, *Δ*_{F} and *Δ*_{Y}, have been shown to be optimal values. However, they can only be applied to 2×2 or 2×3 contingency tables.

Kakizaki et al. [44, 45] proposed a unit circle mechanism that can achieve a high degree of accuracy. However, they assumed only 2×2 contingency tables. Additionally, their method does not publish the differentially private *χ*^{2} value; it publishes only the differentially private result of the *χ*^{2} test for a given significance level *α*. Therefore, if a data holder wants to publish the private *χ*^{2} testing results for several *α* values (e.g., *α*=0.05, 0.01, 0.005, and 0.001), the data holder must independently execute the privacy mechanism multiple times (e.g., four times for these four values). Following the composition theorem [46], if a privacy mechanism produces output *K* times, each satisfying *ε*-differential privacy, the resulting privacy parameter becomes *Kε* (i.e., the privacy level decreases). Moreover, Banerjee et al. [47] state that publishing the p-value could be important for data analysis.

The aforementioned studies all assumed that *m*_{i} (*i*=1,…,*I*) is not sensitive information. We can share each value of *m*_{i} without privatization schemes.

Gaboardi et al. [8] proposed several methods for arbitrary contingency tables. First, they show a straightforward method that does not add Laplace noise to the *χ*^{2} value, but rather adds it to each cell of the contingency table with a global sensitivity of 2. In this paper, we call this method RandCell. RandCell is also known as SNPpval, which was proposed by Johnson and Shmatikov [48]. The *χ*^{2} value of the contingency table to which RandCell adds Laplace noise tends to be large, meaning that RandCell yields many false positives. Therefore, Gaboardi et al. proposed several other methods known as PrivIndep, MCIndep with Laplace mechanism, and MCIndep with Gaussian mechanism. They showed that MCIndep with Laplace mechanism had the best performance of their proposed methods. Hence, we describe MCIndep with Laplace mechanism in detail in this paper and refer to it as MCIndep for simplicity.

MCIndep generates many contingency tables randomly based on *m*_{i} and *s*_{j} of the contingency table with added Laplace noise and compares their *χ*^{2} values. The original contingency table can be considered to reject *H*_{0} if the *χ*^{2} value of the contingency table to which RandCell adds Laplace noise falls within the top *α*×100% of the generated contingency tables’ *χ*^{2} values. Other methods have been proposed for (*ε*,*δ*)-differential privacy [49], which is a relaxation of *ε*-differential privacy. We focus on *ε*-differential privacy in this paper, and applying our method to (*ε*,*δ*)-differential privacy is an issue to be addressed in future work.

Sei et al. [50] proposed several theorems for differentially private *χ*^{2} testing, but there were no detailed proofs for the theorems and the equations provided in their study. Moreover, there were no experiments that evaluated the performance of *χ*^{2} testing.

More recently, Gaboardi et al. [29] proposed *χ*^{2} test algorithms (LocalNoiseIND, LocalExpIND, and LocalBitFlipIND) for privacy-preserving *χ*^{2} testing of independence based on local differential privacy. LocalNoiseIND is also known as zCDP general chi-squared test, which was proposed by Kifer and Rogers [51]. In their paper, they showed that LocalExpIND had the best performance of the three methods for most parameter settings. These methods can be applied to arbitrary contingency tables, and address a local model of privacy and assume there is no trusted entity. In this paper, we assume that a trusted entity has all the raw data.

Canonne et al. [52] calculated the sample complexity bounds of an *ε*-differentially private test for distinguishing between two distributions. They also applied differentially private change-point detection. Their method is for a parametric setting that requires that the two distributions are perfectly known. In contrast, our method can be used for a nonparametric setting.

Csail et al. [53] proposed an algorithm for testing the closeness of two distributions in a private manner. Their algorithm can also test the independence of two random variables. However, execution for privacy-preserving *χ*^{2} testing was not described.

Liu et al. [54] showed how *ε* influences the accuracy of differentially private hypothesis testing. They proposed a method to determine an appropriate value for *ε* that can be useful for determining the *ε* value for our proposed algorithm; however, determining *ε* is outside the scope of our paper.

Couch et al. [21] proposed a differentially private hypothesis testing method for the Kruskal-Wallis test, Mann-Whitney test, Wilcoxon test, and one-sample t-test. This hypothesis testing method is not for nominal scale data, which are suitable for *χ*^{2} testing, but rather for ordinal or interval scale data.

The methods for arbitrary *I*×*J* contingency tables have relatively poor accuracy, particularly in cases with small-sample populations. We show the comparison between existing methods and the proposed method in “Evaluation” section.

### Adversarial model

The adversarial model is described as follows. The server has a database, and it wants to share the result of the chi-squared test with data analysts who are potential attackers. The attacker is considered to be a semi-honest entity; that is, the attacker follows the protocol with the server. However, the attacker might attempt to extract individual information from the result of the chi-squared test.

## Proposed method

### Overview

We propose RandChiDist, which adds Laplace noise to the *χ*^{2} value obtained from a target contingency table. Calculating the Laplace noise to be added requires the global sensitivity of the *I*×*J* contingency table’s *χ*^{2} value. The method for calculating the global sensitivity is described in “Global sensitivity of *χ*^{2} value” section.

Typically, the *χ*^{2} distribution table is used to determine whether to reject *H*_{0}. However, because RandChiDist adds noise to the *χ*^{2} value, we need a modified *χ*^{2} distribution table. The method for calculating this is described in “Differentially private hypothesis testing” section. RandChiDist uses this table to determine whether to reject *H*_{0}. We treat bounding the Type I error rate by *α* as a hard constraint.

Our main symbols are summarized in Table 5.

### Global sensitivity of *χ*^{2} value

As was assumed in other studies, we assume that *m*_{i} (*i*=1,…,*I*) is also provided to a data analyzer. We consider contingency tables *D*_{1} and *D*_{2}, which are generated from neighboring databases. Because the neighboring databases differ by one record, their contingency tables differ by a maximum of two cells. The value of cell [*a*,*k*] in table *D*_{2} is greater than that of cell [*a*,*k*] in *D*_{1} by 1, and the value of cell [*a*,*l*] in table *D*_{2} is less than that of cell [*a*,*l*] (s.t. *l*≠*k*) in table *D*_{1} by 1.

Because the values of *m*_{i} (*i*=1,…,*I*) are released to the public, the collection of databases in Definition 2 only includes databases that satisfy the released values of *m*_{i}, and the neighboring databases are elements of the collection. Therefore, the global sensitivity is calculated based on the neighboring databases that satisfy the released values of *m*_{i}.

Thus, we calculate the possible maximum value of the difference of *χ*^{2} values between tables *D*_{1} and *D*_{2}.
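For very small tables, this maximum can be checked by exhaustive enumeration. The sketch below is only a sanity check under stated assumptions, not the closed-form derivation of this section. It enumerates every table with the given row totals, moves one record between two cells of the same row, and records the largest change in the *χ*^{2} value. All function names are hypothetical, and cells with a zero expected count are assumed to contribute nothing.

```python
from itertools import product


def chi_squared(table):
    """Pearson chi-squared statistic; cells with zero expected count are skipped."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def compositions(total, parts):
    """All ordered ways to split `total` into `parts` non-negative integers."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def brute_force_sensitivity(row_totals, n_cols):
    """Largest |chi2(D1) - chi2(D2)| over neighbors with the given row totals."""
    best = 0.0
    row_choices = [list(compositions(m, n_cols)) for m in row_totals]
    for rows in product(*row_choices):
        base = chi_squared(rows)
        for a, row in enumerate(rows):
            for k in range(n_cols):
                for l in range(n_cols):
                    if k == l or row[l] == 0:
                        continue
                    moved = list(row)
                    moved[k] += 1  # cell [a, k] gains one record
                    moved[l] -= 1  # cell [a, l] loses one record
                    neighbor = rows[:a] + (tuple(moved),) + rows[a + 1:]
                    best = max(best, abs(chi_squared(neighbor) - base))
    return best


# 2 x 2 tables with row totals (2, 2), so n = 4 and the smallest row total is 2.
observed_max = brute_force_sensitivity([2, 2], n_cols=2)
closed_form_j2 = 4 ** 2 / (2 * (4 - 2 + 1))  # n^2 / (m_a (n - m_a + 1))
```

For this tiny case the enumeration agrees with the expression *n*^{2}/(*m*_{a}(*n*−*m*_{a}+1)) that appears in the proof for *J*=2. The enumeration grows very quickly with the table size and is only practical for toy examples.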

RandChiDist satisfies differential privacy by adding Laplace noise with global sensitivity because of Theorem 1. We thus propose RandChiDist, which adds Laplace noise with global sensitivity,

where

to the calculated *χ*^{2} value from (1). Here, we have the following theorem:

### Theorem 2

RandChiDist satisfies *ε*-differential privacy.

### Proof

We prove that *Δ*_{R} is the global sensitivity of the *χ*^{2} value of the *I*×*J* contingency table. Theorem 2 then follows from the Laplace mechanism theorem (Theorem 1) because RandChiDist adds *Lap*(*Δ*_{R}/*ε*) to the original value.

Let *O*_{i,j}(*D*) denote the observed value of cell [*i*,*j*] in database *D* and let *χ*^{2}(*D*) denote the *χ*^{2} value of database *D*. Without loss of generality, we consider neighboring databases *D*_{1} and *D*_{2}, which satisfy the following equations:

where *k* and *l* are arbitrary natural numbers satisfying *k*,*l*∈{1,…,*J*} and *k*≠*l*.

From Proposition 2, when *J* is greater than or equal to 3 and we are given the value *a*, neighboring databases that satisfy the following constraints maximize the difference between the *χ*^{2} values of tables *D*_{1} and *D*_{2} (see Fig. 1a).

From constraint (9), we understand that the sum of the *k*th column of *D*_{2} (i.e., *s*_{k} of *D*_{2}) is equal to *m*_{a}.

Let *V*_{i,j}(*D*) denote *V*_{i,j} in Eq. (1) for database *D*.

The symbol *b* is an arbitrary integer from 1 to *I* other than *a*. The symbol *l* is an arbitrary integer from 1 to *J* other than *k*.

The difference between the *χ*^{2} values of tables *D*_{1} and *D*_{2} that satisfies the constraint (9) is thus calculated by

Therefore, given *a*, when the value of *J* is greater than or equal to 3, global sensitivity is represented by Eq. (10). Moreover, from Proposition 1, global sensitivity is represented by Eq. (6) when the value of *J* is greater than or equal to 3 and *a* is not given.

When *J*=2 and *a* is given, neighboring databases that satisfy the following constraints will maximize the difference between the *χ*^{2} values of contingency tables *D*_{1} and *D*_{2} from Proposition 3 (see Fig. 1b).

The difference between the *χ*^{2} values of tables *D*_{1} and *D*_{2} that satisfy the constraint (11) can be calculated as

Because *n*^{2}/(*m*_{a}(*n*−*m*_{a}+1)) increases when *m*_{a} decreases (for *m*_{a}≤*n*/2, which the smallest *m*_{i} always satisfies), the global sensitivity can be represented by Eq. (6) when *J* is equal to 2 and *a* is not given. □

When we use a 2×3 contingency table, *Δ*_{R} is identical to *Δ*_{Y}, and when we use a 2×3 contingency table with *m*_{1}=*m*_{2}, *Δ*_{R} is identical to *Δ*_{F}.

Propositions 1, 2, and 3, which are used in the proof of Theorem 2, are described below.

### Proposition 1

*Δ*_{R} in Eq. (6) is maximized when the minima (7) are satisfied.

### Proof

By differentiating Eq. (6) with respect to *m*_{a}, we obtain

By differentiating Eq. (6) with respect to *m*_{b}, we obtain

From Expressions (13) and (14), *m*_{a} and *m*_{b} should thus be minimized to maximize Eq. (6).

Let *min* denote the minimum value in *m*_{i} (*i*=1,…,*I*) and let *min*+*x* denote the second smallest value in *m*_{i} (*i*=1,…,*I*), where *x*≥0. If *m*_{a} is *min* and *m*_{b} is *min*+*x*, Eq. (6) can then be expressed as

If *m*_{a} is *min*+*x* and *m*_{b} is *min*, Eq. (6) can then be expressed as

Because Expression (15) is always greater than or equal to Expression (16), we find that the value of *Δ*_{R} in Eq. (6) is maximized when (7) is satisfied. □

### Proposition 2

When *J* is greater than or equal to 3 and *a* is given, neighboring databases that satisfy the constraints (9) maximize the difference between the *χ*^{2} values of tables *D*_{1} and *D*_{2}.

### Proof

There are many neighboring databases that satisfy Eq. (8); however, we prove that neighboring databases that satisfy the constraints (9) have the greatest difference, *δ*(*D*_{1},*D*_{2}), between *χ*^{2}(*D*_{1}) and *χ*^{2}(*D*_{2}) when *J*≥3. We assume that *m*_{i} is a fixed value for all values of *i*.

Thus, we write *O*_{i,j}(*D*_{1}) as *O*_{i,j} for any *i* and *j* in the following manner.

Following Lemma 1, *O*_{a,k} should be maximized to maximize *δ*(*D*_{1},*D*_{2}). As a result, the value of *O*_{a,k} becomes *m*_{a}−1 because of the constraints (17).

From Eq. (8) we have the following constraints:

Following Lemma 2, *O*_{i,k} should be zero to maximize *δ*(*D*_{1},*D*_{2}) for all values of *i* except *i*=*a*.

Following Lemma 3, *O*_{a,l} should be minimized to maximize *δ*(*D*_{1},*D*_{2}). As a result, the value of *O*_{a,l} becomes 1 because of the constraints (17).

Following Lemma 4, *O*_{μ,l} (*μ*≠*a*) should be *m*_{μ}, and *O*_{i,l} should be zero for all *i* except *i*=*a* and *i*=*μ*, to maximize *δ*(*D*_{1},*D*_{2}).

As a result, we can maximize *δ*(*D*_{1},*D*_{2}) when tables *D*_{1} and *D*_{2} satisfy the constraints (9) by replacing *μ* in Lemma 4 with *b*. □

### Lemma 1

To maximize *δ*(*D*_{1},*D*_{2}), *O*_{a,k} should be maximized (and correspondingly, *O*_{a,r} for all *r*, except for *r*=*k*,*l*, should be adjusted to satisfy *m*_{a}).

### Proof

We have

By differentiating Eq. (18) with respect to *O*_{a,k}, we obtain

because we have

Because Expression (19) is always ≥0, Eq. (18) increases as *O*_{a,k} increases.

Thus, we can say that *O*_{a,k} should be increased to maximize *δ*(*D*_{1},*D*_{2}). As a result, we have *O*_{a,k}=*m*_{a}−1. □

### Lemma 2

To maximize *δ*(*D*_{1},*D*_{2}), *O*_{i,k} should be minimized (and correspondingly *O*_{i,r} for all *r*, except for *r*=*k*,*l*, should be adjusted to satisfy *m*_{i}) for all values of *i* except for *i*=*a*.

### Proof

We focus on *μ*∈{1,…,*I*} such that *μ*≠*a*. By differentiating Eq. (18) with respect to *O*_{μ,k}, we obtain

because we have

Let \(\Theta = \sum _{i\neq a,\mu } O_{i,k}^{2}/m_{i}\). By solving equation (21) =0 for *Θ*, we obtain

Expression (23) is always greater than zero. When *Θ*=0 in Expression (21), the value of Expression (21) is less than 0.

Thus, when *Θ* is less than Expression (23), Expression (21) is less than zero. Similarly, when *Θ* is greater than Expression (23), Expression (21) is greater than zero. That is, to maximize Eq. (18), the value of *O*_{μ,k} should be either minimized or maximized. From this observation, to maximize Eq. (18), we can say that *O*_{i,k} should be either minimized (i.e., zero) or maximized (i.e., *m*_{i}) for all *i* except for *i*=*a*.

From Lemma 1, we have *O*_{a,k}=*m*_{a}−1. Therefore, when *O*_{i,k}=0 for all *i* except *i*=*a*, we have *s*_{k}=*m*_{a}−1. In this case, *δ*(*D*_{1},*D*_{2}) is

In contrast, when *O*_{i,k}=*m*_{i} for all *i* except *i*=*a*, we have \(s_{k} = \sum _{i} m_{i}-1 = n-1\). In this case, *δ*(*D*_{1},*D*_{2}) is

By subtracting Expression (25) from Expression (24), we obtain

Because Expression (26) is always greater than zero, *O*_{i,k} for all *i* except *i*=*a* should be zero. □

### Lemma 3

To maximize *δ*(*D*_{1},*D*_{2}), *O*_{a,l} should be minimized (and correspondingly *O*_{a,r} for all *r* except *r*=*k*,*l* should be adjusted to satisfy *m*_{a}).

### Proof

By differentiating Eq. (18) with respect to *O*_{a,l}, we obtain

because we have

Because (27) is always less than zero, Eq. (18) increases as *O*_{a,l} decreases. □

### Lemma 4

To maximize *δ*(*D*_{1},*D*_{2}), *O*_{μ,l} (*μ*≠*a*) should be maximized (and correspondingly, *O*_{μ,r} for all *r* except *r*=*k*,*l* should be adjusted to satisfy *m*_{μ}). Additionally, *O*_{i,l} should be minimized (and correspondingly, *O*_{i,r} for all *r* except *r*=*k*,*l* should be adjusted to satisfy *m*_{i}) for all *i* except *i*=*a* and *i*=*μ*.

### Proof

By differentiating Eq. (18) with respect to *O*_{μ,l}(*μ*≠*a*), we obtain

because we have

Let \(\Theta = \sum _{i\neq a,\mu } O_{i,l}^{2}/m_{i}\). We have *O*_{a,l}=1 from Lemma 3. By solving Expression (29)=0 for *Θ*, we obtain

Expression (31) is always greater than zero. When *Θ*=0 and *O*_{a,l}=1 in Expression (29), (29) can be expressed as

Therefore, when *Θ* is less than or equal to Expression (31), Expression (29) is greater than zero, and when *Θ* is greater than Expression (31), Expression (29) is less than zero. Thus, to maximize Eq. (18), the value of *O*_{μ,l} should be either minimized (i.e., zero) or maximized (i.e., *m*_{μ}).

We therefore compare these two cases. Let \(x = \sum _{i\neq \mu } O_{i,l}\). When *O*_{μ,l} is maximized (i.e., *O*_{μ,l}=*m*_{μ}), *δ*(*D*_{1},*D*_{2}) is

In contrast, when *O*_{μ,l} is minimized (i.e., *O*_{μ,l}=0), *δ*(*D*_{1},*D*_{2}) is

By subtracting Expression (34) from Expression (33), we obtain

When *I*=2, the second term of Expression (35) is zero. Therefore, Expression (35) is always greater than zero and Lemma 4 holds when *I*=2.

We then consider the situation where *I*≥3. We assume that *O*_{i,l} is zero for all values of *i* except *i*=*a* and *i*=*μ*. In this case the second term of Expression (35) is zero and the first term of Expression (35) is greater than zero; therefore, we can say that Expression (35) is always greater than zero. Thus, *O*_{μ,l} should be maximized to *m*_{μ} when *O*_{i,l} is zero for all values of *i* except *i*=*a* and *i*=*μ*.

Next, we focus on *v* such that *v*∈{1,…,*I*} and *v*≠*a*,*μ*. We demonstrate that Expression (35) is always ≤ 0 when *O*_{v,l} is maximized to *m*_{v}. Additionally, the second term of Expression (35) is minimized when *I*=3. In this case, we obtain

because *x*=*m*_{v}+1.

Therefore, *O*_{i,l} should be minimized to zero for all *i* except *i*=*a* and *i*=*μ*.

From this observation, Lemma 4 also holds when *I*≥3. □

### Proposition 3

When *J* equals 2 and *a* is given, neighboring databases that satisfy the constraints (11) maximize the difference between the *χ*^{2} values of tables *D*_{1} and *D*_{2}.

The proof can be conducted in a manner similar to that of Lemma 2.

### Differentially private hypothesis testing

We can now calculate the anonymized *χ*^{2} value from an original table,

\({\chi^{2}}^{\ast} = \chi^{2} + Lap(\Delta_{R}/\varepsilon), \qquad (37)\)

where *χ*^{2}^{∗} is the anonymized *χ*^{2} value.

From the definitions of the Laplace distribution and *χ*^{2} distribution, the probability density function of a *χ*^{2} value possessing *v* degrees of freedom with the addition of Laplace noise and global sensitivity *Δ* can be expressed as

where

and

where *Γ*(*v*/2) represents the gamma function evaluated at *v*/2, that is,

\(\Gamma(v/2) = \int_{0}^{\infty} u^{v/2-1} e^{-u}\, du.\)

When we set the significance level to *α*, our proposed RandChiDist rejects *H*_{0} if the *χ*^{2} value calculated using Eq. (1), with the addition of Laplace noise with scale *Δ*_{R}/*ε*, is greater than or equal to the threshold *t*, which is calculated by solving the following equation with regard to *t*:

Lastly, we compare the *χ*^{2}^{∗} value calculated using Eq. (37) to the *t* value calculated using Eq. (43). When *χ*^{2}^{∗} is greater than or equal to *t*, RandChiDist outputs “reject the null hypothesis *H*_{0},” and otherwise outputs “fail to reject the null hypothesis *H*_{0}.”

Algorithm 1 shows the overall RandChiDist algorithm.

If we want an anonymized version of the *p* value, RandChiDist calculates and outputs

The data analyst can thus conduct a *χ*^{2} hypothesis test using an arbitrary *α* by comparing Expression (44) and *α*.
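The threshold and the anonymized *p* value both depend on the distribution of a *χ*^{2} variable with *v* degrees of freedom plus Laplace noise. The sketch below approximates both by plain Monte Carlo sampling from that distribution, which is a substitute for evaluating the integrals exactly and is an assumption of this example. The function names, the default number of draws, and the example inputs (including the sensitivity value) are hypothetical.

```python
import random


def sample_noisy_null(v, scale, rng):
    """One draw of a chi-squared(v) variable plus Laplace(0, scale) noise."""
    chi2 = rng.gammavariate(v / 2.0, 2.0)  # chi-squared(v) = Gamma(shape v/2, scale 2)
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return chi2 + noise


def noisy_threshold_and_p_value(noisy_stat, v, sensitivity, epsilon, alpha,
                                n_draws=100_000, seed=0):
    """Monte Carlo estimates of the rejection threshold and the p value."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    draws = sorted(sample_noisy_null(v, scale, rng) for _ in range(n_draws))
    index = min(int((1.0 - alpha) * n_draws), n_draws - 1)
    threshold = draws[index]  # empirical (1 - alpha) quantile under H0
    p_value = sum(d >= noisy_stat for d in draws) / n_draws
    return threshold, p_value


# Hypothetical inputs: an anonymized statistic of 12.0 from a 2 x 2 table (v = 1),
# an assumed sensitivity of 2.0, epsilon = 1.0, and alpha = 0.05.
t, p = noisy_threshold_and_p_value(12.0, v=1, sensitivity=2.0, epsilon=1.0, alpha=0.05)
reject_h0 = 12.0 >= t
```

Comparing the anonymized statistic with the estimated threshold and comparing the estimated *p* value with *α* lead to the same decision up to Monte Carlo error, and the accuracy of both estimates improves with the number of draws.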

### Complexity analysis

Calculating the original *χ*^{2} value has a computational complexity of *O*(*I*×*J*). Calculating the global sensitivity *Δ*_{R} requires finding the smallest and second smallest values of *m*_{i} (*i*=1,…,*I*); therefore, the computational complexity is *O*(*I*). Calculating (43) and (44) requires numerical integration, for which Monte Carlo integration, for example, can be adopted. The computational complexity of Monte Carlo integration is not influenced by the size of the contingency table. There are numerous Monte Carlo integration methods that can be calculated extremely fast [55].

Therefore, the computational complexity of the proposed algorithm is *O*(*I*×*J*+*M*), where *M* denotes the computational complexity of calculating an integration.

## Evaluation

We compared RandChiDist with RandCell, MCIndep, and LocalExpIND, which are described in the “Related work” section.

LocalExpIND was proposed specifically for the local privacy model and can therefore be used in more scenarios than RandChiDist. The local model of privacy is thus another avenue for future exploration.

Moreover, to clarify the contribution of calculating the private *χ*^{2} distribution table’s value (proposed in the “Differentially private hypothesis testing” section), we also evaluated a method that uses the global sensitivity *Δ*_{R} calculated using Eq. (6) but does not use the private *χ*^{2} distribution table’s value calculated using Eq. (24). We refer to this method as RandChi, which is also proposed in this paper.

The source code for the RandChi and RandChiDist methods can be obtained from https://uecdisk.cc.uec.ac.jp/index.php/s/pic3T9GEp03qy6y.

Bonferroni’s corrected threshold should be used when conducting multiple *χ*^{2} tests [56]. In this paper, we conducted many *χ*^{2} tests; however, we consider each to be independent. Thus, we did not use Bonferroni’s corrected threshold when comparing the performance of our proposed methods with that of existing methods for independent *χ*^{2} testing, and we report the average results of the independent *χ*^{2} tests. Additionally, previous studies of privacy-preserving *χ*^{2} testing, such as [7, 8, 27–29, 44, 45], did not use Bonferroni’s corrected threshold.

We varied the values of *n* from 100 to 900, *α* from 0.005 to 0.05, and *ε* from 0.01 to 10. We set the parameters of MCIndep the same as in [8].

### Significance results

We first evaluated the significance to confirm that RandChiDist guarantees a significance of at least 1−*α*. We randomly generated 2×2 contingency tables based on a multinomial distribution with probabilities of (0.25, 0.25, 0.25, 0.25) 1,000 times. Each time, we evaluated whether each method correctly output “fail to reject the null hypothesis *H*_{0}.” Figure 2 shows the results with an *ε* value of 0.1. The significance of each method should be approximately 1−*α*.
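The structure of this experiment can be sketched in Python for the non-private baseline only. The function names, the seed, and the choice of *n*=500 in the usage note are assumptions for illustration; a private method would replace the comparison inside the loop with its own decision rule.

```python
import random


def sample_table(n, probs, rows, cols, rng):
    """Draw one rows x cols contingency table of n samples from a multinomial."""
    table = [[0] * cols for _ in range(rows)]
    for cell in rng.choices(range(rows * cols), weights=probs, k=n):
        table[cell // cols][cell % cols] += 1
    return table


def chi2_statistic(table):
    """Pearson chi-squared statistic (assumed form of the non-private test)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def empirical_significance(n, probs, rows, cols, critical, trials=1000, seed=0):
    """Fraction of trials in which the test fails to reject H0."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(trials):
        if chi2_statistic(sample_table(n, probs, rows, cols, rng)) < critical:
            kept += 1
    return kept / trials
```

For example, `empirical_significance(500, [0.25] * 4, 2, 2, 3.841)` uses the standard critical value for one degree of freedom at *α*=0.05 and should return a value close to 0.95. Running the same loop with probabilities that violate independence and taking one minus the result gives an empirical power instead.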

The significance levels of RandChiDist, MCIndep, and LocalExpIND were controlled around 1−*α* for any *n*, *ε*, and *α* values. In contrast, RandCell and RandChi had significance values much less than 1−*α* when *ε* was less than 1.

We conducted the same experiments for randomly generated 4×4 contingency tables based on a multinomial distribution with probabilities of 1/16,…,1/16. Figure 3 shows the results with *ε*=0.1. As with the 2×2 contingency tables, the significance values of RandChiDist, MCIndep, and LocalExpIND were approximately 1−*α*. In contrast, the significance values of RandCell and RandChi were less than 1−*α*, especially when *ε* was small.

RandCell adds Laplace noise to each cell. The probability that at least one noise value becomes very large increases with the number of cells. Therefore, RandCell produces many false positives (i.e., its significance results are small) when contingency tables are large. On the other hand, if the Laplace noise is a large negative value, the noisy cell value could be less than five (or even negative). In this case, RandCell fails to reject the null hypothesis based on the rule of thumb. Because such small noisy cells are more likely when *n* is small, RandCell’s results for 4×4 tables are smaller than those for 2×2 tables only when *n* is large.
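To make the dependence on the number of cells concrete, a short worked calculation follows. It assumes, for illustration only, that the *k* cell noises are independent Laplace variables with a common scale *b* and that "very large" means exceeding a cutoff *c*>0 in absolute value; *k*, *b*, and *c* are labels introduced here.

```latex
\Pr\bigl[\,|L| > c\,\bigr] = e^{-c/b},
\qquad
\Pr\Bigl[\,\max_{1 \le i \le k} |L_i| > c\,\Bigr] = 1 - \bigl(1 - e^{-c/b}\bigr)^{k}.
```

If the cutoff is chosen so that the single-cell probability is 0.05, the probability that at least one cell exceeds it is 1−0.95^{4}≈0.19 for a 2×2 table (*k*=4) and 1−0.95^{16}≈0.56 for a 4×4 table (*k*=16).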

RandChi’s significance results do not vary greatly with the table size or *n*. This is because the global sensitivity calculated from Eq. (7) also does not vary greatly with the table size or *n*.

### Power results

We then evaluated each method’s power. The values of parameters *α*,*ε*, and *n* were identical to those in the significance experiments; however, we randomly generated 2×2 contingency tables based on a multinomial distribution with probabilities of (0.25+0.01,0.25−0.01,0.25−0.01,0.25+0.01) and (0.25+0.15,0.25−0.15,0.25−0.15,0.25+0.15). We also used another probability set (0.3+0.15,0.3−0.15,0.2−0.15,0.2+0.15) to determine whether RandChiDist can be applied to unbalanced tables. Moreover, we randomly generated 3×4 contingency tables based on a multinomial distribution with probabilities of (1/12+0.07,1/12−0.07,1/12,1/12,1/12−0.07,1/12+0.07,1/12,…,1/12). Each time, we evaluated whether each method correctly output “reject the null hypothesis *H*_{0}.” Figures 4, 5, 6, and 7 show the results for *ε*=0.1.

In the experiment on a multinomial distribution with probabilities of (0.25+0.01,0.25−0.01,0.25−0.01,0.25+0.01), the empirical power of Non-private, which does not consider privacy at all, is very low, ranging from approximately 0 to 0.2. Hence, none of the privacy-preserving algorithms that can control Type I errors achieves high empirical power, although our proposed RandChiDist is slightly better than the other algorithms.

In the experiments on the other multinomial distributions, MCIndep has relatively low empirical power. MCIndep generates many contingency tables based on the original contingency table and immediately outputs “fail to reject *H*_{0}” when at least one cell in the generated contingency tables has a value of less than five. Therefore, even if all cells of the target contingency table have values greater than five, MCIndep is likely to output “fail to reject *H*_{0}” if several values are close to five (for example, a value of 10).

In contrast, RandCell and RandChi both achieved high empirical power at the expense of empirical significance. The empirical power of MCIndep is high when there are many samples and the data are uniformly distributed. RandChiDist achieved higher empirical power with fewer samples than MCIndep while also achieving empirical significance.

In hypothesis testing that includes *χ*^{2} testing, we should avoid Type I errors (i.e., false positives). In general, we adjust the Type I error probability by the value of *α* (e.g., 0.05). Even if the empirical power is high, the algorithm is of no use if the empirical significance is less than 1−*α*. The empirical power of RandCell and RandChi is greater than that of RandChiDist; however, RandCell and RandChi have empirical significance values much less than 1−*α*. That is, they cannot control Type I errors (false positives) in many cases. Therefore, we can conclude that RandChiDist outperforms RandCell and RandChi. Among RandChiDist, MCIndep, and LocalExpIND, which can control Type I errors, RandChiDist has the highest power.

### Results of real datasets

We used two genomic datasets^{Footnote 1}. The first is the Human Genome Diversity Project genotype dataset (HGDP) used by Conrad et al. [57], which consists of 2,834 SNPs and has 1,244 records after records containing unknown values are eliminated. The other is the International Haplotype Map Project genotype dataset (HapMap) used by Pemberton et al. [58], which consists of 1,853 SNPs and has 420 complete records.

We randomly generated contingency tables for linkage disequilibrium analysis from each dataset and set the numbers of columns and rows to four. Following the “rule of thumb,” if any cell of a generated contingency table was less than five, we generated another contingency table. We conducted normal *χ*^{2} testing on the original contingency tables and then carried out the privacy-preserving methods. We generated contingency tables and conducted *χ*^{2} testing 100 times, and then calculated the mean false positive and false negative rates.

The results for the HGDP genotype and HapMap genotype datasets are shown in Figs. 8 and 9, respectively. RandChiDist outperformed MCIndep and LocalExpIND for most of the parameter settings used in this paper.

## Discussion

According to the evaluation results, RandCell and RandChi could not control the ratio of Type I errors; that is, they caused many false positives. In contrast, RandChiDist, MCIndep, and LocalExpIND could control the ratio of Type I errors, and RandChiDist achieved the fewest Type II errors among these three methods. When testing a hypothesis, data analysts determine the significance level *α* (i.e., the ratio of Type I errors) ahead of time. That is, they reject a true null hypothesis with a probability no greater than *α*. A high false positive rate means that a true null hypothesis is rejected with a probability greater than *α*, which leads to the false interpretation of datasets. Therefore, if we want to avoid such false interpretations, RandChiDist is the preferred method.

There are several approaches for non-private *χ*^{2} testing. The simplest approach is shown in the “*χ*^{2} hypothesis test of Independence” section. RandChi and RandChiDist calculate the global sensitivity of the *χ*^{2} value of this simplest test and add noise based on the global sensitivity to the *χ*^{2} value. Thus, the added noise is minimized following the Laplace mechanism theorem (Theorem 1).

In contrast, RandCell calculates the global sensitivity of the value of each cell and adds noise to each value. The sum of the added noise thus becomes very large. MCIndep takes another approach to non-private *χ*^{2} testing, as shown in the “Related work” section. MCIndep first estimates the parameters of the underlying multinomial distribution that generates the samples. Using the estimated multinomial distribution, MCIndep generates more than 1/*α* contingency tables. When the number of samples is small, the estimated parameters of the underlying multinomial distribution have low accuracy. Because of this inaccurate estimation, MCIndep could perform poorly when the number of samples is small. LocalExpIND assumes that each piece of data is anonymized for each person and that there is no trusted entity. Because noise is added to each data point, the total amount of noise becomes large.

Sharpe claimed that *χ*^{2} hypothesis testing for contingency tables larger than 2×2 should be avoided where possible [59]. However, he acknowledged that this cannot always be avoided and reported that approximately 30% of *χ*^{2} tests are conducted for contingency tables larger than 2×2, based on his survey of journals published by the American Psychological Association in 2012, 2013, and early 2014. *χ*^{2} hypothesis testing has been widely used for GWAS as well as many other personal databases [38–40]. Moreover, some studies [38–40] have considered contingency tables larger than 2×3. Therefore, we consider the application of *ε*-differential privacy to *χ*^{2} hypothesis testing for contingency tables larger than 2×3 to be an important issue.

Our proposed method can be used not only for GWAS but also for other private data analyses with small samples. For example, the characteristics of COVID-19 patients (*n*=403; 100 patients died and 303 recovered) were analyzed by a *χ*^{2} test with *α*=0.05 [60]. Poyiadi et al. compared COVID-19 patients with and without acute pulmonary embolism [61]; the number of patients was *n*=328, and they conducted a *χ*^{2} test with *α*=0.05. The influence of COVID-19 on sexual activity was analyzed by Jacob et al. [62]; the number of samples was 868. As these studies show, there is a strong need for testing with small sample sizes. In particular, it is difficult to collect a large number of samples when a new disease such as COVID-19 requires rapid analysis.

We assume that the data holder publishes *m*_{i} as well as the differentially private chi-square value. In general, *m*_{i} and the sample size are necessary to interpret a chi-square value accurately [63]. For example, even when the differences between two datasets are trivial, a very large chi-square value (and thus a very small *p* value) can be obtained when every *m*_{i} is very large [64]. Therefore, *m*_{i} is very useful information for data analysts.

Publishing *m*_{i} also provides several other types of information. For example, we know that *O*_{i,j} for all *j* are less than or equal to *m*_{i}. However, we cannot know each value of *O*_{i,j}, and we cannot know which value is greater (*O*_{i,j} or *O*_{i,j′}) for any *j* or *j*^{′}, even if we know *m*_{i} and the (differentially private) chi-square value. Our proposed algorithm can protect chi-square values based on differential privacy, and we can ensure that it is impossible to reconstruct the original cross table. To the best of our knowledge, no researchers have claimed that publishing *m*_{i} could cause privacy issues.

## Conclusion

*χ*^{2} testing is widely used in GWAS and other types of data analysis. We proposed the RandChiDist method, which anonymizes the *χ*^{2} value of contingency tables. With a large number of samples, precise statistical analysis is easy to conduct; however, obtaining highly sensitive data is quite difficult for privacy reasons. Existing methods for privacy-preserving *χ*^{2} testing, such as MCIndep, are a better choice when the number of samples *n* is large; however, we demonstrated that RandChiDist outperforms existing methods when *n* is small.

Future work will include evaluating other relevant datasets. We also plan to apply our method to other hypothesis testing methods such as Student’s *t*-test and Fisher’s exact test.

## Availability of data and materials

All data generated or analysed during this study are included in this published article.

## Notes

https://web.stanford.edu/group/rosenberglab/hgdpsnpDownload.html (accessed May 26, 2017)

## References

Wu X, Dong H, Luo L, Zhu Y, Peng G, Reveille JD, Xiong M. A Novel Statistic for Genome-Wide Interaction Analysis. PLoS Genet. 2010; 6(9):1001131.

Hoh J, Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet. 2003; 4(9):701–9.

Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV, Stephan DA, Nelson SF, Craig DW. Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet. 2008; 4(8):1000167.

Dorfman R, Mamzer-Bruneel M-F, Vogt G, Hervé C, Izatt L, Jacobs C, Donaldson A, Brady A, Cuthbert A, Harrison R. Falling prices and unfair competition in consumer genomics. Nat Biotechnol. 2013; 31(9):785–6.

Savage N. Privacy: The myth of anonymity. Nature. 2016; 537(7619):70–72.

Wang R, Li YF, Wang X, Tang H, Zhou X. Learning your identity and disease from research papers: information leaks in genome wide association study. In: Proc. ACM CCS. New York City: Association for Computing Machinery: 2009. p. 534–44.

Yu F, Fienberg SE, Slavković AB, Uhler C. Scalable privacy-preserving data sharing methodology for genome-wide association studies. J Biomed Inform. 2014; 50:133–41.

Gaboardi M, Lim HW, Rogers R, Vadhan S. Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing. In: Proc. ICML. Cambridge: Journal of Machine Learning Research, Inc.: 2016.

Pereira S, Gibbs R, McGuire A. Open Access Data Sharing in Genomic Research. Genes. 2014; 5(3):739–47. https://doi.org/10.3390/genes5030739.

Terao C, Momozawa Y, Ishigaki K, Kawakami E, Akiyama M, Loh P-R, Genovese G, Sugishita H, Ohta T, Hirata M, Perry JRB, Matsuda K, Murakami Y, Kubo M, Kamatani Y. GWAS of mosaic loss of chromosome Y highlights genetic effects on blood cell differentiation. Nat Commun. 2019; 10(1). https://doi.org/10.1038/s41467-019-12705-5.

Schmidt-Kastner R, Guloksuz S, Kietzmann T, van Os J, Rutten BPF. Analysis of GWAS-Derived Schizophrenia Genes for Links to Ischemia-Hypoxia Response of the Brain. Front Psychiatry. 2020; 11. https://doi.org/10.3389/fpsyt.2020.00393.

Lee K-Y, Leung K-S, Ma SL, So HC, Huang D, Tang NL-S, Wong M-H. Genome-Wide Search for SNP Interactions in GWAS Data: Algorithm, Feasibility, Replication Using Schizophrenia Datasets. Front Genet. 2020; 11. https://doi.org/10.3389/fgene.2020.01003.

Yuan J, Xing H, Lamy AL, Lencz T, Pe’er I. Leveraging correlations between variants in polygenic risk scores to detect heterogeneity in GWAS cohorts. PLOS Genet. 2020; 16(9). https://doi.org/10.1371/journal.pgen.1009015.

Armstrong J, Rudkin JK, Allen N, Crook DW, Wilson DJ, Wyllie DH, O’Connell AM. Dynamic linkage of COVID-19 test results between Public Health England’s Second Generation Surveillance System and UK Biobank. Microb Genom. 2020; 6(7). https://doi.org/10.1099/mgen.0.000397.

Shelton JF, Shastri AJ, Ye C, Weldon CH, Filshtein-Somnez T, Coker D, Symons A, Esparza-Gordillo J, Team C, Aslibekyan S, Auton A. Trans-ethnic analysis reveals genetic and non-genetic associations with COVID-19 susceptibility and severity. medRxiv. 2020:2020–090420188318. https://doi.org/10.1101/2020.09.04.20188318.

Asselta R, Paraboschi EM, Mantovani A, Duga S. ACE2 and TMPRSS2 Variants and Expression as Candidates to Sex and Country Differences in COVID-19 Severity in Italy. SSRN Electron J. 2020. https://doi.org/10.2139/ssrn.3559608.

Galmés S, Serra F, Palou A. Current State of Evidence: Influence of Nutritional and Nutrigenetic Factors on Immunity in the COVID-19 Pandemic Framework. Nutrients. 2020; 12(9):2738. https://doi.org/10.3390/nu12092738.

Das R, Ghate SD. Investigating the likely association between genetic ancestry and COVID-19 manifestations. medRxiv. 2020:20054627. https://doi.org/10.1101/2020.04.05.20054627.

Ren W-L, Wen Y-J, Dunwell JM, Zhang Y-M. pKWmEB: integration of Kruskal–Wallis test with empirical Bayes under polygenic background control for multi-locus genome-wide association study. Heredity. 2018; 120(3). https://doi.org/10.1038/s41437-017-0007-4.

Casto AM, Feldman MW. Genome-Wide Association Study SNPs in the Human Genome Diversity Project Populations: Does Selection Affect Unlinked SNPs with Shared Trait Associations? PLoS Genet. 2011; 7(1). https://doi.org/10.1371/journal.pgen.1001266.

Couch S, Kazan Z, Shi K, Bray A, Groce A. Differentially private nonparametric hypothesis testing. In: Proc. ACM CCS. New York City: Association for Computing Machinery: 2019. p. 737–51.

Dwork C, McSherry F, Nissim K, Smith A. Calibrating Noise to Sensitivity in Private Data Analysis. In: Proc. Theory of Cryptography (TCC). Berlin: Springer: 2006. p. 265–84.

Ren H, Li H, Liang X, He S, Dai Y, Zhao L. Privacy-Enhanced and Multifunctional Health Data Aggregation under Differential Privacy Guarantees. Sensors. 2016; 16(9):1463. https://doi.org/10.3390/s16091463.

Sei Y, Ohsuga A. Differential Private Data Collection and Analysis Based on Randomized Multiple Dummies for Untrusted Mobile Crowdsensing. IEEE Trans Inf Forensic Secur. 2017; 12(4):926–39.

Liu Y, Wang H, Peng M, Guan J, Xu J, Wang Y. DeePGA: A Privacy-Preserving Data Aggregation Game in Crowdsensing via Deep Reinforcement Learning. IEEE Internet Things J. 2020. https://doi.org/10.1109/jiot.2019.2957400.

Ukil A, Jara AJ, Marin L. Data-Driven Automated Cardiac Health Management with Robust Edge Analytics and De-Risking. Sensors. 2019; 19(12):2733–1273318. https://doi.org/10.3390/s19122733.

Fienberg SE, Slavkovic A, Uhler C. Privacy Preserving GWAS Data Sharing. In: Proc. IEEE International Conference on Data Mining Workshops. New York City: Institute of Electrical and Electronics Engineers: 2011. p. 628–35.

Uhler C, Slavković A, Fienberg SE. Privacy-Preserving Data Sharing for Genome-Wide Association Studies. J Privacy Confidentiality. 2013; 5(1):137–66.

Gaboardi M, Rogers R. Local Private Hypothesis Testing: Chi-Square Tests. In: Proc. ICML. Cambridge: Journal of Machine Learning Research, Inc.: 2018. p. 1626–35.

Kohutek ZA, Wu AJ, Zhang Z, Foster A, Din SU, Yorke ED, Downey R, Rosenzweig KE, Weber WA, Rimner A. FDG-PET maximum standardized uptake value is prognostic for recurrence and survival after stereotactic body radiotherapy for non-small cell lung cancer. Lung Cancer. 2015; 89(2):115–20.

Shi SQ, White MJ, Borsetti HM, Pendergast JS, Hida A, Ciarleglio CM, De Verteuil PA, Cadar AG, Cala C, McMahon D, et al. Molecular analyses of circadian gene variants reveal sex-dependent links between depression and clocks. Transl Psychiatry. 2017; 6(3):748.

Möckel M, Schindler R, Knorr L, Müller C, Heller Jr G, Störk TV, Frei U. Prognostic value of cardiac troponin T and I elevations in renal disease patients without acute coronary syndromes: a 9-month outcome analysis. Nephrol Dial Transplant Off Publ Eur Dial Transplant Assoc Eur Ren Assoc. 1999; 14(6):1489–95.

Kim JW, Jang B, Yoo H. Privacy-preserving aggregation of personal health data streams. PLoS ONE. 2018; 13(11):0207639. https://doi.org/10.1371/journal.pone.0207639.

Schieppati A, Henter JI, Daina E, Aperia A. Why rare diseases are an important medical and social issue. Lancet. 2008; 371(9629):2039–41. https://doi.org/10.1016/S0140-6736(08)60872-7.

Nguengang Wakap S, Lambert DM, Olry A, Rodwell C, Gueydan C, Lanneau V, Murphy D, Le Cam Y, Rath A. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet. 2020; 28(2):165–73. https://doi.org/10.1038/s41431-019-0508-0.

Capponi A, Fiandrino C, Kantarci B, Foschini L, Kliazovich D, Bouvry P. A Survey on Mobile Crowdsensing Systems: Challenges, Solutions, and Opportunities. IEEE Commun Surv Tutorials. 2019; 21(3):2419–65. https://doi.org/10.1109/COMST.2019.2914030.

Gao H, Xu H, Zhang L, Zhou X. A Differential Game Model for Data Utility and Privacy-Preserving in Mobile Crowdsensing. IEEE Access. 2019; 7:128526–33. https://doi.org/10.1109/ACCESS.2019.2940096.

Bosu A, Carver JC, Bird C, Orbeck J, Chockley C. Process Aspects and Social Dynamics of Contemporary Code Review: Insights from Open Source Development and Industrial Practice at Microsoft. IEEE Trans Softw Eng. 2017; 43(1):56–75.

Pantforder D, Vogel-Heuser B, Grams D, Schweizer K. Supporting Operators in Process Control Tasks–Benefits of Interactive 3-D Visualization. IEEE Trans Human-Machine Syst. 2016; 46(6):895–907.

Mukherjee P, Jansen BJ. Information Sharing by Viewers Via Second Screens for In-Real-Life Events. ACM Trans Web. 2017; 11(1):1–24.

Ren X, Yu CM, Yu W, Yang S, Yang X, McCann JA, Yu PS. LoPub : High-dimensional crowdsourced data publication with local differential privacy. IEEE Trans Inf Forensics Secur. 2018; 13(9):2151–66. https://doi.org/10.1109/TIFS.2018.2812146. http://arxiv.org/abs/arXiv:1612.04350v2.

Torra V. Random dictatorship for privacy-preserving social choice. Int J Inf Secur. 2019:1–9. https://doi.org/10.1007/s10207-019-00474-7.

Grining K, Klonowski M, Syga P. On practical privacy-preserving fault-tolerant data aggregation. Int J Inf Secur. 2019; 18(3):285–304. https://doi.org/10.1007/s10207-018-0413-5.

Kakizaki K, Fukuchi K, Sakuma J. Differential Privacy Based on Geometrical Interpretation of Chi-squared Testing. In: Computer Security Symposium. Tokyo: Information Processing Society of Japan: 2016. p. 1199–206.

Kakizaki K, Fukuchi K, Sakuma J. Differentially private chi-squared test by unit circle mechanism. In: Proc. ICML. Cambridge: Journal of Machine Learning Research, Inc.: 2017. p. 1761–70.

McSherry F, Talwar K. Mechanism Design via Differential Privacy. In: Proc. IEEE FOCS. New York City: Institute of Electrical and Electronics Engineers: 2007. p. 94–103.

Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS, Chaudhury S. Hypothesis testing, type I and type II errors. Ind Psychiatry J. 2009; 18(2):127.

Johnson A, Shmatikov V. Privacy-preserving data exploration in genome-wide association studies. In: Proc. ACM KDD. New York City: Association for Computing Machinery: 2013. p. 1079–87.

Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M. Our data, ourselves: privacy via distributed noise generation. In: Proc. Eurocrypt, vol. 4004. Berlin: Springer: 2006. p. 486–503.

Sei Y, Ohsuga A. Privacy-Preserving Chi-Squared Testing for Genome SNP Databases. In: Proc. 39th International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE EMBC): 2017. https://doi.org/10.1109/EMBC.2017.8037705.

Kifer D, Rogers R. A New Class of Private Chi-Square Tests. In: Proc. International Conference on Artificial Intelligence and Statistics (AISTATS). Cambridge: Journal of Machine Learning Research, Inc.: 2017. p. 991–1000.

Canonne CL, Kamath G, McMillan A, Smith A, Ullman J. The structure of optimal private tests for simple hypotheses. In: Proc. ACM STOC. New York City: Association for Computing Machinery: 2019. p. 310–21.

Aliakbarpour M, Diakonikolas I, Kane D, Rubinfeld R. Private Testing of Distributions via Sample Permutations. In: Proc. NeurIPS. La Jolla: Neural Information Processing Systems Foundation, Inc.: 2019. p. 10878–89.

Liu C, He X, Chanyaswad T, Wang S, Mittal P. Investigating Statistical Privacy Frameworks from the Perspective of Hypothesis Testing. In: Proc. PET. Warsaw: Sciendo: 2019. p. 233–54.

Atanassov E, Dimov IT. What Monte Carlo models can do and cannot do efficiently? Appl Math Model. 2008; 32(8):1477–500.

Cabin RJ, Mitchell RJ. To Bonferroni or Not to Bonferroni: When and How Are the Questions. Bull Ecol Soc Am. 2000; 81(3):246–248.

Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet. 2006; 38(11):1251–60.

Pemberton TJ, Jakobsson M, Conrad DF, Coop G, Wall JD, Pritchard JK, Patel PI, Rosenberg NA. Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India. Ann Hum Genet. 2008; 72(4):535–46.

Sharpe D. Your Chi-Square Test Is Statistically Significant: Now What? Pract Assess Res Eval. 2015; 20(8):1–10.

Luo X, Xia H, Yang W, Wang B, Guo T, Xiong J, Jiang Z, Liu Y, Yan X, Zhou W, Ye L, Zhang B. Characteristics of patients with COVID-19 during epidemic ongoing outbreak in Wuhan, China. medRxiv. 2020:1–17. https://doi.org/10.1101/2020.03.19.20033175.

Poyiadi N, Cormier P, Patel PY, Hadied MO, Bhargava P, Khanna K, Nadig J, Keimig T, Spizarny D, Reeser N, Klochko C, Peterson EL, Song T. Acute Pulmonary Embolism and COVID-19. Radiology. 2020; 201955:1–9. https://doi.org/10.1148/radiol.2020201955.

Jacob L, Smith L, Butler L, Barnett Y, Grabovac I, McDermott D, Armstrong N, Yakkundi A, Tully MA. COVID-19 Social Distancing and Sexual Activity in a Sample of the British Public. J Sex Med. 2020; 17(7):1229–36. https://doi.org/10.1016/j.jsxm.2020.05.001.

Bearden WO, Sharma S, Teel JE. Sample Size Effects on Chi Square and Other Statistics Used in Evaluating Causal Models. J Mark Res. 1982; 19(4):425–30. https://doi.org/10.1177/002224378201900404.

Bentler PM, Bonett DG. Significance tests and goodness of fit in the analysis of covariance structures. Psychol Bull. 1980; 88(3):588–606. https://doi.org/10.1037/0033-2909.88.3.588.

## Funding

This work was supported by JSPS KAKENHI Grant Numbers JP17H04705, JP18H03229, JP18H03340, JP18K19835, JP19K12107, JP19H04113. This work was supported by JST, PRESTO Grant Number JPMJPR1934.

## Author information

### Contributions

YS contributed to the study design, the conception, data acquisition, analysis, interpretation, writing, drafting, revision, and creation of software. AO supervised the study and contributed to the study design, the conception, analysis, interpretation, writing. All authors read and approved the final manuscript.

## Ethics declarations

### Ethics approval and consent to participate

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

## Additional information

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

## About this article

### Cite this article

Sei, Y., Ohsuga, A. Privacy-preserving chi-squared test of independence for small samples.
*BioData Mining* **14**, 6 (2021). https://doi.org/10.1186/s13040-021-00238-x

DOI: https://doi.org/10.1186/s13040-021-00238-x

### Keywords

- Differential privacy
- Chi-squared testing
- Privacy-preserving data mining