Independence

Distance Correlation (Dcorr)

class mgc.independence.Dcorr(compute_distance=<function euclidean>)[source]

Class for calculating the Dcorr test statistic and p-value.

Dcorr is a measure of dependence between two paired random matrices of not necessarily equal dimensions. The coefficient is 0 if and only if the matrices are independent. It is closely related to the energy distance between distributions.

Parameters:

compute_distance : callable(), optional (default: euclidean)

A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.
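
For example, a custom metric can be supplied by wrapping a pairwise-distance routine such as scipy's cdist. This is only an illustrative sketch; the helper name manhattan_distance is hypothetical and not part of the package.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> from mgc.independence import Dcorr
>>> def manhattan_distance(x):
...     # (n, n) matrix of pairwise city-block distances between the rows of x
...     return cdist(x, x, metric="cityblock")
>>> dcorr = Dcorr(compute_distance=manhattan_distance)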

See also

Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.
HHG
Heller Heller Gorfine test statistic and p-value.

Notes

The statistic can be derived as follows:

Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). Let \(D^x\) be the \(n \times n\) distance matrix of \(x\) and \(D^y\) be the \(n \times n\) distance matrix of \(y\). The distance covariance is,

\[\mathrm{Dcov}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (D^x H D^y H)\]

where \(\mathrm{tr} (\cdot)\) is the trace operator and \(H\) is defined as \(H = I - (1/n) J\), where \(I\) is the identity matrix and \(J\) is a matrix of ones. The normalized version of this covariance is Dcorr [1]:

\[\mathrm{Dcorr}_n (x, y) = \frac{\mathrm{Dcov}_n (x, y)} {\sqrt{\mathrm{Dcov}_n (x, x) \mathrm{Dcov}_n (y, y)}}\]
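
As a concrete illustration of the formulas above, the biased statistic can be computed directly with NumPy. This is a rough sketch, not the package's internal implementation; the name dcorr_biased is hypothetical and x and y are assumed to be 2-D (n, p) and (n, q) arrays.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> def dcorr_biased(x, y):
...     n = x.shape[0]
...     dx, dy = cdist(x, x), cdist(y, y)      # Euclidean distance matrices
...     h = np.eye(n) - np.ones((n, n)) / n    # centering matrix H = I - (1/n) J
...     dcov = lambda a, b: np.trace(a @ h @ b @ h) / n**2
...     return dcov(dx, dy) / np.sqrt(dcov(dx, dx) * dcov(dy, dy))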

An unbiased version of distance correlation is defined using the following centering process, where \(\mathbb{1}(\cdot)\) is the indicator function:

\[C^x_{ij} = \left[ D^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n D^x_{it} - \frac{1}{n-2} \sum_{s=1}^n D^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n D^x_{st} \right] \mathbb{1}_{i \neq j}\]

and similarly for \(C^y\). Then, the unbiased Dcov is,

\[\mathrm{UDcov}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)\]

The normalized version of this covariance [2] is

\[\mathrm{UDcorr}_n (x, y) = \frac{\mathrm{UDcov}_n (x, y)} {\sqrt{\mathrm{UDcov}_n (x, x) \mathrm{UDcov}_n (y, y)}}\]
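
The unbiased statistic can be sketched in the same spirit from precomputed distance matrices. This is illustrative only; the helper names u_center and udcorr are hypothetical, and \(n > 3\) is assumed.

>>> import numpy as np
>>> def u_center(d):
...     # modified (unbiased) centering of a distance matrix, zeroing the diagonal
...     n = d.shape[0]
...     c = (d
...          - d.sum(axis=1, keepdims=True) / (n - 2)
...          - d.sum(axis=0, keepdims=True) / (n - 2)
...          + d.sum() / ((n - 1) * (n - 2)))
...     np.fill_diagonal(c, 0)
...     return c
>>> def udcorr(dx, dy):
...     # unbiased Dcorr from precomputed (n, n) distance matrices
...     n = dx.shape[0]
...     cx, cy = u_center(dx), u_center(dy)
...     udcov = lambda a, b: np.trace(a @ b) / (n * (n - 3))
...     return udcov(cx, cy) / np.sqrt(udcov(cx, cx) * udcov(cy, cy))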

References

[1] Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769-2794.
[2] Székely, G. J., & Rizzo, M. L. (2014). Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6), 2382-2412.
test(x, y, reps=1000, workers=1, random_state=None)[source]

Calculates the Dcorr test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution in the permutation test that computes the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

random_state : int or np.random.RandomState instance, (default: None)

If an int, a new RandomState instance seeded with it is used. If already a RandomState instance, it is used directly. If None, np.random.RandomState is used.

Returns:

stat : float

The computed Dcorr statistic.

pvalue : float

The computed Dcorr p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a more precise estimate of the p-value (allowing smaller significance levels to be resolved).

>>> import numpy as np
>>> from mgc.independence import Dcorr
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Dcorr().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

In addition, the inputs can be distance matrices. Using these is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from mgc.independence import Dcorr
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> dcorr = Dcorr(compute_distance=None)
>>> stat, pvalue = dcorr.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Hilbert Schmidt Independence Criterion (Hsic)

class mgc.independence.Hsic(compute_kernel=<function gaussian>)[source]

Class for calculating the Hsic test statistic and p-value.

Hsic is a kernel-based independence test that measures multivariate nonlinear associations given a specified kernel [3]. The default is the Gaussian kernel with the median distance as the bandwidth; because this is a characteristic kernel, the resulting Hsic test is consistent [3] [4].

Parameters:

compute_kernel : callable(), optional (default: rbf kernel)

A function that computes the similarity among the samples within each data matrix. Set to None if x and y are already similarity matrices. To call a custom function, either create the similarity matrix beforehand or create a function of the form compute_kernel(x) where x is the data matrix for which pairwise similarities are calculated.
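
For example, a Gaussian kernel with a fixed bandwidth could be supplied as follows. This is an illustrative sketch; the helper name gaussian_kernel is hypothetical, and the package default instead chooses the bandwidth from the median distance.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> from mgc.independence import Hsic
>>> def gaussian_kernel(x, bandwidth=1.0):
...     # (n, n) Gaussian (RBF) similarity matrix for the rows of x
...     d = cdist(x, x)
...     return np.exp(-d**2 / (2 * bandwidth**2))
>>> hsic = Hsic(compute_kernel=gaussian_kernel)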

See also

Dcorr
Distance correlation test statistic and p-value.
HHG
Heller Heller Gorfine test statistic and p-value.

Notes

The statistic can be derived as follows [3]:

Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). Let \(K^x\) be the \(n \times n\) kernel similarity matrix of \(x\) and \(K^y\) be the \(n \times n\) kernel similarity matrix of \(y\). The Hsic statistic is,

\[\mathrm{Hsic}_n (x, y) = \frac{1}{n^2} \mathrm{tr} (K^x H K^y H)\]

where \(\mathrm{tr} (\cdot)\) is the trace operator and \(H\) is defined as \(H = I - (1/n) J\), where \(I\) is the identity matrix and \(J\) is a matrix of ones. The normalized version of Hsic [1] is

\[\mathrm{Hsic}_n (x, y) = \frac{\mathrm{Hsic}_n (x, y)} {\sqrt{\mathrm{Hsic}_n (x, x) \mathrm{Hsic}_n (y, y)}}\]
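
Putting the pieces together, a rough sketch of the biased, normalized Hsic with a median-bandwidth Gaussian kernel follows the formulas above. It is not the package's internal implementation; hsic_biased is a hypothetical name and x and y are assumed to be 2-D arrays.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> def hsic_biased(x, y):
...     n = x.shape[0]
...     def rbf(a):
...         d = cdist(a, a)
...         sigma = np.median(d[d > 0])        # median-distance bandwidth heuristic
...         return np.exp(-d**2 / (2 * sigma**2))
...     kx, ky = rbf(x), rbf(y)
...     h = np.eye(n) - np.ones((n, n)) / n
...     hsic = lambda a, b: np.trace(a @ h @ b @ h) / n**2
...     return hsic(kx, ky) / np.sqrt(hsic(kx, kx) * hsic(ky, ky))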

An unbiased version of Hsic is defined using the following centering process, where \(\mathbb{1}(\cdot)\) is the indicator function:

\[C^x_{ij} = \left[ K^x_{ij} - \frac{1}{n-2} \sum_{t=1}^n K^x_{it} - \frac{1}{n-2} \sum_{s=1}^n K^x_{sj} + \frac{1}{(n-1) (n-2)} \sum_{s,t=1}^n K^x_{st} \right] \mathbb{1}_{i \neq j}\]

and similarly for \(C^y\). Then, the unbiased Hsic is,

\[\mathrm{UHsic}_n (x, y) = \frac{1}{n (n-3)} \mathrm{tr} (C^x C^y)\]

The normalized version of this covariance [2] is

\[\mathrm{UHsic}_n (x, y) = \frac{\mathrm{UHsic}_n (x, y)} {\sqrt{\mathrm{UHsic}_n (x, x) \mathrm{UHsic}_n (y, y)}}\]

References

[3] Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J. (2008). A kernel statistical test of independence. In Advances in Neural Information Processing Systems (pp. 585-592).
[4] Gretton, A., & Györfi, L. (2010). Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11(Apr), 1391-1423.
test(x, y, reps=1000, workers=1, random_state=None)[source]

Calculates the Hsic test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be kernel similarity matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution in the permutation test that computes the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

random_state : int or np.random.RandomState instance, (default: None)

If an int, a new RandomState instance seeded with it is used. If already a RandomState instance, it is used directly. If None, np.random.RandomState is used.

Returns:

stat : float

The computed Hsic statistic.

pvalue : float

The computed Hsic p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a more precise estimate of the p-value (allowing smaller significance levels to be resolved).

>>> import numpy as np
>>> from mgc.independence import Hsic
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Hsic().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

In addition, the inputs can be precomputed similarity matrices. Using these is the same as before, except that the compute_kernel parameter must be set to None.

>>> import numpy as np
>>> from mgc.independence import Hsic
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hsic = Hsic(compute_kernel=None)
>>> stat, pvalue = hsic.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Heller Heller Gorfine (HHG)

class mgc.independence.HHG(compute_distance=<function euclidean>)[source]

Class for calculating the HHG test statistic and p-value.

This is a powerful test for independence based on calculating pairwise Euclidean distances and the associations between these distance matrices. The test statistic is a function of the ranks of these distances, and the test is consistent against general alternatives [5]. It can also operate on multiple dimensions [5].

Parameters:

compute_distance : callable(), optional (default: euclidean)

A function that computes the distance among the samples within each data matrix. Set to None if x and y are already distance matrices. To call a custom function, either create the distance matrix before-hand or create a function of the form compute_distance(x) where x is the data matrix for which pairwise distances are calculated.

See also

Dcorr
Distance correlation test statistic and p-value.
Hsic
Hilbert-Schmidt independence criterion test statistic and p-value.

Notes

The statistic can be derived as follows [5]:

Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). For every sample \(j \neq i\), calculate the pairwise distances in \(x\) and \(y\) and denote this as \(d_x(x_i, x_j)\) and \(d_y(y_i, y_j)\). The indicator function is denoted as \(\mathbb{1} \{ \cdot \}\). The cross-classification between these two random variables can be calculated as

\[A_{11} = \sum_{k=1, k \neq i,j}^n \mathbb{1} \{ d_x(x_i, x_k) \leq d_x(x_i, x_j) \} \mathbb{1} \{ d_y(y_i, y_k) \leq d_y(y_i, y_j) \}\]

and \(A_{12}\), \(A_{21}\), and \(A_{22}\) are defined similarly. This is organized within the following table:

  \(d_y(y_i, \cdot) \leq d_y(y_i, y_j)\) \(d_y(y_i, \cdot) > d_y(y_i, y_j)\)
\(d_x(x_i, \cdot) \leq d_x(x_i, x_j)\) \(A_{11} (i,j)\) \(A_{12} (i,j)\) \(A_{1 \cdot} (i,j)\)
\(d_x(x_i, \cdot) > d_x(x_i, x_j)\) \(A_{21} (i,j)\) \(A_{22} (i,j)\) \(A_{2 \cdot} (i,j)\)
  \(A_{\cdot 1} (i,j)\) \(A_{\cdot 2} (i,j)\) \(n - 2\)

Here, \(A_{\cdot 1}\) and \(A_{\cdot 2}\) are the column sums, \(A_{1 \cdot}\) and \(A_{2 \cdot}\) are the row sums, and \(n - 2\) is the total number of remaining points \(k \neq i, j\). From this table, we can calculate the Pearson's chi-squared test statistic using,

\[S(i, j) = \frac{(n-2) (A_{12} A_{21} - A_{11} A_{22})^2} {A_{1 \cdot} A_{2 \cdot} A_{\cdot 1} A_{\cdot 2}}\]

and the HHG test statistic is then,

\[\mathrm{HHG}_n (x, y) = \sum_{i=1}^n \sum_{j=1, j \neq i}^n S(i, j)\]
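
A direct, unoptimized translation of this definition might look as follows. This is illustrative only; hhg_stat is a hypothetical name, real implementations vectorize the double loop, and degenerate tables with a zero margin are skipped here to avoid division by zero.

>>> import numpy as np
>>> def hhg_stat(dx, dy):
...     # dx, dy: precomputed (n, n) distance matrices
...     n = dx.shape[0]
...     stat = 0.0
...     for i in range(n):
...         for j in range(n):
...             if i == j:
...                 continue
...             k = np.ones(n, dtype=bool)
...             k[[i, j]] = False                  # remaining points k != i, j
...             cx = dx[i, k] <= dx[i, j]          # 1{d_x(x_i, x_k) <= d_x(x_i, x_j)}
...             cy = dy[i, k] <= dy[i, j]          # 1{d_y(y_i, y_k) <= d_y(y_i, y_j)}
...             a11 = np.sum(cx & cy)
...             a12 = np.sum(cx & ~cy)
...             a21 = np.sum(~cx & cy)
...             a22 = np.sum(~cx & ~cy)
...             denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
...             if denom > 0:                      # skip degenerate tables
...                 stat += (n - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
...     return stat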

References

[5] Heller, R., Heller, Y., & Gorfine, M. (2012). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2), 503-510.
test(x, y, reps=1000, workers=1, random_state=None)[source]

Calculates the HHG test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution in the permutation test that computes the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

random_state : int or np.random.RandomState instance, (default: None)

If an int, a new RandomState instance seeded with it is used. If already a RandomState instance, it is used directly. If None, np.random.RandomState is used.

Returns:

stat : float

The computed HHG statistic.

pvalue : float

The computed HHG p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'

Increasing the number of replications gives a more precise estimate of the p-value (allowing smaller significance levels to be resolved).

>>> import numpy as np
>>> from mgc.independence import HHG
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = HHG().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'160.0, 0.00'

In addition, the inputs can be distance matrices. Using these is the same as before, except that the compute_distance parameter must be set to None.

>>> import numpy as np
>>> from mgc.independence import HHG
>>> x = np.ones((10, 10)) - np.identity(10)
>>> y = 2 * x
>>> hhg = HHG(compute_distance=None)
>>> stat, pvalue = hhg.test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'0.0, 1.00'

Canonical Correlation Analysis (CCA)

class mgc.independence.CCA[source]

Class for calculating the CCA test statistic and p-value.

This test can be thought of as inferring information from cross-covariance matrices [6]. It has been argued that virtually all common parametric tests of significance can be treated as special cases of CCA [7]. The method was first introduced by Harold Hotelling in 1936 [8].

See also

Pearson
Pearson product-moment correlation test statistic and p-value.
RV
RV test statistic and p-value.

Notes

The statistic can be derived as follows [9]:

Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). Center \(x\) and \(y\) and compute the sample cross-covariance matrix \(\hat{\Sigma}_{xy} = x^T y\); the variance matrices \(\hat{\Sigma}_{xx}\) and \(\hat{\Sigma}_{yy}\) are defined similarly. Then, the CCA test statistic is found by calculating vectors \(a \in \mathbb{R}^p\) and \(b \in \mathbb{R}^q\) that maximize

\[\mathrm{CCA}_n (x, y) = \max_{a \in \mathbb{R}^p, b \in \mathbb{R}^q} \frac{a^T \hat{\Sigma}_{xy} b} {\sqrt{a^T \hat{\Sigma}_{xx} a} \sqrt{b^T \hat{\Sigma}_{yy} b}}\]
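
One way to obtain this leading canonical correlation is through the singular values of the whitened cross-covariance. This is a minimal sketch, assuming x and y are 2-D arrays and the covariance blocks are invertible; cca_stat is a hypothetical name, not the package's implementation.

>>> import numpy as np
>>> def cca_stat(x, y):
...     # x: (n, p), y: (n, q) data matrices
...     x = x - x.mean(axis=0)
...     y = y - y.mean(axis=0)
...     wx = np.linalg.inv(np.linalg.cholesky(x.T @ x))
...     wy = np.linalg.inv(np.linalg.cholesky(y.T @ y))
...     m = wx @ (x.T @ y) @ wy.T                 # whitened cross-covariance
...     return np.linalg.svd(m, compute_uv=False)[0]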

References

[6]Härdle, W. K., & Simar, L. (2015). Canonical correlation analysis. In Applied multivariate statistical analysis (pp. 443-454). Springer, Berlin, Heidelberg.
[7]Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance-testing system. Psychological Bulletin, 85(2), 410.
[8]Hotelling, H. (1992). Relations between two sets of variates. In Breakthroughs in statistics (pp. 162-190). Springer, New York, NY.
[9]Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12), 2639-2664.
test(x, y, reps=1000, workers=1, random_state=None)[source]

Calculates the CCA test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution in the permutation test that computes the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

random_state : int or np.random.RandomState instance, (default: None)

If an int, a new RandomState instance seeded with it is used. If already a RandomState instance, it is used directly. If None, np.random.RandomState is used.

Returns:

stat : float

The computed CCA statistic.

pvalue : float

The computed CCA p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a more precise estimate of the p-value (allowing smaller significance levels to be resolved).

>>> import numpy as np
>>> from mgc.independence import CCA
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = CCA().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

RV

class mgc.independence.RV[source]

Class for calculating the RV test statistic and p-value.

RV is the multivariate generalization of the squared Pearson correlation coefficient [10]. The RV coefficient is closely related to principal component analysis (PCA), canonical correlation analysis (CCA), multivariate regression, and statistical classification [10].

See also

Pearson
Pearson product-moment correlation test statistic and p-value.
CCA
CCA test statistic and p-value.

Notes

The statistic can be derived as follows [10] [11]:

Let \(x\) and \(y\) be \((n, p)\) samples of random variables \(X\) and \(Y\). Center \(x\) and \(y\) and compute the sample cross-covariance matrix \(\hat{\Sigma}_{xy} = x^T y\); the variance matrices \(\hat{\Sigma}_{xx}\) and \(\hat{\Sigma}_{yy}\) are defined similarly. Then, the RV test statistic is found by calculating

\[\mathrm{RV}_n (x, y) = \frac{\mathrm{tr} \left( \hat{\Sigma}_{xy} \hat{\Sigma}_{yx} \right)} {\sqrt{\mathrm{tr} \left( \hat{\Sigma}_{xx}^2 \right) \mathrm{tr} \left( \hat{\Sigma}_{yy}^2 \right)}}\]

where \(\mathrm{tr} (\cdot)\) is the trace operator.
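
A minimal sketch of this coefficient follows; rv_stat is a hypothetical name, not the package's implementation, and x and y are assumed to be 2-D arrays.

>>> import numpy as np
>>> def rv_stat(x, y):
...     # x: (n, p), y: (n, q) data matrices
...     x = x - x.mean(axis=0)
...     y = y - y.mean(axis=0)
...     sxy, sxx, syy = x.T @ y, x.T @ x, y.T @ y
...     num = np.trace(sxy @ sxy.T)
...     return num / np.sqrt(np.trace(sxx @ sxx) * np.trace(syy @ syy))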

References

[10] Robert, P., & Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3), 257-265.
[11]Escoufier, Y. (1973). Le traitement des variables vectorielles. Biometrics, 751-760.
test(x, y, reps=1000, workers=1, random_state=None)[source]

Calculates the RV test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution in the permutation test that computes the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

random_state : int or np.random.RandomState instance, (default: None)

If an int, a new RandomState instance seeded with it is used. If already a RandomState instance, it is used directly. If None, np.random.RandomState is used.

Returns:

stat : float

The computed RV statistic.

pvalue : float

The computed RV p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Increasing the number of replications gives a more precise estimate of the p-value (allowing smaller significance levels to be resolved).

>>> import numpy as np
>>> from mgc.independence import RV
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = RV().test(x, y, reps=10000)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Pearson

class mgc.independence.Pearson[source]

Class for calculating the Pearson test statistic and p-value.

The Pearson product-moment correlation coefficient is a measure of the linear correlation between two random variables [12]. It takes a value between -1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.

See also

RV
RV test statistic and p-value.
CCA
CCA test statistic and p-value.
Spearman
Spearman's rho test statistic and p-value.
Kendall
Kendall's tau test statistic and p-value.

Notes

This class is a wrapper of scipy.stats.pearsonr. The statistic can be derived as follows [12]:

Let \(x\) and \(y\) be \((n, 1)\) samples of random variables \(X\) and \(Y\). Let \(\hat{\mathrm{cov}} (x, y)\) be the sample covariance, and let \(\hat{\sigma}_x\) and \(\hat{\sigma}_y\) be the sample standard deviations of \(x\) and \(y\). Then, the Pearson correlation coefficient is,

\[\mathrm{Pearson}_n (x, y) = \frac{\hat{\mathrm{cov}} (x, y)} {\hat{\sigma}_x \hat{\sigma}_y}\]
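
Since this class wraps scipy.stats.pearsonr, the formula can be checked directly against it on a small, made-up sample:

>>> import numpy as np
>>> from scipy.stats import pearsonr
>>> x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
>>> y = np.array([2.0, 1.0, 5.0, 4.0, 6.0])
>>> # sample covariance divided by the product of sample standard deviations
>>> manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
>>> stat, pvalue = pearsonr(x, y)
>>> bool(np.isclose(manual, stat))
True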

References

[12] Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352), 240-242.
test(x, y)[source]

Calculates the Pearson test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.

Returns:

stat : float

The computed Pearson statistic.

pvalue : float

The computed Pearson p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import Pearson
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Pearson().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Kendall's tau

class mgc.independence.Kendall[source]

Class for calculating the Kendall's \(\tau\) test statistic and p-value.

Kendall's \(\tau\) coefficient is a statistic that measures the ordinal association between two quantities. The Kendall's \(\tau\) correlation is high when the two variables have a similar rank ordering relative to the other observations [13]. Both this and the closely related Spearman's \(\rho\) coefficient are special cases of a general correlation coefficient.

See also

Pearson
Pearson product-moment correlation test statistic and p-value.
Spearman
Spearman's rho test statistic and p-value.

Notes

This class is a wrapper of scipy.stats.kendalltau. The statistic can be derived as follows [13]:

Let \(x\) and \(y\) be \((n, 1)\) samples of random variables \(X\) and \(Y\). Define \((x_i, y_i)\) and \((x_j, y_j)\) as concordant if the ranks agree: \(x_i > x_j\) and \(y_i > y_j\), or \(x_i < x_j\) and \(y_i < y_j\). They are discordant if the ranks disagree: \(x_i > x_j\) and \(y_i < y_j\), or \(x_i < x_j\) and \(y_i > y_j\). If \(x_i = x_j\) or \(y_i = y_j\), the pair is neither concordant nor discordant and is said to be tied. Let \(n_c\) and \(n_d\) be the number of concordant and discordant pairs respectively and \(n_0 = n(n-1) / 2\). In the case of no ties, the test statistic is defined as

\[\mathrm{Kendall}_n (x, y) = \frac{n_c - n_d}{n_0}\]

Further, define \(n_1 = \sum_i \frac{t_i (t_i - 1)}{2}\) and \(n_2 = \sum_j \frac{u_j (u_j - 1)}{2}\), where \(t_i\) is the number of tied values in the \(i\)th group of ties for \(x\) and \(u_j\) is the number of tied values in the \(j\)th group of ties for \(y\). Then, in the presence of ties, the statistic is [14],

\[\mathrm{Kendall}_n (x, y) = \frac{n_c - n_d} {\sqrt{(n_0 - n_1) (n_0 - n_2)}}\]
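
The counting definition can be checked against scipy.stats.kendalltau on a small, made-up sample with no ties:

>>> import numpy as np
>>> from itertools import combinations
>>> from scipy.stats import kendalltau
>>> x = np.array([3, 1, 4, 2, 5])
>>> y = np.array([2, 1, 5, 3, 4])
>>> pairs = list(combinations(zip(x, y), 2))
>>> nc = sum((xi - xj) * (yi - yj) > 0 for (xi, yi), (xj, yj) in pairs)
>>> nd = sum((xi - xj) * (yi - yj) < 0 for (xi, yi), (xj, yj) in pairs)
>>> n0 = len(x) * (len(x) - 1) / 2
>>> stat, pvalue = kendalltau(x, y)
>>> bool(np.isclose((nc - nd) / n0, stat))
True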

References

[13] Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81-93.
[14]Agresti, A. (2010). Analysis of ordinal categorical data (Vol. 656). John Wiley & Sons.
test(x, y)[source]

Calculates the Kendall's \(\tau\) test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.

Returns:

stat : float

The computed Kendall's tau statistic.

pvalue : float

The computed Kendall's tau p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import Kendall
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Kendall().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'

Spearman's rho

class mgc.independence.Spearman[source]

Class for calculating the Spearman's \(\rho\) test statistic and p-value.

Spearman's \(\rho\) coefficient is a nonparametric measure of rank correlation between two variables. It is equivalent to Pearson's correlation computed on the ranks of the data.

See also

Pearson
Pearson product-moment correlation test statistic and p-value.
Kendall
Kendall's tau test statistic and p-value.

Notes

This class is a wrapper of scipy.stats.spearmanr. The statistic can be derived as follows [15]:

Let \(x\) and \(y\) be \((n, 1)\) samples of random variables \(X\) and \(Y\). Let \(rg_x\) and \(rg_y\) be the rank variables obtained by ranking the \(n\) raw scores of \(x\) and \(y\). Let \(\hat{\mathrm{cov}} (rg_x, rg_y)\) be the sample covariance, and let \(\hat{\sigma}_{rg_x}\) and \(\hat{\sigma}_{rg_y}\) be the sample standard deviations of the rank variables. Then, the Spearman's \(\rho\) coefficient is,

\[\mathrm{Spearman}_n (x, y) = \frac{\hat{\mathrm{cov}} (rg_x, rg_y)} {\hat{\sigma}_{rg_x} \hat{\sigma}_{rg_y}}\]
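
Because Spearman's \(\rho\) is Pearson's correlation applied to ranks, the wrapped scipy.stats.spearmanr can be checked on a small, made-up sample:

>>> import numpy as np
>>> from scipy.stats import pearsonr, rankdata, spearmanr
>>> x = np.array([10.0, 20.0, 5.0, 40.0, 30.0])
>>> y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
>>> # rank both variables, then apply the Pearson formula to the ranks
>>> manual, _ = pearsonr(rankdata(x), rankdata(y))
>>> stat, pvalue = spearmanr(x, y)
>>> bool(np.isclose(manual, stat))
True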

References

[15]Myers, J. L., Well, A. D., & Lorch Jr, R. F. (2013). Research design and statistical analysis. Routledge.
test(x, y)[source]

Calculates the Spearman's \(\rho\) test statistic and p-value.

Parameters:

x, y : ndarray

Input data matrices. x and y must have the same number of samples and dimensions. That is, the shapes must be (n, 1) where n is the number of samples.

Returns:

stat : float

The computed Spearman's rho statistic.

pvalue : float

The computed Spearman's rho p-value.

Examples

>>> import numpy as np
>>> from mgc.independence import Spearman
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Spearman().test(x, y)
>>> '%.1f, %.2f' % (stat, pvalue)
'1.0, 0.00'