Package 'semidist'

Title: Measure Dependence Between Categorical and Continuous Variables
Description: Semi-distance and mean-variance (MV) index are proposed to measure the dependence between a categorical random variable and a continuous variable. Test of independence and feature screening for classification problems can be implemented via the two dependence measures. For the details of the methods, see Zhong et al. (2023) <doi:10.1080/01621459.2023.2284988>; Cui and Zhong (2019) <doi:10.1016/j.csda.2019.05.004>; Cui, Li and Zhong (2015) <doi:10.1080/01621459.2014.920256>.
Authors: Wei Zhong [aut], Zhuoxi Li [aut, cre, cph], Wenwen Guo [aut], Hengjian Cui [aut], Runze Li [aut]
Maintainer: Zhuoxi Li <[email protected]>
License: MIT + file LICENSE
Version: 0.1.0
Built: 2024-11-22 04:21:23 UTC
Source: https://github.com/wzhong41/semidist

Mutual information independence test (categorical-continuous case)

Description

Implement the mutual information independence test (MINT) (Berrett and Samworth, 2019), with a modification in how the mutual information (MI) between a categorical random variable and a continuous variable is estimated. The modification is based on the idea of Ross (2014).

MINTsemiperm() implements the permutation independence test via mutual information, but the parameter k should be pre-specified.

MINTsemiauto() automatically selects an appropriate k based on a data-driven procedure, and conducts MINTsemiperm() with the k chosen.

Usage

MINTsemiperm(X, y, k, B = 1000)

MINTsemiauto(X, y, kmax, B1 = 1000, B2 = 1000)

Arguments

X

Data of multivariate continuous variables, which should be an n-by-p matrix, or a vector of length n (for a univariate variable).

y

Data of categorical variables, which should be a factor of length n.

k

Number of nearest neighbors. See References for details.

B, B1, B2

Number of permutations to use. Defaults to 1000.

kmax

Maximum k in the automatic search for optimal k.

Value

A list with class "indtest" containing the following components

  • method: name of the test;

  • name_data: names of the X and y;

  • n: sample size of the data;

  • num_perm: number of replications in permutation test;

  • stat: test statistic;

  • pvalue: computed p-value.

For MINTsemiauto(), the list also contains

  • kmax: maximum k in the automatic search for optimal k;

  • kopt: optimal k chosen.

References

  1. Berrett, Thomas B., and Richard J. Samworth. "Nonparametric independence testing via mutual information." Biometrika 106, no. 3 (2019): 547-566.

  2. Ross, Brian C. "Mutual information between discrete and continuous data sets." PloS one 9, no. 2 (2014): e87357.

Examples

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])

MINTsemiperm(X, y, 5)
MINTsemiauto(X, y, kmax = 32)

Mean Variance (MV) statistics

Description

Compute the statistics of mean variance (MV) index, which can measure the dependence between a univariate continuous variable and a categorical variable. See Cui, Li and Zhong (2015); Cui and Zhong (2019) for details.

Usage

mv(x, y, return_mat = FALSE)

Arguments

x

Data of univariate continuous variables, which should be a vector of length n.

y

Data of categorical variables, which should be a factor of length n.

return_mat

A boolean. If FALSE (the default), only the calculated statistic is returned. If TRUE, also return the matrix of the indicator for x <= x_i, which is useful for the permutation test.

Value

The value of the corresponding sample statistic.

If the argument return_mat of mv() is set as TRUE, a list with elements

  • mv: the MV index statistic;

  • mat_x: the matrix of the indicators for x <= x_i;

will be returned.
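As a rough illustration of what the statistic measures, the following base-R sketch computes the sample MV index in the form given by Cui, Li and Zhong (2015): the class-weighted average squared gap between each within-class empirical CDF and the pooled empirical CDF. The function name mv_sketch is hypothetical, and the package's own mv() may differ in implementation details.

```r
# Sketch only: sample MV index, assuming the form
#   MV_n = (1/n) * sum_r sum_j p_hat_r * (F_hat_r(x_j) - F_hat(x_j))^2
mv_sketch <- function(x, y) {
  y <- droplevels(as.factor(y))
  n <- length(x)
  stopifnot(length(y) == n)
  ind <- outer(x, x, "<=")          # ind[i, j] is TRUE when x_i <= x_j
  F_all <- colMeans(ind)            # pooled empirical CDF at each x_j
  p_hat <- as.numeric(table(y)) / n # class proportions, in level order
  # F_r[r, j]: empirical CDF of class r evaluated at x_j
  F_r <- t(sapply(levels(y), function(r) {
    colMeans(ind[y == r, , drop = FALSE])
  }))
  sum(p_hat * sweep(F_r, 2, F_all)^2) / n
}

# Functionally dependent toy data, as in the Examples
x <- c(rep(0.3, 10), rep(0.2, 10), rep(-0.1, 10))
y <- factor(rep(1:3, each = 10))
mv_dep <- mv_sketch(x, y)
```

The indicator matrix ind corresponds to the "indicator for x <= x_i" that return_mat = TRUE exposes for reuse in permutation tests.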

See Also

  • mv_test() for implementing independence test via MV index;

  • mv_sis() for implementing feature screening via MV index.

Examples

x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
print(mv(x, y))

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(mv(x, y))

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
print(mv(x, y))

Feature screening via MV Index

Description

Implement the feature screening for the classification problem via MV index.

Usage

mv_sis(X, y, d = NULL, parallel = FALSE)

Arguments

X

Data of multivariate covariates, which should be an n-by-p matrix.

y

Data of categorical response, which should be a factor of length n.

d

An integer specifying how many features should be kept after screening. Defaults to NULL. If NULL, then it will be set as [n / log(n)], where [x] denotes the integer part of x.

parallel

A boolean indicating whether to calculate parallelly via furrr::future_map. Defaults to FALSE.

Value

A list of the objects about the implemented feature screening:

  • measurement: sample MV index calculated for each single covariate;

  • selected: indices or names (if available as colnames of X) of covariates that are selected after feature screening;

  • ordering: order of the calculated measurements of each single covariate. The first one is the largest, and the last is the smallest.
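The screening step described above can be sketched in base R as "score every column, rank, keep the top d". The names screen_sketch and mv_utility are hypothetical; the marginal utility used here is a plain re-implementation of the sample MV index and is only assumed to match what the package computes.

```r
# Sketch only: marginal MV utility for one covariate (y must be a factor)
mv_utility <- function(x, y) {
  ind <- outer(x, x, "<=")
  F_all <- colMeans(ind)
  sum(sapply(levels(y), function(r) {
    mean(y == r) * mean((colMeans(ind[y == r, , drop = FALSE]) - F_all)^2)
  }))
}

# Sketch only: rank covariates by a marginal utility and keep the top d
screen_sketch <- function(X, y, utility, d = NULL) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  if (is.null(d)) d <- floor(n / log(n))   # default [n / log(n)]
  d <- min(d, ncol(X))                     # cannot keep more than p features
  measurement <- apply(X, 2, utility, y = y)
  ordering <- order(measurement, decreasing = TRUE)
  selected <- ordering[seq_len(d)]
  if (!is.null(colnames(X))) selected <- colnames(X)[selected]
  list(measurement = measurement, selected = selected, ordering = ordering)
}

X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])
res <- screen_sketch(X, y, mv_utility, d = 4)
```

Capping d at the number of columns is a choice made for this sketch, not documented package behavior.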

Examples

X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])

mv_sis(X, y, d = 4)

MV independence test

Description

Implement the MV independence test via permutation test, or via the asymptotic approximation.

Usage

mv_test(x, y, test_type = "perm", num_perm = 10000)

Arguments

x

Data of univariate continuous variables, which should be a vector of length n.

y

Data of categorical variables, which should be a factor of length n.

test_type

Type of the test:

  • "perm" (the default): Implement the test via permutation test;

  • "asym": Implement the test via the asymptotic approximation.

See the Reference for details.

num_perm

The number of replications in permutation test.

Value

A list with class "indtest" containing the following components

  • method: name of the test;

  • name_data: names of the x and y;

  • n: sample size of the data;

  • num_perm: number of replications in permutation test;

  • stat: test statistic;

  • pvalue: computed p-value. (Notice: asymptotic test cannot return a p-value, but only the critical values crit_vals for 90%, 95% and 99% confidence levels.)

Examples

x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

Print Method for Independence Tests Between Categorical and Continuous Variables

Description

Print an object of class "indtest" using a simple print method.

Usage

## S3 method for class 'indtest'
print(x, digits = getOption("digits"), ...)

Arguments

x

"indtest" class object.

digits

minimal number of significant digits.

...

further arguments passed to or from other methods.

Value

None
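For readers unfamiliar with S3 print methods, here is a minimal sketch of how such a method could be written for a list carrying the method, n, stat and pvalue components listed for "indtest" objects. The class name indtest_sketch and the output layout are hypothetical and do not reproduce the package's actual printout.

```r
# Sketch only: a toy S3 print method for a hypothetical class
print.indtest_sketch <- function(x, digits = getOption("digits"), ...) {
  cat("\n", x$method, "\n\n", sep = "")
  cat("n = ", x$n,
      ", stat = ", format(x$stat, digits = digits),
      ", p-value = ", format(x$pvalue, digits = digits), "\n", sep = "")
  invisible(x)   # common convention: return the object invisibly
}

# A hand-built object with made-up values, for illustration only
obj <- structure(
  list(method = "Toy independence test", n = 30, stat = 1.2345, pvalue = 0.04),
  class = "indtest_sketch"
)
print(obj, digits = 3)
```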

Examples

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

Feature screening via semi-distance correlation

Description

Implement the (grouped) feature screening for the classification problem via semi-distance correlation.

Usage

sd_sis(X, y, group_info = NULL, d = NULL, parallel = FALSE)

Arguments

X

Data of multivariate covariates, which should be an n-by-p matrix.

y

Data of categorical response, which should be a factor of length n.

group_info

A list specifying the group information, with elements being sets of indices of covariates in the same group. For example, list(c(1, 2, 3), c(4, 5)) specifies that covariates 1, 2, 3 are in one group and covariates 4, 5 are in another group.

Defaults to NULL. If NULL, then it will be set as list(1, 2, ..., p), that is, each single covariate is treated as a group.

If X has colnames, then the colnames can be used to specify the group_info. For example, list(c("a", "b"), c("c", "d")).

The names of the list can help identify the groups. For example, list(grp_ab = c("a", "b"), grp_cd = c("c", "d")). If names of the list are not specified, c("Grp 1", "Grp 2", ..., "Grp J") will be applied.

d

An integer specifying at least how many (single) features should be kept after screening. For example, if group_info = list(c(1, 2), c(3, 4)) and d = 3, then all features 1, 2, 3, 4 must be selected since it should guarantee at least 3 features are kept.

Defaults to NULL. If NULL, then it will be set as [n / log(n)], where [x] denotes the integer part of x.

parallel

A boolean indicating whether to calculate parallelly via furrr::future_map. Defaults to FALSE.

Value

A list of the objects about the implemented feature screening:

  • group_info: group information;

  • measurement: sample semi-distance correlations calculated for the groups specified in group_info;

  • selected: indices/names of (single) covariates that are selected after feature screening;

  • ordering: order of the calculated measurements of the groups specified in group_info. The first one is the largest, and the last is the smallest.
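One way to realize the "at least d single features" rule for grouped screening is to add whole groups in decreasing order of their measurement until the running feature count reaches d. The sketch below assumes that reading of the rule; select_groups_sketch is a hypothetical helper, and the measurements are made-up numbers.

```r
# Sketch only: keep top-ranked groups until at least d features are covered
select_groups_sketch <- function(measurement, group_info, d) {
  ord <- order(measurement, decreasing = TRUE)   # largest measurement first
  sizes <- lengths(group_info)[ord]              # group sizes in ranked order
  n_groups <- which(cumsum(sizes) >= d)[1]       # first point where count >= d
  if (is.na(n_groups)) n_groups <- length(ord)   # d exceeds total: keep all
  kept <- ord[seq_len(n_groups)]
  list(ordering = ord,
       selected = unlist(group_info[kept], use.names = FALSE))
}

# The case from the description of d: two groups of two features, d = 3
group_info <- list(c(1, 2), c(3, 4))
toy_measurement <- c(0.2, 0.9)                   # hypothetical group scores
sel3 <- select_groups_sketch(toy_measurement, group_info, d = 3)
sel2 <- select_groups_sketch(toy_measurement, group_info, d = 2)
```

With d = 3 both groups are needed, so all four features are selected; with d = 2 the top-ranked group alone suffices.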

See Also

sdcor() for calculating the sample semi-distance correlation.

Examples

X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])

sd_sis(X, y, d = 4)

# Suppose we have prior information for the group structure as
# ("mpg", "drat"), ("disp", "hp") and ("wt", "qsec")
group_info <- list(
  mpg_drat = c("mpg", "drat"),
  disp_hp = c("disp", "hp"),
  wt_qsec = c("wt", "qsec")
)
sd_sis(X, y, group_info, d = 4)

Semi-distance independence test

Description

Implement the semi-distance independence test via permutation test, or via the asymptotic approximation when the dimensionality p of the continuous variables is high.

Usage

sd_test(X, y, test_type = "perm", num_perm = 10000)

Arguments

X

Data of multivariate continuous variables, which should be an n-by-p matrix, or a vector of length n (for a univariate variable).

y

Data of categorical variables, which should be a factor of length n.

test_type

Type of the test:

  • "perm" (the default): Implement the test via permutation test;

  • "asym": Implement the test via the asymptotic approximation when the dimension p of the continuous variables is high.

See the Reference for details.

num_perm

The number of replications in permutation test. Defaults to 10000. See Details and Reference.

Details

The semi-distance independence test statistic is

T_n = n \cdot \widetilde{\text{SDcov}}_n(X, y),

where \widetilde{\text{SDcov}}_n(X, y) can be computed by sdcov(X, y, type = "U").

For the permutation test (test_type = "perm"), a total of K permutation replications are conducted, where K is given by the argument num_perm. The p-value of the permutation test is computed as

\text{p-value} = (\sum_{k=1}^K I(T^{\ast (k)}_{n} \ge T_{n}) + 1) / (K + 1),

where T_{n} is the semi-distance test statistic and T^{\ast (k)}_{n} is the test statistic computed on the k-th permutation sample.

When the dimension of the continuous variables is high, the asymptotic approximation approach can be applied (test_type = "asym"), which is computationally faster since no permutation is needed.
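The permutation p-value formula above can be sketched in base R as follows. The helper names sdcov_u_sketch and perm_test_sketch are hypothetical; the statistic is a direct transcription of the U-type formula for the semi-distance covariance and is only assumed to agree with sdcov(X, y, type = "U"). The sketch recomputes everything for each permutation, so it is far slower than a careful implementation would be.

```r
# Sketch only: U-type semi-distance covariance, with p_tilde_r = (n_r - 1) / (n - 1)
sdcov_u_sketch <- function(X, y) {
  y <- droplevels(as.factor(y))
  n <- length(y)
  a <- as.matrix(dist(as.matrix(X)))                  # pairwise distances
  ind <- outer(as.character(y), levels(y), "==") + 0  # n-by-R indicators
  n_r <- colSums(ind)
  stopifnot(all(n_r >= 2))                            # p_tilde_r must be > 0
  w <- 1 - ind %*% (t(ind) / ((n_r - 1) / (n - 1)))
  diag(w) <- 0                                        # sum runs over i != j
  sum(a * w) / (n * (n - 1))
}

# Sketch only: p-value = (sum_k I(T*_k >= T_n) + 1) / (K + 1)
perm_test_sketch <- function(X, y, num_perm = 199) {
  n <- length(y)
  t_obs <- n * sdcov_u_sketch(X, y)
  t_perm <- replicate(num_perm, n * sdcov_u_sketch(X, sample(y)))
  list(stat = t_obs,
       pvalue = (sum(t_perm >= t_obs) + 1) / (num_perm + 1))
}

set.seed(1)
X <- matrix(0, 30, 3)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
res <- perm_test_sketch(X, y, num_perm = 199)
```

Adding 1 to both numerator and denominator keeps the p-value strictly positive, so its smallest attainable value is 1 / (K + 1).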

Value

A list with class "indtest" containing the following components

  • method: name of the test;

  • name_data: names of the X and y;

  • n: sample size of the data;

  • test_type: type of the test;

  • num_perm: number of replications in permutation test, if test_type = "perm";

  • stat: test statistic;

  • pvalue: computed p-value.

See Also

sdcov() for computing the statistic of semi-distance covariance.

Examples

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
test <- sd_test(X, y)
print(test)

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- sd_test(X, y)
print(test)

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)

# Man-made high-dimensionally independent data ------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)

test <- sd_test(X, y, test_type = "asym")
print(test)

# Man-made high-dimensionally dependent data --------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)

test <- sd_test(X, y, test_type = "asym")
print(test)

Semi-distance covariance and correlation statistics

Description

Compute the statistics (or sample estimates) of semi-distance covariance and correlation. The semi-distance correlation is a standardized version of semi-distance covariance, and it can measure the dependence between a multivariate continuous variable and a categorical variable. See Details for the definition of semi-distance covariance and semi-distance correlation.

Usage

sdcov(X, y, type = "V", return_mat = FALSE)

sdcor(X, y)

Arguments

X

Data of multivariate continuous variables, which should be an n-by-p matrix, or a vector of length n (for a univariate variable).

y

Data of categorical variables, which should be a factor of length n.

type

Type of statistic: "V" (the default) or "U". See Details.

return_mat

A boolean. If FALSE (the default), only the calculated statistic is returned. If TRUE, also return the matrix of the distances of X and the divergences of y, which is useful for the permutation test.

Details

For \bm{X} \in \mathbb{R}^{p} and Y \in \{1, 2, \cdots, R\}, the (population-level) semi-distance covariance is defined as

\mathrm{SDcov}(\bm{X}, Y) = \mathrm{E}\left[\|\bm{X}-\widetilde{\bm{X}}\|\left(1-\sum_{r=1}^R I(Y=r,\widetilde{Y}=r)/p_r\right)\right],

where p_r = P(Y = r) and (\widetilde{\bm{X}}, \widetilde{Y}) is an iid copy of (\bm{X}, Y). The (population-level) semi-distance correlation is defined as

\mathrm{SDcor}(\bm{X}, Y) = \dfrac{\mathrm{SDcov}(\bm{X}, Y)}{\mathrm{dvar}(\bm{X})\sqrt{R-1}},

where \mathrm{dvar}(\bm{X}) is the distance variance (Szekely, Rizzo, and Bakirov 2007) of \bm{X}.

With n observations \{(\bm{X}_i, Y_i)\}_{i=1}^{n}, sdcov() and sdcor() can compute the sample estimates for the semi-distance covariance and correlation.

If type = "V", the semi-distance covariance statistic is computed as a V-statistic, which takes a very similar form as the energy-based statistic with double centering, and is always non-negative. Specifically,

\text{SDcov}_n(\bm{X}, y) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl} B_{kl},

where

A_{kl} = a_{kl} - \bar{a}_{k.} - \bar{a}_{.l} + \bar{a}_{..}

is the double centering (Szekely, Rizzo, and Bakirov 2007) of a_{kl} = \| \bm{X}_k - \bm{X}_l \|, and

B_{kl} = 1 - \sum_{r=1}^{R} I(Y_k = r) I(Y_l = r) / \hat{p}_r

with \hat{p}_r = n_r / n = n^{-1}\sum_{i=1}^{n} I(Y_i = r). The semi-distance correlation statistic is

\text{SDcor}_n(\bm{X}, y) = \dfrac{\text{SDcov}_n(\bm{X}, y)}{\text{dvar}_n(\bm{X})\sqrt{R - 1}},

where \text{dvar}_n(\bm{X}) is the V-statistic of distance variance of \bm{X}.

If type = "U", then the semi-distance covariance statistic is computed as an "estimated U-statistic", which is utilized in the independence test statistic and is not necessarily non-negative. Specifically,

\widetilde{\text{SDcov}}_n(\bm{X}, y) = \frac{1}{n(n-1)} \sum_{i \ne j} \| \bm{X}_i - \bm{X}_j \| \left(1 - \sum_{r=1}^{R} I(Y_i = r) I(Y_j = r) / \tilde{p}_r\right),

where \tilde{p}_r = (n_r-1) / (n-1) = (n-1)^{-1}(\sum_{i=1}^{n} I(Y_i = r) - 1). Note that the test statistic of the semi-distance independence test is

T_n = n \cdot \widetilde{\text{SDcov}}_n(\bm{X}, y).
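The V-type and U-type formulas above translate fairly directly into base R. The sketch below is a plain transcription for illustration, not the package's implementation; sd_stats_sketch is a hypothetical name, and it assumes that dvar_n is the square root of the mean of the squared double-centered distances A_kl^2 (the scale on which the correlation is invariant to rescaling X).

```r
# Sketch only: V-type SDcov, SDcor and U-type SDcov from the formulas in Details
sd_stats_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  R <- nlevels(y)
  stopifnot(length(y) == n, R >= 2)

  a <- as.matrix(dist(X))                              # a_kl = ||X_k - X_l||
  ind <- outer(as.character(y), levels(y), "==") + 0   # n-by-R indicators
  n_r <- colSums(ind)

  # V-type: double-center a, then weight by B_kl with p_hat_r = n_r / n
  A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
  B <- 1 - ind %*% (t(ind) / (n_r / n))
  sdcov_v <- sum(A * B) / n^2

  # Assumption: dvar_n(X) = sqrt(mean of A_kl^2)
  dvar_n <- sqrt(sum(A^2) / n^2)
  sdcor <- sdcov_v / (dvar_n * sqrt(R - 1))

  # U-type: off-diagonal sum with p_tilde_r = (n_r - 1) / (n - 1)
  stopifnot(all(n_r >= 2))
  w <- 1 - ind %*% (t(ind) / ((n_r - 1) / (n - 1)))
  diag(w) <- 0
  sdcov_u <- sum(a * w) / (n * (n - 1))

  list(sdcov_v = sdcov_v, sdcov_u = sdcov_u, sdcor = sdcor)
}

# Functionally dependent toy data, as in the Examples
X <- matrix(0, 30, 3)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
res <- sd_stats_sketch(X, y)
```

For this toy data, where each class sits at its own point, the sketch's semi-distance correlation evaluates to 1 up to rounding error.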

Value

The value of the corresponding sample statistic.

If the argument return_mat of sdcov() is set as TRUE, a list with elements

  • sdcov: the semi-distance covariance statistic;

  • mat_x, mat_y: the matrices of the distances of X and the divergences of y, respectively;

will be returned.

See Also

  • sd_test() for implementing independence test via semi-distance covariance;

  • sd_sis() for implementing groupwise feature screening via semi-distance correlation.

Examples

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
print(sdcov(X, y))
print(sdcor(X, y))

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(sdcov(X, y))
print(sdcor(X, y))

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
print(sdcov(X, y))
print(sdcor(X, y))

Switch the representation of a categorical object

Description

Categorical data with n observations and R levels can typically be represented in two forms in R: a factor of length n, or an n-by-R indicator matrix with elements being 0 or 1. This function switches the form of a categorical object from one to the other.

Usage

switch_cat_repr(obj)

Arguments

obj

an object representing categorical data, either a factor or an indicator matrix with each row representing an observation.

Value

The categorical object in the other form.
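The two representations can be converted into each other with a few lines of base R. The helpers factor_to_indicator and indicator_to_factor below are hypothetical names used for illustration; they are not the package's internals, and they ignore missing values.

```r
# Sketch only: factor of length n -> n-by-R 0/1 indicator matrix
factor_to_indicator <- function(f) {
  f <- as.factor(f)
  M <- outer(as.character(f), levels(f), "==") + 0L
  colnames(M) <- levels(f)
  M
}

# Sketch only: n-by-R 0/1 indicator matrix -> factor of length n
indicator_to_factor <- function(M) {
  lev <- colnames(M)
  if (is.null(lev)) lev <- as.character(seq_len(ncol(M)))
  # each row has exactly one 1; max.col() finds its column
  factor(lev[max.col(M, ties.method = "first")], levels = lev)
}

f <- factor(c("a", "b", "a", "c"))
M <- factor_to_indicator(f)
f_back <- indicator_to_factor(M)
```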


Estimate the trace of the covariance matrix and its square

Description

For a design matrix \mathbf{X}, estimate the trace of its covariance matrix \Sigma = \mathrm{cov}(\mathbf{X}), and the trace of the square of the covariance matrix \Sigma^2.

Usage

tr_estimate(X)

Arguments

X

The design matrix.

Value

A list with elements:

  • tr_S: estimate for trace of \Sigma;

  • tr_S2: estimate for trace of \Sigma^2.
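To make the two target quantities concrete, the sketch below computes naive plug-in versions from the sample covariance matrix. This is an assumption made for illustration: tr_plugin_sketch is a hypothetical name, the plug-in estimate of the trace of \Sigma^2 is known to be biased when p is large relative to n, and tr_estimate() may well use a different, bias-corrected estimator.

```r
# Sketch only: plug-in estimates of tr(Sigma) and tr(Sigma^2)
tr_plugin_sketch <- function(X) {
  S <- cov(as.matrix(X))          # p-by-p sample covariance matrix
  list(
    tr_S  = sum(diag(S)),         # trace of S
    tr_S2 = sum(S^2)              # tr(S %*% S) equals sum of squared entries for symmetric S
  )
}

set.seed(1)
X <- matrix(rnorm(50 * 4), 50, 4)
est <- tr_plugin_sketch(X)
```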