linear_algebra_stats

Simple Linear Algebra Functions for Statistics

Description

โ€˜broadcastโ€™ provides some simple Linear Algebra Functions for Statistics:
cinv()
sd_lc()
ecumprob()


Usage

cinv(x)

sd_lc(X, vc, bad_rp = NaN)

ecumprob(y, sim, eps = 0)

Arguments

x a real symmetric positive-definite square matrix.
X a numeric (or logical) matrix of multipliers/constants
vc the variance-covariance matrix for the (correlated) random variables.
bad_rp if vc is not a Positive (semi-) Definite matrix, give here the value to replace bad standard deviations with.

y values to estimate the cumulative probability for.
sim a matrix (or data.frame) with at least 500 columns of simulated values.
If sim is given as a dimensionless vector, it will be treated as a matrix with 1 row and length(sim) columns, and this will be noted with a message.
eps a non-negative numeric scaler smaller than 0.1, giving the cut-off value for probabilities.
Probabilities smaller than eps will be replaced with eps, and probabilities larger than 1 - eps will be replaced with 1 - eps.
Set eps = 0 to disable probability trimming.

Details

cinv()
cinv() computes the Choleski inverse of a real symmetric positive-definite square matrix.


sd_lc()
Given the linear combination X %*% b, where:

  • X is a matrix of multipliers/constants;

  • b is a vector of (correlated) random variables;

  • vc is the symmetric variance-covariance matrix for b;

sd_lc(X, vc) computes the standard deviations for the linear combination X %*% b, without making needless copies.
sd_lc(X, vc) will use much less memory than a base โ€˜Rโ€™ approach.
sd_lc(X, vc) will usually be faster than a base โ€˜Rโ€™ approach (depending on the Linear Algebra Library used for base โ€˜Rโ€™).


ecumprob()
The ecumprod(y, sim) function takes a matrix (or data.frame) of simulated values sim, and for each row i (after broadcasting), estimates the cumulative distribution function of sim[i, ], and returns the cumulative probability for y[i].

In terms of statistics, it is equivalent to the following operation for each index i:
ecdf(sim[i,])(y[i])
However, ecumprob() is much faster, and supports NAs/NaNs.

In terms of linear algebra, it is equivalent to the following broadcasted operation:
rowMeans(sim <= y)
where y and sim are broadcaster arrays.
However, ecumprob() is much more memory-efficient, supports a data.frame for sim, and has statistical safety checks.

Value

For cinv():
A matrix.

For sd_lc():
A vector of standard deviations.

For ecumprob():
A vector of cumulative probabilities.
If for any observation i (after broadcasting,) y[i] is NA/NaN or any of sim[i,] is NA/NaN, the result for i will be NA.
If zero-length y or sim is given, a zero-length numeric vector is returned.

References

John A. Rice (2007), Mathematical Statistics and Data Analysis (6th Edition)

See Also

chol, chol2inv

Examples

library("broadcast")


# variances ====
vc <- datasets::ability.cov$cov
X <- matrix(rnorm(100), 100, ncol(vc))

solve(vc)
#>              general       picture        blocks         maze       reading
#> general  0.082259001 -0.0312406436 -7.750932e-03 -0.013309494 -2.061705e-02
#> picture -0.031240644  0.2369906996 -2.484938e-02  0.017844845  8.603286e-04
#> blocks  -0.007750932 -0.0248493822  1.344272e-02 -0.012544830 -3.802671e-05
#> maze    -0.013309494  0.0178448450 -1.254483e-02  0.101625400  5.508423e-03
#> reading -0.020617049  0.0008603286 -3.802671e-05  0.005508423  5.713620e-02
#> vocab   -0.002420800  0.0019394999 -1.157864e-03 -0.002857265 -2.406969e-02
#>                vocab
#> general -0.002420800
#> picture  0.001939500
#> blocks  -0.001157864
#> maze    -0.002857265
#> reading -0.024069692
#> vocab    0.020323179
cinv(vc) # faster than `solve()`, but only works on positive definite matrices
#>              [,1]          [,2]          [,3]         [,4]          [,5]
#> [1,]  0.082259001 -0.0312406436 -7.750932e-03 -0.013309494 -2.061705e-02
#> [2,] -0.031240644  0.2369906996 -2.484938e-02  0.017844845  8.603286e-04
#> [3,] -0.007750932 -0.0248493822  1.344272e-02 -0.012544830 -3.802671e-05
#> [4,] -0.013309494  0.0178448450 -1.254483e-02  0.101625400  5.508423e-03
#> [5,] -0.020617049  0.0008603286 -3.802671e-05  0.005508423  5.713620e-02
#> [6,] -0.002420800  0.0019394999 -1.157864e-03 -0.002857265 -2.406969e-02
#>              [,6]
#> [1,] -0.002420800
#> [2,]  0.001939500
#> [3,] -0.001157864
#> [4,] -0.002857265
#> [5,] -0.024069692
#> [6,]  0.020323179
all(round(solve(vc), 6) == round(cinv(vc), 6)) # they're the same
#> [1] TRUE

sd_lc(X, vc)
#>   [1]  75.1359211  27.4386473  50.7331146  13.3192297  33.3566175   2.8891812
#>   [7]  20.5471208  21.2868429   3.3790892  24.1724684  20.2989638   2.9785054
#>  [13]  27.5281840  17.7735136  11.9377368  22.5238926  34.0795253   4.4239893
#>  [19]  61.2691719  19.6008464   9.3567278  21.3942806   6.4672993   0.1958693
#>  [25]  27.0894076  37.4187359   3.3006852  23.3661104   3.8497865   7.2027153
#>  [31]  11.9908184  51.4312903  19.1933210  14.5627906  37.7366011  19.8011184
#>  [37]  48.5855620  64.4155905   6.8994013  13.2092404  13.6456784  20.1236365
#>  [43]   7.4302461  16.6826254  17.1137022  10.0682649  68.1391616  30.4080246
#>  [49]  11.7730557  39.2535910   3.7998623  12.8598549  20.6669218 107.5367669
#>  [55]  43.2500954  31.3910870  12.4629448  17.0173415  14.8722451  15.1473380
#>  [61]   7.5268708   0.6173182  12.9086581  31.7074580   4.3694738  10.8935361
#>  [67]  10.3933230  24.1452663   2.2048462   4.3607478  27.2460422  18.6467387
#>  [73]   9.4688651  32.4696183  12.5998757   9.3846803  33.6487541  10.6138200
#>  [79]  27.3808021  42.2958340  37.7224107  49.4096424  10.0020110  15.3562417
#>  [85]  44.7684748  39.8387444  59.6049253  13.1797595   1.1564148   9.1536823
#>  [91]  21.4401950   6.3147613  12.7942113  10.6055748  55.9793081   8.1745395
#>  [97]  42.2829607  51.5504603   5.7884400   0.1824216



# ecumprob() ====

sim <- rnbinom(10 * 1e4, mu = 3, size = 2) |> matrix(10, 1e4)
y <- sample(0:9)

# vector:
pnbinom(y[1], mu = 3, size = 2) # real probability
#> [1] 0.8936243
ecumprob(y[1], sim[1, , drop = TRUE]) # approximation
#> [1] 0.9001

# matrix:
cbind(
  real = pnbinom(y, mu = 3, size = 2), # real probability
  approx = ecumprob(y, sim) # approximation
)
#>            real approx
#>  [1,] 0.8936243 0.9001
#>  [2,] 0.1600000 0.1609
#>  [3,] 0.9536426 0.9518
#>  [4,] 0.5248000 0.5268
#>  [5,] 0.6630400 0.6621
#>  [6,] 0.9697669 0.9703
#>  [7,] 0.7667200 0.7692
#>  [8,] 0.8413696 0.8466
#>  [9,] 0.9294561 0.9295
#> [10,] 0.3520000 0.3502

# data.frame:
cbind(
  real = pnbinom(y, mu = 3, size = 2), # real probability
  approx = ecumprob(y, as.data.frame(sim)) # approximation
)
#>            real approx
#>  [1,] 0.8936243 0.9001
#>  [2,] 0.1600000 0.1609
#>  [3,] 0.9536426 0.9518
#>  [4,] 0.5248000 0.5268
#>  [5,] 0.6630400 0.6621
#>  [6,] 0.9697669 0.9703
#>  [7,] 0.7667200 0.7692
#>  [8,] 0.8413696 0.8466
#>  [9,] 0.9294561 0.9295
#> [10,] 0.3520000 0.3502