Извлеките достоверные интервалы для надежных корреляций в R

В настоящее время я знаю, как использовать pbcor из WRS2пакет для извлечения надежных корреляций. Эта функция вычисляет 95% доверительные интервалы начальной загрузки вокруг предполагаемой устойчивой корреляции. Для согласованности с остальными моими анализами и рукописью мне нужно извлекать достоверные интервалы вместо доверительных интервалов.

Как я могу извлечь 95% достоверные интервалы, вместо доверительных интервалов 95%? Есть ли способ сделать это с помощью pbcor?

Мой набор данных содержит 210 наблюдений, но вот часть данных:

Individual  varA    varB
1   2.9380842   0.09896456
2   2.9380842   -1.38772037
3   -0.6879859  -2.41310243
4   -0.6879859  0.55722346
5   -2.3129564  -1.34140699
6   -2.3129564  -1.75604301
7   -0.4937431  0.78381085
8   -0.4937431  0.38320385
9   -0.8558126  0.82125672
10  -0.8558126  0.06346062
11  -0.9211026  -1.67170174

Соответствующий код:

WRS2::pbcor(data$varA, data$varB, ci=TRUE, nboot=1000, beta=0.1) 
>robust correlation coefficient: 0.275
>test statistic: 0.8582
>p-value:0.41307
>bootstrap CI: [-0.3564; 0.7792]

1 ответ

Решение

Привет, @Blundering Ecologist

Вот полный пример оценки достоверных интервалов с использованием байесовского моделирования для сравнения с надежными доверительными интервалами на основе WRS2: если вы используете set.seed, вы сможете воссоздать данные. Ваши результаты будут другими, когда вы перейдете к байесовской части.,как это должно. Мои комментарии включены в приведенный ниже код.

> ## generate data
> set.seed(123)     # for reproducibility
> x <- seq(1:25)+rnorm(25)     
> y <- seq(26:50)-7*rnorm(25)  
> y[10] <- y[10] * 2.5  # introduce outlier in 10th record
> y[20] <- y[20] * 1.5 # introduce outlier in 20th record
> 
> simdat <- cbind(data.frame(x), data.frame(y)) # create data frame
> 
> 
> ## standardize data
> library(robustHD)      # very useful functions standardize() & robStandardize()
Loading required package: ggplot2
Loading required package: perry
Loading required package: parallel
Loading required package: robustbase
> simdat$x_std <- standardize(simdat$x)     # mean and sd
> simdat$x_std_rob <- robStandardize(simdat$x)  # median and MAD
> 
> ## repeat for y
> simdat$y_std <- standardize(simdat$y)     # uses mean and sd
> simdat$y_std_rob <- robStandardize(simdat$y)  # uses median and MAD
> 
> head(simdat) # to see variable names of the standardized data
          x         y      x_std  x_std_rob        y_std   y_std_rob
1 0.4395244 12.806853 -1.7617645 -1.4269699  0.003689598  0.00000000
2 1.7698225 -3.864509 -1.5746770 -1.2805106 -1.705238038 -1.39579772
3 4.5587083  1.926388 -1.1824599 -0.9734679 -1.111631801 -0.91095903
4 4.0705084 11.966959 -1.2511183 -1.0272163 -0.082405292 -0.07031957
5 5.1292877 -3.776704 -1.1022161 -0.9106499 -1.696237444 -1.38844632
6 7.7150650  3.014750 -0.7385634 -0.6259685 -1.000067292 -0.81983669
> 
> ## get uncorrected correlation
> cor(simdat$x, simdat$y)
[1] 0.7507123
> 
> ## get boot-strapped correlation that corrects for the 2 outliers
> library(WRS2)
> corrxy <- WRS2::pbcor(simdat$y, simdat$x, ci=TRUE, nboot=2000, beta=0.1)
> corrxy
Call:
WRS2::pbcor(x = simdat$y, y = simdat$x, beta = 0.1, ci = TRUE, 
    nboot = 2000)

Robust correlation coefficient: 0.7657
Test statistic: 5.7084
p-value: 1e-05 

Bootstrap CI: [0.5113; 0.9116]  # Boot-strapped CI

> ## set up bivariate Bayesian regression without intercept
> ## so we get the pure zero-order correlation
> library(brms)
Loading required package: Rcpp
Loading 'brms' package (version 2.13.5). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').

Attaching package: ‘brms’

The following object is masked from ‘package:robustbase’:

    epilepsy

The following object is masked from ‘package:stats’:

    ar

> library(shinystan) 
> # gives a lovely visualization of the brms model fit object
Loading required package: shiny

This is shinystan version 2.5.0

> # in the formula below "y ~ 0 + x_std", 0 ensures there is no intercept
> mod1 <- brm( y_std ~ 0 + x_std, data=simdat, cores=2, chains=2)
Compiling Stan program...
Start sampling

SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.8e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.28 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)

SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 2.1e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.031892 seconds (Warm-up)
Chain 2:                0.025839 seconds (Sampling)
Chain 2:                0.057731 seconds (Total)
Chain 2: 
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.032274 seconds (Warm-up)
Chain 1:                0.028699 seconds (Sampling)
Chain 1:                0.060973 seconds (Total)
Chain 1: 
> summary(mod1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y_std ~ 0 + x_std 
   Data: simdat (Number of observations: 25) 
Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 2000

Population-Level Effects: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
x_std     0.76      0.14     0.48     1.05 1.00     1187     1030

# Boot-strap CI: 0.51 to 0.91 compared to  (corrects for outliers)
# Bayesian Credible Interval: 0.48 to 1.05 (does not correct for outliers)
# Since the Boot-strap CI is within the Bayesian Credible Interval
# I would use that.
# Raw Corr: 0.75 vs Bayesian Corr: 0.76 vs Bootstrap Corr: 0.77

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.69      0.11     0.52     0.95 1.00     1345     1132

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
> # extract posterior samples of population-level effects 

> samples1 <- posterior_samples(mod1, "^b") # this data frame has all the values of correlation
> head(samples1)
    b_x_std
1 0.9093316
2 0.7281373
3 0.7207291
4 0.6822180
5 0.9747108
6 0.9653564

> samples2 <- posterior_samples(mod1, "sigma") # this data frame has all the values of variance around correlation
> head(samples2)
      sigma
1 0.7320897
2 0.7212673
3 0.6204091
4 0.7844105
5 0.9443782
6 0.7311916

> launch_shinystan(mod1) # launches in your web browser

> write.csv(samples1,"/home/Documents/Projects/Rcode/rob_corr_brms.csv", row.names = FALSE) # to do more using Excel
> write.csv(samples2,"/home/Documents/Projects/Rcode/rob_corr_var_brms.csv", row.names = FALSE) # to do more using Excel  

> # To learn more about brms see this link below

http://paul-buerkner.github.io/brms/articles/index.html

Here is the second model run with the robust standardized x & y

> mod_rob <- brm( y_std_rob ~ 0 + x_std_rob, data=simdat, cores=2, chains=2) 
Compiling Stan program...
Start sampling

SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.4e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)

SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 2).
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: 
Chain 2: Gradient evaluation took 2.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.025874 seconds (Warm-up)
Chain 1:                0.028535 seconds (Sampling)
Chain 1:                0.054409 seconds (Total)
Chain 1: 
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.025316 seconds (Warm-up)
Chain 2:                0.026648 seconds (Sampling)
Chain 2:                0.051964 seconds (Total)
Chain 2: 
> summary(mod_rob)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y_std_rob ~ 0 + x_std_rob 
   Data: simdat (Number of observations: 25) 
Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 2000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
x_std_rob     0.77      0.14     0.50     1.06 1.00     1639     1201

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.57      0.08     0.43     0.76 1.00     1314      977

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

> samples_rob <- posterior_samples(mod_rob, "^b")
> head(samples_rob)
  b_x_std_rob
1   0.8917219
2   0.6036900
3   0.9898435
4   0.6937937
5   0.7883487
6   0.8781157
> samples_rob_var <- posterior_samples(mod_rob, "sigma")
> head(samples_rob_var)
      sigma
1 0.5646454
2 0.4547035
3 0.6541133
4 0.4691680
5 0.6478816
6 0.4777489
Другие вопросы по тегам