Извлеките достоверные интервалы для надежных корреляций в R
В настоящее время я знаю, как использовать
pbcor
из
WRS2
пакет для извлечения надежных корреляций. Эта функция вычисляет 95% доверительные интервалы начальной загрузки вокруг предполагаемой устойчивой корреляции. Для согласованности с остальными моими анализами и рукописью мне нужно извлекать достоверные интервалы вместо доверительных интервалов.
Как я могу извлечь 95% достоверные интервалы, вместо доверительных интервалов 95%? Есть ли способ сделать это с помощью
pbcor
?
Мой набор данных содержит 210 наблюдений, но вот часть данных:
Individual varA varB
1 2.9380842 0.09896456
2 2.9380842 -1.38772037
3 -0.6879859 -2.41310243
4 -0.6879859 0.55722346
5 -2.3129564 -1.34140699
6 -2.3129564 -1.75604301
7 -0.4937431 0.78381085
8 -0.4937431 0.38320385
9 -0.8558126 0.82125672
10 -0.8558126 0.06346062
11 -0.9211026 -1.67170174
Соответствующий код:
WRS2::pbcor(data$varA, data$varB, ci=TRUE, nboot=1000, beta=0.1)
>robust correlation coefficient: 0.275
>test statistic: 0.8582
>p-value:0.41307
>bootstrap CI: [-0.3564; 0.7792]
1 ответ
Привет, @Blundering Ecologist
Вот полный пример оценки достоверных интервалов с использованием байесовского моделирования для сравнения с надежными доверительными интервалами на основе WRS2: если вы используете set.seed, вы сможете воссоздать данные. Ваши результаты будут другими, когда вы перейдете к байесовской части.,как это должно. Мои комментарии включены в приведенный ниже код.
> ## generate data
> set.seed(123) # for reproducibility
> x <- seq(1:25)+rnorm(25)
> y <- seq(26:50)-7*rnorm(25)
> y[10] <- y[10] * 2.5 # introduce outlier in 10th record
> y[20] <- y[20] * 1.5 # introduce outlier in 20th record
>
> simdat <- cbind(data.frame(x), data.frame(y)) # create data frame
>
>
> ## standardize data
> library(robustHD) # very useful functions standardize() & robStandardize()
Loading required package: ggplot2
Loading required package: perry
Loading required package: parallel
Loading required package: robustbase
> simdat$x_std <- standardize(simdat$x) # mean and sd
> simdat$x_std_rob <- robStandardize(simdat$x) # median and MAD
>
> ## repeat for y
> simdat$y_std <- standardize(simdat$y) # uses mean and sd
> simdat$y_std_rob <- robStandardize(simdat$y) # uses median and MAD
>
> head(simdat) # to see variable names of the standardized data
x y x_std x_std_rob y_std y_std_rob
1 0.4395244 12.806853 -1.7617645 -1.4269699 0.003689598 0.00000000
2 1.7698225 -3.864509 -1.5746770 -1.2805106 -1.705238038 -1.39579772
3 4.5587083 1.926388 -1.1824599 -0.9734679 -1.111631801 -0.91095903
4 4.0705084 11.966959 -1.2511183 -1.0272163 -0.082405292 -0.07031957
5 5.1292877 -3.776704 -1.1022161 -0.9106499 -1.696237444 -1.38844632
6 7.7150650 3.014750 -0.7385634 -0.6259685 -1.000067292 -0.81983669
>
> ## get uncorrected correlation
> cor(simdat$x, simdat$y)
[1] 0.7507123
>
> ## get boot-strapped correlation that corrects for the 2 outliers
> library(WRS2)
> corrxy <- WRS2::pbcor(simdat$y, simdat$x, ci=TRUE, nboot=2000, beta=0.1)
> corrxy
Call:
WRS2::pbcor(x = simdat$y, y = simdat$x, beta = 0.1, ci = TRUE,
nboot = 2000)
Robust correlation coefficient: 0.7657
Test statistic: 5.7084
p-value: 1e-05
Bootstrap CI: [0.5113; 0.9116] # Boot-strapped CI
> ## set up bivariate Bayesian regression without intercept
> ## so we get the pure zero-order correlation
> library(brms)
Loading required package: Rcpp
Loading 'brms' package (version 2.13.5). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').
Attaching package: ‘brms’
The following object is masked from ‘package:robustbase’:
epilepsy
The following object is masked from ‘package:stats’:
ar
> library(shinystan)
> # gives a lovely visualization of the brms model fit object
Loading required package: shiny
This is shinystan version 2.5.0
> # in the formula below "y ~ 0 + x_std", 0 ensures there is no intercept
> mod1 <- brm( y_std ~ 0 + x_std, data=simdat, cores=2, chains=2)
Compiling Stan program...
Start sampling
SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 2.8e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.28 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 2.1e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 0.031892 seconds (Warm-up)
Chain 2: 0.025839 seconds (Sampling)
Chain 2: 0.057731 seconds (Total)
Chain 2:
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.032274 seconds (Warm-up)
Chain 1: 0.028699 seconds (Sampling)
Chain 1: 0.060973 seconds (Total)
Chain 1:
> summary(mod1)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y_std ~ 0 + x_std
Data: simdat (Number of observations: 25)
Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 2000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
x_std 0.76 0.14 0.48 1.05 1.00 1187 1030
# Boot-strap CI: 0.51 to 0.91 compared to (corrects for outliers)
# Bayesian Credible Interval: 0.48 to 1.05 (does not correct for outliers)
# Since the Boot-strap CI is within the Bayesian Credible Interval
# I would use that.
# Raw Corr: 0.75 vs Bayesian Corr: 0.76 vs Bootstrap Corr: 0.77
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.69 0.11 0.52 0.95 1.00 1345 1132
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
> # extract posterior samples of population-level effects
> samples1 <- posterior_samples(mod1, "^b") # this data frame has all the values of correlation
> head(samples1)
b_x_std
1 0.9093316
2 0.7281373
3 0.7207291
4 0.6822180
5 0.9747108
6 0.9653564
> samples2 <- posterior_samples(mod1, "sigma") # this data frame has all the values of variance around correlation
> head(samples2)
sigma
1 0.7320897
2 0.7212673
3 0.6204091
4 0.7844105
5 0.9443782
6 0.7311916
> launch_shinystan(mod1) # launches in your web browser
> write.csv(samples1,"/home/Documents/Projects/Rcode/rob_corr_brms.csv", row.names = FALSE) # to do more using Excel
> write.csv(samples2,"/home/Documents/Projects/Rcode/rob_corr_var_brms.csv", row.names = FALSE) # to do more using Excel
> # To learn more about brms see this link below
http://paul-buerkner.github.io/brms/articles/index.html
Here is the second model run with the robust standardized x & y
> mod_rob <- brm( y_std_rob ~ 0 + x_std_rob, data=simdat, cores=2, chains=2)
Compiling Stan program...
Start sampling
SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 2.4e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
SAMPLING FOR MODEL '9faff91dfca8b644fd3fe4e0f6965785' NOW (CHAIN 2).
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2:
Chain 2: Gradient evaluation took 2.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.025874 seconds (Warm-up)
Chain 1: 0.028535 seconds (Sampling)
Chain 1: 0.054409 seconds (Total)
Chain 1:
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 0.025316 seconds (Warm-up)
Chain 2: 0.026648 seconds (Sampling)
Chain 2: 0.051964 seconds (Total)
Chain 2:
> summary(mod_rob)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y_std_rob ~ 0 + x_std_rob
Data: simdat (Number of observations: 25)
Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 2000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
x_std_rob 0.77 0.14 0.50 1.06 1.00 1639 1201
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.57 0.08 0.43 0.76 1.00 1314 977
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
> samples_rob <- posterior_samples(mod_rob, "^b")
> head(samples_rob)
b_x_std_rob
1 0.8917219
2 0.6036900
3 0.9898435
4 0.6937937
5 0.7883487
6 0.8781157
> samples_rob_var <- posterior_samples(mod_rob, "sigma")
> head(samples_rob_var)
sigma
1 0.5646454
2 0.4547035
3 0.6541133
4 0.4691680
5 0.6478816
6 0.4777489