Different P values for the chi-squared test in Python and R

Question

I am trying to run a chi-squared test on two categories of biological data. I have a data frame like this:

          Brain  Cerebelum  Heart  Kidney  liver  testis
expected      3         66      1      44     34      88
observed      6         57      4      45     35      69

structure(list(Brain = c(3L, 6L), Cerebelum = c(66L, 57L), heart = c(1L, 
4L), kidney = 44:45, liver = 34:35, testis = c(88L, 69L)), .Names = c("Brain", 
"Cerebelum", "heart", "kidney", "liver", "testis"), class = "data.frame", row.names = c("rand", 
"cns"))

I ran the test in Python:

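from scipy.stats import chisquare
obs = [6, 57, 4, 45, 35, 69]  # "observed" row of the table above
exp = [3, 66, 1, 44, 34, 88]  # "expected" row of the table above
chisquare(obs, f_exp=exp)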

which gives the result:

Power_divergenceResult(statistic=17.381684491978611, pvalue=0.0038300192430189722)

I tried to reproduce the result in R, so I created a CSV file, imported it into R as a data frame, and ran the following code:

d<-read.csv(file)
chisq.test(d)

Pearson's Chi-squared test

data:  d
X-squared = 4.9083, df = 5, p-value = 0.4272

Why do the chi-squared statistic and the P value differ between Python and R? When I calculate it by hand with the simple (O-E)^2/E formula, the chi-squared value is 17.38, as Python reports, but I cannot work out how R arrives at 4.90.

r python-3.x chi-squared

user3015703, Apr 21 '16 at 03:26

1 answer

Answer

user4598520, Apr 21 '16 at 05:31 · Accepted Answer

I can answer your first question.

When you give chisq.test a matrix with at least two rows and columns, it treats it as a two-dimensional contingency table and tests for independence between the row and column classifications. Here is an example and another one.
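
The contingency-table behaviour described above can be checked from Python as well. The following is a minimal sketch, assuming SciPy's chi2_contingency is an acceptable stand-in for what chisq.test(d) does with the 2 x 6 table; the variable names are illustrative and not from the original post.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Both rows of the question's data, treated as a 2 x 6 contingency table
# (this is how chisq.test(d) interprets the data frame).
table = np.array([
    [3, 66, 1, 44, 34, 88],   # "expected" row
    [6, 57, 4, 45, 35, 69],   # "observed" row
])

# Test of independence between rows and columns.
statistic, p_value, dof, expected_counts = chi2_contingency(table)

print(f"X-squared = {statistic:.4f}, df = {dof}, p-value = {p_value:.4f}")
```

Under these assumptions the statistic should agree with the X-squared = 4.9083, df = 5 that R printed in the question, which suggests that R was testing independence of the two rows rather than goodness of fit.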

scipy.stats.chisquare, on the other hand, simply computes X = sum((O_i - E_i)^2 / E_i), the familiar textbook definition of the test statistic.
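
As a sanity check on that formula, here is a small sketch that evaluates the sum directly with NumPy instead of calling chisquare. The direct computation is a deliberate choice: the observed and expected totals differ here (216 vs 236), and newer SciPy releases may refuse such input to chisquare. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([6, 57, 4, 45, 35, 69], dtype=float)
expected = np.array([3, 66, 1, 44, 34, 88], dtype=float)

# Textbook statistic: sum over categories of (O_i - E_i)^2 / E_i,
# using the expected counts exactly as given (no rescaling).
statistic = np.sum((observed - expected) ** 2 / expected)

# Goodness-of-fit test with k categories has k - 1 degrees of freedom.
dof = observed.size - 1
p_value = chi2.sf(statistic, dof)

print(statistic, p_value)
```

This should reproduce the 17.38 and the p-value of about 0.0038 reported in the question.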

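So how do you square the circle? First, pass R only the observed values and supply the expected probabilities through the argument p. Second, pass correct = FALSE so that R does not apply its default continuity correction (according to the help text quoted below, that correction only applies to 2 by 2 tables).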

e <- d[1, ]
o <- d[2, ]
chisq.test(o, p = e / sum(e), correct = FALSE)

Voilà:

Chi-squared test for given probabilities

data:  o
X-squared = 17.139, df = 5, p-value = 0.004243
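
The R value of 17.139 is close to, but not the same as, the 17.38 from the question. A plausible explanation is that p = e / sum(e) turns the expected counts into probabilities, so R's expected counts are rescaled to the observed total (216) rather than the original expected total (236). The sketch below mimics that rescaling in Python; it is an illustration under that assumption, with hypothetical variable names.

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([6, 57, 4, 45, 35, 69], dtype=float)
expected = np.array([3, 66, 1, 44, 34, 88], dtype=float)

# Convert the expected counts to probabilities, as p = e / sum(e) does in R.
probs = expected / expected.sum()

# Expected counts implied by those probabilities at the observed sample size.
scaled_expected = probs * observed.sum()

result = chisquare(observed, f_exp=scaled_expected)
statistic, p_value = result.statistic, result.pvalue

print(statistic, p_value)
```

With the expected counts rescaled this way, the two totals match and the statistic should line up with R's 17.139 instead of 17.38.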

P.S. This is a tricky question for SO; it might be a better fit for Cross Validated. Note that R's default correction may be a good thing compared with scipy. Whether that is actually true is definitely a question for Cross Validated.

P.P.S. The help in ?chisq.test is a bit hard to parse, but I think it is all in there somewhere ;)

 If ‘x’ is a matrix with one row or column, or if ‘x’ is a vector
 and ‘y’ is not given, then a _goodness-of-fit test_ is performed
 (‘x’ is treated as a one-dimensional contingency table).  The
 entries of ‘x’ must be non-negative integers.  In this case, the
 hypothesis tested is whether the population probabilities equal
 those in ‘p’, or are all equal if ‘p’ is not given.

 If ‘x’ is a matrix with at least two rows and columns, it is taken
 as a two-dimensional contingency table: the entries of ‘x’ must be
 non-negative integers.  Otherwise, ‘x’ and ‘y’ must be vectors or
 factors of the same length; cases with missing values are removed,
 the objects are coerced to factors, and the contingency table is
 computed from these.  Then Pearson's chi-squared test is performed
 of the null hypothesis that the joint distribution of the cell
 counts in a 2-dimensional contingency table is the product of the
 row and column marginals.

and

 correct: a logical indicating whether to apply continuity correction
          when computing the test statistic for 2 by 2 tables: one half
          is subtracted from all |O - E| differences; however, the
          correction will not be bigger than the differences
          themselves.  No correction is done if ‘simulate.p.value =
          TRUE’.
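
SciPy's chi2_contingency has a similar switch, correction, which its documentation describes as applying Yates' correction only when there is one degree of freedom. The sketch below uses it as a rough analogue of R's correct argument; the 2 x 2 sub-table is a hypothetical example cut from the first two columns of the question's data purely for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

full = np.array([
    [3, 66, 1, 44, 34, 88],
    [6, 57, 4, 45, 35, 69],
])

# Hypothetical 2 x 2 table: only the first two columns of the data.
sub = full[:, :2]

# 2 x 6 table (df = 5): the correction flag should make no difference.
full_on = chi2_contingency(full, correction=True)[0]
full_off = chi2_contingency(full, correction=False)[0]

# 2 x 2 table (df = 1): the continuity correction shrinks the statistic.
sub_on = chi2_contingency(sub, correction=True)[0]
sub_off = chi2_contingency(sub, correction=False)[0]

print(full_on, full_off)
print(sub_on, sub_off)
```

This suggests that the continuity correction is not what separates 4.9083 from 17.38 for a table of this shape.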