Different sparsity in a DTM depending on tf/tfidf weighting, same corpus

Question

Can anyone explain this?

My understanding:

tf >= 0 (absolute frequency value)

tfidf >= 0 (for negative idf, tf=0)



sparse entry = 0

nonsparse entry > 0

So the sparsity should be exactly the same in the two DTMs created with the code below.

library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
dtm
dtm2

But:

> dtm
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2255/23065**
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency (tf)
> dtm2
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2215/23105**
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

r text-processing tf-idf tm

user6300663, Nov 29 '16 at 12:35

1 answer

Solution

user1327739, Nov 29 '16 at 13:23 · Accepted Answer

The sparsity can differ. A TF-IDF value is zero if TF is zero or if IDF is zero, and IDF is zero when the term occurs in every document. So a term that appears in all documents has a non-zero TF entry everywhere but a zero TF-IDF entry everywhere. Consider the following example:

txts <- c("super World", "Hello World", "Hello super top world")
library(tm)
tf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTf))
tfidf <- TermDocumentMatrix(Corpus(VectorSource(txts)), control=list(weighting=weightTfIdf))

inspect(tf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 8/4
# Sparsity           : 33%
# Maximal term length: 5
# Weighting          : term frequency (tf)
# 
#        Docs
# Terms   1 2 3
#   hello 0 1 1
#   super 1 0 1
#   top   0 0 1
#   world 1 1 1

inspect(tfidf)
# <<TermDocumentMatrix (terms: 4, documents: 3)>>
# Non-/sparse entries: 5/7
# Sparsity           : 58%
# Maximal term length: 5
# Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
# 
#        Docs
# Terms           1         2         3
#   hello 0.0000000 0.2924813 0.1462406
#   super 0.2924813 0.0000000 0.1462406
#   top   0.0000000 0.0000000 0.3962406
#   world 0.0000000 0.0000000 0.0000000

The term super occurs once in document 1, which has 2 terms, and it appears in 2 of the 3 documents:

1/2 * log2(3/2)
# [1] 0.2924813

The term world occurs once in document 3, which has 4 terms, and it appears in all 3 documents:

1/4 * log2(3/3) # 1/4 * 0
# [1] 0
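
The same arithmetic can be checked for the whole toy corpus without tm. The following is a minimal base-R sketch, assuming simple lowercasing and whitespace tokenization in place of tm's own preprocessing; the variable names (`tokens`, `doc_len`, `df`, `idf`) are illustrative and not part of any tm API.

```r
# Toy corpus from the example above, lowercased and split on whitespace
# (an assumed stand-in for tm's tokenization, sufficient for this input).
txts <- c("super World", "Hello World", "Hello super top world")
tokens <- strsplit(tolower(txts), "\\s+")
terms <- sort(unique(unlist(tokens)))

# Raw term counts: rows are terms, columns are documents.
tf <- vapply(tokens,
             function(doc) as.integer(table(factor(doc, levels = terms))),
             integer(length(terms)))
dimnames(tf) <- list(Terms = terms, Docs = as.character(seq_along(txts)))

doc_len <- colSums(tf)        # number of tokens in each document
df <- rowSums(tf > 0)         # number of documents containing each term
idf <- log2(ncol(tf) / df)    # exactly 0 for a term present in every document

# Normalized tf-idf: (count / document length) * idf.
# idf has one value per row, so it recycles down each column as intended.
tfidf <- sweep(tf, 2, doc_len, "/") * idf

sum(tf > 0)      # 8 non-zero entries under tf weighting
sum(tfidf > 0)   # 5 non-zero entries under tf-idf weighting
names(idf)[idf == 0]   # "world", the term whose entries vanish
```

Applied to the question, every term that occurs in all 20 documents of crude would lose 20 non-zero entries under tf-idf. The reported difference of 2255 - 2215 = 40 entries would be consistent with two such terms, although the source does not name them. One way to check, assuming the `dtm` object from the question, is `Terms(dtm)[colSums(as.matrix(dtm) > 0) == nDocs(dtm)]`.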