Best and simplest way to find who spoke the top 10 anger words in conversation text
I have a data frame containing the variables "AgentID", "Type", "Date" and "Text"; a subset of it looks like this:
df <- structure(list(AgentID = c("AA0101", "AA0101", "AA0101", "AA0101",
"AA0101"), Type = c("PS", "PS", "PS", "PS", "PS"), Date = c("4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019"), Text = c("I am on social security XXXX and I understand it can not be garnished by Paypal credit because it's federally protected.I owe paypal {$3600.00} I would like them to cancel this please.",
"My XXXX account is being reported late 6 times for XXXX per each loan I was under the impression that I was paying one loan but it's split into three so one payment = 3 or one missed payment would be three missed on my credit,. \n\nMy account is being reported wrong by all credit bureaus because I was in forbearance at the time that these late payments have been reported Section 623 ( a ) ( 2 ) States : If at any time a person who regularly and in the ordinary course of business furnishes information to one or more CRAs determines that the information provided is not complete or accurate, the furnisher must promptly provide complete and accurate information to the CRA. In addition, the furnisher must notify all CRAs that received the information of any corrections, and must thereafter report only the complete and accurate information. \n\nIn this case, I was in forbearance during that tie and document attached proves this. By law, credit need to be reported as of this time with all information and documentation",
"A few weeks ago I started to care for my credit and trying to build it up since I have never used my credit in the past, while checking my I discover some derogatory remarks in my XXXX credit report stating the amount owed of {$1900.00} to XXXX from XX/XX/2015 and another one owed to XXXX for {$1700.00} I would like to address this immediately and either pay off this debt or get this negative remark remove from my report.",
"I disputed this XXXX account with all three credit bureaus, the reported that it was closed in XXXX, now its reflecting closed XXXX once I paid the {$120.00} which I dont believe I owed this amount since it was an fee for a company trying to take money out of my account without my permission, I was charged the fee and my account was closed. I have notified all 3 bureaus to have this removed but they keep saying its correct. One bureau is showing XXXX closed and the other on shows XXXX according to XXXX XXXX, XXXX shows a XXXX, this account has been on my report for seven years",
"On XX/XX/XXXX I went on XXXX XXXX and noticed my score had gone down, went to check out why and seen something from XXXX XXXX and enhanced recovery company ... I also seen that it had come from XXXX and XXXX dated XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX ... I didnt have neither one before, I called and it the rep said it had come from an address Im XXXX XXXX, Florida I have never lived in Florida ever ... .I have also never had XXXX XXXX nor XXXX XXXX ... I need this taken off because it if affecting my credit score ... This is obviously identify theft and fraud..I have never received bills from here which proves that is was not done by me, I havent received any notifications ... if it was not for me checking my score I wouldnt have known nothing of this" )), row.names = c(NA, 5L), class = "data.frame")
First, I found the top 10 anger words using the following:
library(tm)
library(tidytext)
library(tidyverse)
library(sentimentr)
library(wordcloud)
library(ggplot2)
CS <- function(txt){
  MC <- Corpus(VectorSource(txt))
  SW <- stopwords('english')
  # wrap base functions in content_transformer() so the corpus structure is preserved
  MC <- tm_map(MC, content_transformer(tolower))
  MC <- tm_map(MC, removePunctuation)
  MC <- tm_map(MC, removeNumbers)
  MC <- tm_map(MC, removeWords, SW)
  MC <- tm_map(MC, stripWhitespace)
  myTDM <- as.matrix(TermDocumentMatrix(MC))
  # total frequency of each term across all documents
  v <- sort(rowSums(myTDM), decreasing=TRUE)
  FM <- data.frame(word = names(v), freq=v)
  row.names(FM) <- NULL
  FM <- FM %>%
    mutate(word = tolower(word)) %>%
    # drop redaction placeholders such as "xxxx"
    filter(str_count(word, "x") <= 1)
  return(FM)
}
DF <- CS(df$Text)
# using nrc
nrc <- get_sentiments("nrc")
# create final dataset
DF_nrc <- DF %>% inner_join(nrc, by = "word")
And I collected the top 10 anger words as follows:
TAW <- DF_nrc %>%
  filter(sentiment=="anger") %>%
  group_by(word) %>%
  summarize(freq = mean(freq)) %>%
  arrange(desc(freq)) %>%
  top_n(10) %>%
  select(word)
The next thing I wanted to do is find the "agents" who speak these words frequently and rank them. But I am confused about how to do that. Should I search for the words one by one and group everything by agent, or is there a better way? What I am looking for as a result is something like this:
AgentID Words_Spoken Rank
A0001 theft, dispute, money 1
A0001 theft, fraud, 2
.......
1 answer
Not the most elegant solution, but here is how you could count the words row by row:
library(stringr)
# write a new data.frame retaining the AgentID and Date from the original table
new.data <- data.frame(Agent = df$AgentID, Date = df$Date)
# loop over every row of text in the original df
for(i in seq(nrow(new.data))){ # i is the row number in the original df
  # build a temporary object (e101):
  ## str_detect() checks whether the text in row i, df[i, "Text"], contains a given word
  ## sapply() repeats that check for each word in TAW$word
  ## TAW$word[...] then keeps only the words that were found
  e101 <- TAW$word[sapply(TAW$word, function(x) str_detect(df[i, "Text"], x))]
  # store the number of matched words in the new data.frame
  new.data[i, "number_of_TAW"] <- length(e101)
  # store the matched words as one comma-separated string
  new.data[i, "Words_Spoken"] <- ifelse(length(e101)==0, "", paste(e101, collapse=","))
}
new.data
# Agent Date number_of_TAW Words_Spoken
# 1 AA0101 4/1/2019 0
# 2 AA0101 4/1/2019 0
# 3 AA0101 4/1/2019 2 derogatory,remove
# 4 AA0101 4/1/2019 3 fee,money,remove
# 5 AA0101 4/1/2019 1 theft
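One thing to watch with the loop above: str_detect() matches substrings, so a short word such as "fee" would also be found inside "feel". A minimal sketch of whole-word matching with regex word boundaries follows; the `words` and `texts` vectors are made-up stand-ins for TAW$word and df$Text, not data from the question.

```r
library(stringr)

# Hypothetical stand-ins for TAW$word and df$Text
words <- c("fee", "theft", "money")
texts <- c("I feel this is unfair",
           "I was charged a fee",
           "This is theft of my money")

# Plain substring matching: "fee" is also found inside "feel"
substring_hits <- sapply(texts, function(t) sum(str_detect(t, words)),
                         USE.NAMES = FALSE)

# Whole-word matching: wrap each word in \\b word boundaries
patterns <- paste0("\\b", words, "\\b")
word_hits <- sapply(texts, function(t) sum(str_detect(t, patterns)),
                    USE.NAMES = FALSE)

substring_hits  # counts per text with substring matching
word_hits       # counts per text with whole-word matching
```

Matching is also case-sensitive, so lower-casing the text first (for example with str_to_lower()) may be worth considering if the words can appear capitalised.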
If you are more of a dplyr/tidyverse person, you can take an approach using some dplyr verbs, after converting your text data to a tidy format.
First, let's set up some example data with several speakers, one of whom says no anger words. You can use unnest_tokens() to take care of most of the text cleaning steps with its defaults, such as splitting tokens, removing punctuation, etc. Then remove stop words using anti_join(). I show using inner_join() to find the anger words as a separate step, but you could join these up into one big pipe if you like.
library(tidyverse)
library(tidytext)
my_df <- tibble(AgentID = c("AA0101", "AA0101", "AA0102", "AA0103"),
Text = c("I want to report a theft and there has been fraud.",
"I have taken great offense when there was theft and also poison. It is distressing.",
"I only experience soft, fluffy, happy feelings.",
"I have a dispute with the hateful scorpion, and also, I would like to report a fraud."))
my_df
#> # A tibble: 4 x 2
#> AgentID Text
#> <chr> <chr>
#> 1 AA0101 I want to report a theft and there has been fraud.
#> 2 AA0101 I have taken great offense when there was theft and also poison.…
#> 3 AA0102 I only experience soft, fluffy, happy feelings.
#> 4 AA0103 I have a dispute with the hateful scorpion, and also, I would li…
tidy_words <- my_df %>%
  unnest_tokens(word, Text) %>%
  anti_join(get_stopwords())
#> Joining, by = "word"
anger_words <- tidy_words %>%
  inner_join(get_sentiments("nrc") %>%
               filter(sentiment == "anger"))
#> Joining, by = "word"
anger_words
#> # A tibble: 10 x 3
#> AgentID word sentiment
#> <chr> <chr> <chr>
#> 1 AA0101 theft anger
#> 2 AA0101 fraud anger
#> 3 AA0101 offense anger
#> 4 AA0101 theft anger
#> 5 AA0101 poison anger
#> 6 AA0101 distressing anger
#> 7 AA0103 dispute anger
#> 8 AA0103 hateful anger
#> 9 AA0103 scorpion anger
#> 10 AA0103 fraud anger
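For reference, here is a rough sketch of the "one big pipe" variant, where tokenising, stop word removal and the anger lookup are chained together. To keep it runnable without downloading a lexicon, `stop_tbl` and `anger_tbl` are tiny hand-made stand-ins (assumptions) for get_stopwords() and the filtered NRC lexicon; with the real tables the pipe has the same shape.

```r
library(dplyr)
library(tidytext)

# Assumptions: small hand-made stand-ins for get_stopwords() and the NRC anger words
stop_tbl  <- tibble(word = c("i", "a", "to", "and", "there", "has", "been", "want"))
anger_tbl <- tibble(word = c("theft", "fraud"), sentiment = "anger")

my_df <- tibble(AgentID = c("AA0101", "AA0102"),
                Text = c("I want to report a theft and there has been fraud.",
                         "I only experience soft, fluffy, happy feelings."))

anger_words <- my_df %>%
  unnest_tokens(word, Text) %>%          # one lower-cased token per row
  anti_join(stop_tbl, by = "word") %>%   # drop stop words
  inner_join(anger_tbl, by = "word")     # keep only anger words

anger_words
```

Passing `by = "word"` explicitly also silences the "Joining, by" messages shown in the output above.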
Now you know which anger words each person used, and the next step is to count them up and rank people. The dplyr package has fantastic support for exactly this kind of work. First you want to group_by() the person identifier, then calculate a couple of summarized quantities:
- the total number of words (so you can arrange by this)
- a pasted-together string of the words used
Afterwards, arrange by the number of words and make a new column that gives you the rank.
anger_words %>%
  group_by(AgentID) %>%
  summarise(TotalWords = n(),
            WordsSpoken = paste0(word, collapse = ", ")) %>%
  arrange(-TotalWords) %>%
  mutate(Rank = row_number())
#> # A tibble: 2 x 4
#> AgentID TotalWords WordsSpoken Rank
#> <chr> <int> <chr> <int>
#> 1 AA0101 6 theft, fraud, offense, theft, poison, distressi… 1
#> 2 AA0103 4 dispute, hateful, scorpion, fraud 2
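If you only want to count the top 10 anger words from the question rather than every anger word, one option is to filter against that list before summarising. The sketch below assumes a `TAW` table with a `word` column, as in the question, and uses hand-made data in place of the real tables. It also uses min_rank() instead of row_number() so that agents with the same count share a rank; that is a design choice, not something the question requires.

```r
library(dplyr)

# Assumptions: hand-made stand-ins for the anger_words table and the asker's TAW
anger_words <- tibble(
  AgentID = c("AA0101", "AA0101", "AA0101", "AA0101",
              "AA0103", "AA0103", "AA0104", "AA0104"),
  word    = c("theft", "fraud", "poison", "theft",
              "dispute", "fraud", "theft", "dispute")
)
TAW <- tibble(word = c("theft", "fraud", "dispute"))

ranked <- anger_words %>%
  semi_join(TAW, by = "word") %>%    # keep only words that are in the top list
  group_by(AgentID) %>%
  summarise(TotalWords = n(),
            WordsSpoken = paste(unique(word), collapse = ", ")) %>%
  arrange(desc(TotalWords)) %>%
  mutate(Rank = min_rank(desc(TotalWords)))  # tied agents get the same rank

ranked
```

Using unique(word) keeps each word once in WordsSpoken while TotalWords still counts repeats.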
Notice that with this approach, you don't have a zero entry for the person who spoke no anger words; they get dropped at the inner_join(). If you want them in the final data set, you will likely need to join back up with an earlier dataset and use replace_na().
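A minimal sketch of that join-back step, assuming the per-agent counts have already been computed: `anger_counts` below is a hand-made stand-in for the summarised table, and only the AgentID column of `my_df` is needed.

```r
library(dplyr)
library(tidyr)

# Assumptions: hand-made stand-ins for the original data and the summarised counts
my_df <- tibble(AgentID = c("AA0101", "AA0101", "AA0102", "AA0103"))
anger_counts <- tibble(AgentID = c("AA0101", "AA0103"),
                       TotalWords = c(6L, 4L))

all_agents <- my_df %>%
  distinct(AgentID) %>%                                # one row per agent
  left_join(anger_counts, by = "AgentID") %>%          # agents without anger words get NA
  mutate(TotalWords = replace_na(TotalWords, 0L)) %>%  # turn NA into a zero count
  arrange(desc(TotalWords)) %>%
  mutate(Rank = row_number())

all_agents
```

Here AA0102, who used no anger words, stays in the result with a count of zero and the last rank.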
Created on 2019-09-11 by the reprex package (v0.3.0)