Finding similar groups of strings in R
I have a vector of about 8,000 strings. Each element of the vector is a company name.
My goal
My goal is to group these company names into clusters so that each cluster contains a group of names that are similar to each other (e.g. ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL, etc. would belong to the same group/cluster, since they are all Shell-based companies, i.e. their names contain the word "Shell").
With a vector of this size, it seems to take forever to find the groups of similar companies using the clustering technique I used; on smaller vectors, however, the method works well.
Let me demonstrate my approach using an example vector of company names, which is much smaller than the original one.
The vector looks something like this:
string=c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
"GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
My attempt
In order to tackle this problem, I used a hierarchical clustering method.
# load packages
pacman::p_load(stringdist, dplyr, tm, gplots)
But first, some prep work:
# Function to clean strings (trimws() is base R; the original trim() is not defined)
str_clean <- function(strings) {
    require(dplyr)
    require(tm)
    strings %>% tolower() %>% removePunctuation() %>% stripWhitespace() %>%
        trimws()
}
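As a quick sanity check (my own illustrative sketch, not from the original post), the cleaner lower-cases, strips punctuation, and collapses whitespace; I use base R's trimws() for the final trim:

```r
library(dplyr)
library(tm)

# same cleaning pipeline as above, with base R's trimws() for the final trim
str_clean <- function(strings) {
    strings %>% tolower() %>% removePunctuation() %>% stripWhitespace() %>%
        trimws()
}

str_clean("  ROYAL  DUTCH  SHELL ")
# "royal dutch shell"
```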
# Clean company names
clean_names = str_clean(string)
n = length(clean_names)
Now to calculate the distances between the names, to be used for clustering:
# Distance methods
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  # size of q-gram
dist.methods <- list()
# create a distance matrix for every pair of names, for each method
for (m in 1:length(methods)) {
    dist <- matrix(NA, ncol = n, nrow = n)  # initialize empty matrix
    for (i in 1:n) {
        for (j in 1:n) {
            dist[i, j] <- stringdist(clean_names[i], clean_names[j],
                                     method = methods[m], q = q[m])
        }
    }
    dist.methods[[m]] <- dist
}
Once the distance calculations are done, I choose a method and set an appropriate cut-off
# hierarchical clustering, cutting the dendrogram at height 0.99
clusters <- hclust(as.dist(dist.methods[[3]]))
plot(clusters)
df <- data.frame(Companies = clean_names, Cluster = cutree(clusters, h = 0.99))
The resulting dataframe has all the company names categorized into clusters, just like I wanted.
df=df %>% group_by(Cluster)
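To eyeball the result (an illustrative addition, with a toy data frame standing in for the df built above), per-cluster counts show whether the similar names landed together:

```r
library(dplyr)

# toy stand-in for the df built above, with the same two columns
df <- data.frame(Companies = c("shell usa", "shell", "klm group", "kpmg"),
                 Cluster   = c(1, 1, 2, 3))

df %>% count(Cluster, sort = TRUE)
# the Shell cluster (Cluster 1) comes first with n = 2
```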
However, like I mentioned, when I use my original vector of 8000 company names, the distance calculations take too long and I cannot proceed.
My question
Is there a work-around for this method, when I am working with a larger vector of strings?
Maybe for larger vectors, clustering is not the right solution to this problem? In which case, what else could I do to achieve my result?
Any help would be greatly appreciated.
1 Answer
Get rid of the two inner loops, which are what is slowing you down, and use stringdistmatrix instead.
Your vector is long, but the strings are short; see the benchmark at the bottom.
library(stringdist)
strings <- c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
"GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
stringsBig <- rep(strings, 500)
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3) #size of q-gram
dist.methods <- list()
# create distance matrix for every pair of listing, for each method
for (m in 1:length(methods)) {
    dist.methods[[m]] <- stringdistmatrix(stringsBig, method = methods[[m]], q = q[[m]])
}
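To connect this back to the question's pipeline (my own sketch, reusing the 0.99 cut height from the question): stringdistmatrix already returns a dist object, so it can go straight into hclust without building the n-by-n matrix by hand:

```r
library(stringdist)

strings <- c("ROYAL DUTCH SHELL", "SHELL USA", "SHELL AUSTRALIA",
             "KLM NEDERLAND", "KLM GROUP", "KPMG")

# cosine distance over 3-grams, as in the question
d <- stringdistmatrix(tolower(strings), method = "cosine", q = 3)

clusters <- hclust(d)  # d is already a dist object
df <- data.frame(Companies = strings,
                 Cluster   = cutree(clusters, h = 0.99))
```

With 8,000 names this computes the ~32 million pairwise distances in compiled code rather than in an R loop, which is where the speed-up in the benchmark comes from.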
microbenchmark::microbenchmark(stringdistmatrix(strings),
                               for (i in 1:length(strings)) {
                                   for (j in 1:length(strings)) {
                                       stringdist(strings[i], strings[j])
                                   }
                               }, times = 100)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# stringdistmatrix(strings) 105.212 131.2805 241.9271 251.2235 279.634 2909.624 100 a
# for loop 36147.878 38165.8480 40411.9772 39527.5500 42170.895 54151.457 100 b
microbenchmark::microbenchmark(stringdistmatrix(stringsBig), times=10)
# Unit: seconds
# expr min lq mean median uq max neval
# stringdistmatrix(stringsBig) 1.5324 1.585354 1.66592 1.655901 1.691157 1.825333 10