Найти подходящие группы строк в R

У меня есть вектор около 8000 строк. Каждый элемент в векторе является названием компании.

Моя цель

Моя цель состоит в том, чтобы объединить названия этих компаний в группы, чтобы каждый кластер содержал группу названий компаний, которые похожи друг на друга (например: ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL и т. Д... будут принадлежать к одной группе /cluster, поскольку все они являются компаниями, основанными на Shell, т.е. в их названиях есть слово "Shell").

Имея дело с вектором такого размера, кажется, что нужно вечно находить группы похожих компаний, используя технику кластеризации, которую я использовал. However on smaller vectors, this method works well.

Let me demonstrate my approach using an example vector of company names, which is much smaller than the original one.

With a small vector of strings, this approach works very well.

The vector looks something like this

string=c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
     "GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")

Моя попытка

In order to tackle this problem, I used a hierarchical clustering method.

# load packages
pacman::p_load(stringdist, dplyr, tm, gplots)

But some prep work first

#Function to clean strings
str_clean <- function(strings) {
 require(dplyr)
 require(tm)
 strings %>% tolower() %>% removePunctuation() %>% stripWhitespace() %>% 
 trim()
}

# Clean company names
clean_names = str_clean(string)

n = length(clean_names)

Now to calculate the distances between words, to be used for clustering

# Distance methods
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  #size of q-gram

dist.methods <- list()

# create distance matrix for every pair of listing, for each method
for (m in 1:length(methods)) {
dist = matrix(NA, ncol = n, nrow = n)  #initialize empty matrix
# row.names(dist) = prods
for (i in 1:n) {
for (j in 1:n) {
  dist[i, j] <- stringdist(clean_names[i], clean_names[j], method = methods[m], 
                           q = q[m])
}
}
 dist.methods[[m]] <- dist
 }

Once the distance calculations are done, I choose a method and set an appropriate cut-off

#hierarchical clustering with cut-off of 0.2
clusters <- hclust(as.dist(dist.methods[[3]]))
plot(clusters)
df=as.data.frame(cbind("Companies" = clean_names, "Cluster" = cutree(clusters, h = .99)))

The resulting dataframe has all the company names categorized into clusters, just like I wanted.

df=df %>% group_by(Cluster)

However, like I mentioned, when I use my original vector of 8000 company names, the distance calculations take too long and I cannot proceed.

Мой вопрос

Is there a work-around for this method, when I am working with a larger vector of strings?

Maybe for larger vectors, clustering is not the right solution to this problem? In which case, what else could I do to achieve my result?

Любая помощь будет принята с благодарностью.

1 ответ

Избавьтесь от внутренних двух петель, это то, что замедляет вас, и используйте stringdistmatrix ваш вектор длинный, но строки маленькие, вы увидите эталонный тест внизу.

library(stringdist)

strings <- c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
         "GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
stringsBig <- rep(strings, 500)    
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  #size of q-gram    
dist.methods <- list()

# create distance matrix for every pair of listing, for each method
for (m in 1:length(methods)) {
  dist.methods[[m]] <- stringdistmatrix(stringsBig, method = methods[[m]], q = q[[m]])
}

microbenchmark::microbenchmark(stringdistmatrix(stringsBig),
                           for (i in 1:length(strings)) {
                             for (j in 1:length(strings)) {
                              stringdist(strings[i], strings[j])
                             }
                           },times = 100)

# Unit: microseconds
# expr                          min         lq       mean     median        uq       max neval cld
# stringdistmatrix(strings) 105.212   131.2805   241.9271   251.2235   279.634  2909.624   100  a 
# for loop                36147.878 38165.8480 40411.9772 39527.5500 42170.895 54151.457   100   b

microbenchmark::microbenchmark(stringdistmatrix(stringsBig), times=10)
# Unit: seconds
# expr    min       lq    mean   median       uq      max neval
# stringdistmatrix(stringsBig) 1.5324 1.585354 1.66592 1.655901 1.691157 1.825333    10
Другие вопросы по тегам