Using terms from tf-idf for topic modeling in Python
I have a dataframe with a text column. I cleaned the data and applied tf-idf to extract the important terms from the documents. Now I want to pass these terms to LDA to get topics, but I don't know how to do that.
from sklearn.feature_extraction.text import TfidfVectorizer
import itertools
import numpy as np
import pandas as pd
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
text=["Sugar is bad to consume. My sister likes to have sugar, but not my father.",
"My father spends a lot of time driving my sister around to dance practice.",
"Health experts say that Sugar is not good for your lifestyle.",
"Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.",
"Doctors suggest that driving may cause increased stress and blood pressure.",
"with a new assignment. A new topic",
"The current topic – word assignment is updated with a new topic with the probability",
"documents contain fewer topics. On the other hand, higher the beta"
]
df=pd.DataFrame(text)
df=df.rename(columns={0:"text"})
# Cleaning data
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)  # strip punctuation
df['text'] = df['text'].str.replace(r'\d+', '', regex=True)  # strip digits
df['text'] = df['text'].apply(lambda s: " ".join(s.lower().split()))  # lowercase and collapse whitespace
stop = stopwords.words('english')
df['text']= df['text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
doc_clean=df['text'].values.tolist()
#define vectorizer parameters
# tf-idf vector
tfidf_vectorizer = TfidfVectorizer(max_features=200000,
stop_words='english',
use_idf=True,ngram_range=(3,3))
tfidf_matrix = tfidf_vectorizer.fit_transform(doc_clean)  # prefix with %time in IPython/Jupyter to time it
print(tfidf_matrix.shape)
fully_indexed = []
index_value={i[1]:i[0] for i in tfidf_vectorizer.vocabulary_.items()}
for row in tfidf_matrix:
    fully_indexed.append({index_value[column]:value for (column,value) in zip(row.indices,row.data)})
top2ReasonsVal = [sorted([(k,v) for k,v in d.items() if (v>0.35)],key = lambda x:x[1])[-2:] for d in fully_indexed]
overall2ReasonsVal =(sorted(list(itertools.chain(*top2ReasonsVal)), key=lambda x: x[1]))
sorted(list(set(overall2ReasonsVal)), key=lambda x: x[1], reverse=True)
Result of the tf-idf vectorization:
[('new assignment new', 0.7071067811865475),
('assignment new topic', 0.7071067811865475),
('say sugar good', 0.5),
('sugar good lifestyle', 0.5),
('hand higher beta', 0.447213595499958),
('topics hand higher', 0.447213595499958),
('likes sugar father', 0.447213595499958),
('sister likes sugar', 0.447213595499958),
('sister dance practice', 0.408248290463863),
('stress blood pressure', 0.408248290463863),
('updated new topic', 0.408248290463863),
('increased stress blood', 0.408248290463863),
('new topic probability', 0.408248290463863),
('driving sister dance', 0.408248290463863),
('drive sister better', 0.408248290463863),
('father drive sister', 0.408248290463863)]
Thanks in advance.
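One possible way to feed the selected terms into LDA is sketched below, under a few assumptions. The code restricts a `CountVectorizer` to the chosen trigrams through its `vocabulary` parameter, then fits scikit-learn's `LatentDirichletAllocation` on the resulting counts. The names `docs` and `selected_terms` are hypothetical stand-ins for `doc_clean` and for the terms in the sorted list above. The choice of two topics is arbitrary. Raw counts are used instead of tf-idf weights because LDA is usually described as a model of term counts.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for doc_clean: already lowercased, stop words removed.
docs = [
    "sugar bad consume sister likes sugar father",
    "father spends lot time driving sister dance practice",
    "health experts say sugar good lifestyle",
    "doctors suggest driving may cause increased stress blood pressure",
]

# Hypothetical stand-in for the trigrams picked out by the tf-idf threshold.
selected_terms = [
    "sister likes sugar",
    "likes sugar father",
    "driving sister dance",
    "sister dance practice",
    "say sugar good",
    "sugar good lifestyle",
    "increased stress blood",
    "stress blood pressure",
]

# A fixed vocabulary makes the vectorizer count only the selected trigrams;
# column i of the matrix corresponds to selected_terms[i].
count_vectorizer = CountVectorizer(vocabulary=selected_terms, ngram_range=(3, 3))
counts = count_vectorizer.transform(docs)

# n_components=2 is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document

# Collect the three highest-weighted terms of every topic.
top_terms = []
for topic_idx, weights in enumerate(lda.components_):
    best = [selected_terms[i] for i in weights.argsort()[::-1][:3]]
    top_terms.append(best)
    print(f"Topic {topic_idx}: {best}")
```

With so few documents the topics will not be meaningful; the sketch only shows how the vocabulary restriction connects the tf-idf step to LDA. On real data, the number of topics and the tf-idf threshold would need tuning.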