Text clustering in Python with Gensim
To use Gensim for text clustering, you first need to install Python and the related libraries. The environment preparation steps and required libraries are as follows:
1. Install Python: Visit the official Python website (https://www.python.org/downloads/) and download and install the Python version appropriate for your operating system.
2. Install Gensim: Run the following command at the command prompt or in a terminal to install Gensim:
pip install gensim
3. Install other dependencies: Gensim also relies on other libraries, such as numpy and scipy. Run the following command to install them:
pip install numpy scipy
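Once the installation finishes, you can quickly confirm that the environment is ready by importing the packages and printing their versions. This is just a sanity check and not part of the clustering example itself:
python
import gensim
import numpy
import scipy

# Print the installed versions to confirm that all required packages import correctly
print("gensim:", gensim.__version__)
print("numpy:", numpy.__version__)
print("scipy:", scipy.__version__)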
Next, we will use the 20 Newsgroups dataset (news articles from 20 categories) to demonstrate text clustering. The code below fetches it automatically through scikit-learn, but you can also download the raw dataset from the following website: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
If you download and decompress the dataset manually, you will get a folder called "20_newsgroups" containing multiple subfolders, each representing one news category.
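If you do download the archive manually, the following small sketch simply lists the category subfolders as a sanity check; the path is a placeholder for wherever you extracted the data:
python
import os

# Placeholder path; replace with the directory where you extracted the archive
data_path = "path/to/20_newsgroups"

# List the category subfolders to confirm the expected dataset layout
categories_on_disk = sorted(
    name for name in os.listdir(data_path)
    if os.path.isdir(os.path.join(data_path, name))
)
print(len(categories_on_disk), "categories found")
print(categories_on_disk[:5])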
Now, let's implement a complete text clustering example:
python
from gensim import corpora, matutils
from gensim.parsing import preprocessing
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
# Directory where scikit-learn will download and cache the dataset (replace with a real path)
data_path = "path/to/20_newsgroups"
# Fetch the training subset of the 20 Newsgroups data (downloaded and cached automatically)
categories = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
              'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
              'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
              'talk.politics.misc', 'talk.religion.misc']
data_train = fetch_20newsgroups(data_home=data_path, subset='train', categories=categories,
                                shuffle=True, random_state=42)
# Preprocess the text with Gensim's default filters (strip tags and punctuation, lowercase,
# remove stopwords, stem), then join the tokens back into strings for scikit-learn
normalized_data = [" ".join(preprocessing.preprocess_string(doc)) for doc in data_train.data]
# Vectorize the preprocessed text as raw counts and as TF-IDF weights
vectorizer = CountVectorizer(stop_words='english')
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(normalized_data)
tfidf_X = tfidf_vectorizer.fit_transform(normalized_data)
# Convert the sparse matrices to Gensim corpora and serialize them in Matrix Market format
corpus = matutils.Sparse2Corpus(X, documents_columns=False)
tfidf_corpus = matutils.Sparse2Corpus(tfidf_X, documents_columns=False)
corpora.MmCorpus.serialize('corpus.mm', corpus)
corpora.MmCorpus.serialize('tfidf_corpus.mm', tfidf_corpus)
#Clustering using KMeans
num_clusters = 5
kmeans_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=5)
kmeans_model.fit(X)
#Print clustering results
print("Top terms per cluster:")
order_centroids = kmeans_model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(num_clusters):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
Make sure to replace the 'data_path' variable with a real, writable directory; fetch_20newsgroups will download and cache the dataset there on the first run. The example loads news articles from 20 categories and divides them into 5 clusters with the KMeans algorithm; it also includes the preprocessing and vectorization steps for the text.
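After the model is fitted, you can also assign a new, unseen document to one of the learned clusters by running it through the same preprocessing and vectorizer. The following is a small sketch built on top of the script above; the example sentence is made up:
python
from gensim.parsing import preprocessing

# Reuse the fitted vectorizer and kmeans_model from the script above
new_doc = "NASA plans another mission to study the outer planets."
new_vec = vectorizer.transform([" ".join(preprocessing.preprocess_string(new_doc))])
cluster_id = kmeans_model.predict(new_vec)[0]
print("New document assigned to cluster", cluster_id)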
I hope this helps you get started with text clustering using Gensim.