Calculating document similarity in Python with Gensim
Before calculating document similarity with Gensim, complete the following setup steps:
1. Install Python: Ensure that you have already installed Python. You can download the latest version from the official Python website.
2. Install Gensim: Use the following command to install the Gensim library:
pip install gensim
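3. Prepare the dataset: You can use any text dataset to build the model. One common choice is the 20 Newsgroups dataset, which can be downloaded from the address below. The walkthrough in this article uses a few short sample sentences instead, so it runs without any download: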
http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
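If you do work with a downloaded corpus, the extracted files have to be read into a list of strings before the steps below apply. The following is a minimal sketch using only the standard library. The helper name `load_documents` is hypothetical, and the `latin-1` default encoding is an assumption about the files (it never fails to decode, but may not match every dataset).

```python
from pathlib import Path


def load_documents(root_dir, encoding="latin-1"):
    """Read every regular file under root_dir.

    Returns a list of (relative_path, text) pairs in a stable, sorted order.
    """
    root = Path(root_dir)
    documents = []
    # sorted() keeps the document order the same from run to run
    for path in sorted(root.rglob("*")):
        if path.is_file():
            text = path.read_text(encoding=encoding, errors="replace")
            documents.append((path.relative_to(root).as_posix(), text))
    return documents


# Example call (the directory name is an assumption about where you extracted the archive):
# newsgroup_docs = load_documents("20_newsgroups")
```

Keeping the relative path next to each text makes it possible to map a similarity score back to the file it came from.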
The rest of this article walks through calculating document similarity with Gensim step by step.
First, import the required modules:
```python
from gensim import corpora
from gensim.models import TfidfModel
from gensim.similarities import Similarity
```
Next, load and preprocess the document data. In this example, we will simply use some sample text data.
```python
documents = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
    "Is this the first document?"
]
# Lowercase each document, split it on whitespace, and strip surrounding
# punctuation so that "document?" and "document" map to the same token
texts = [[word.strip(".,!?") for word in document.lower().split()] for document in documents]
# Build a Dictionary that maps each distinct token to an integer id
dictionary = corpora.Dictionary(texts)
# Convert each document into a bag-of-words vector of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
```
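To make the bag-of-words step less opaque, here is a plain-Python sketch of the idea behind a token-to-id dictionary and a `(token_id, count)` vector. It is an illustration only, not Gensim's implementation: the helper names `build_vocabulary` and `to_bow` are hypothetical, and the ids Gensim assigns may differ from the first-appearance order used here.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


sample_texts = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
]
vocab = build_vocabulary(sample_texts)
bow = to_bow(sample_texts[1], vocab)
print(bow)  # "document" has id 4 and occurs twice, so the pair (4, 2) appears
```

Note that a word missing from the vocabulary, such as "a" in a later query, simply contributes nothing to the vector.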
Next, train a TF-IDF model:
```python
# Fit a TF-IDF model on the bag-of-words corpus
tfidf = TfidfModel(corpus)
# Apply the model to the corpus to get TF-IDF weighted vectors
corpus_tfidf = tfidf[corpus]
```
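The weighting itself can be sketched by hand. The version below assumes the scheme that, to my understanding, `TfidfModel` uses by default: raw term count multiplied by `log2(N / df)`, followed by L2 normalisation of each document vector. Treat it as an approximation for building intuition; the function name `tfidf_vectors` is hypothetical.

```python
import math


def tfidf_vectors(corpus_bow):
    """Weight each (term_id, count) pair by count * log2(N / df), then L2-normalise."""
    n_docs = len(corpus_bow)
    # Document frequency: in how many documents does each term occur?
    doc_freq = {}
    for bow in corpus_bow:
        for term_id, _count in bow:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    vectors = []
    for bow in corpus_bow:
        weighted = [(term_id, count * math.log2(n_docs / doc_freq[term_id]))
                    for term_id, count in bow]
        # A term that occurs in every document gets weight 0 and is dropped
        weighted = [(term_id, w) for term_id, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        vectors.append([(term_id, w / norm) for term_id, w in weighted] if norm else [])
    return vectors


# Term 0 occurs in both documents, so only terms 1 and 2 keep a non-zero weight
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
toy_tfidf = tfidf_vectors(toy_corpus)
print(toy_tfidf)
```

One consequence worth knowing: in the four sample sentences, words such as "this", "is", and "the" occur in every document, so under this scheme they carry no weight in the similarity scores.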
Finally, use a similarity model to calculate document similarity:
```python
# Create the similarity index; "index" is the filename prefix for the index files written to disk
index = Similarity("index", corpus_tfidf, num_features=len(dictionary))
# Define a query document
query = "This is a document"
# Tokenize the query (lowercase, split on whitespace, strip surrounding punctuation)
# and convert it to a bag-of-words vector
query_bow = dictionary.doc2bow([word.strip(".,!?") for word in query.lower().split()])
# Convert the query to a TF-IDF vector
query_tfidf = tfidf[query_bow]
# Compute the similarity between the query and every indexed document
sims = index[query_tfidf]
# Sort in descending order of similarity
results = sorted(enumerate(sims), key=lambda item: -item[1])
# Print the similarity results
for idx, score in results:
    print(f"Document: {documents[idx]}, Similarity Score: {score}")
```
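The scores returned by the index are cosine similarities between the query vector and each document vector. As a rough illustration of what that means for sparse `(term_id, weight)` vectors, here is a small stand-alone sketch; the function name and the toy vectors are made up for the example and are not taken from Gensim.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (term_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only term ids present in both vectors contribute to the dot product
    dot = sum(weight * b.get(term_id, 0.0) for term_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


query_vec = [(4, 1.0)]
doc_vecs = [[(3, 0.6), (4, 0.8)], [(5, 1.0)]]
scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
# Rank documents from most to least similar, as in the walkthrough
ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
print(ranked)
```

A document that shares no weighted terms with the query scores 0, which is why query words that are absent from the dictionary, or that appear in every document, do not help a document rank higher.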
This is the complete example code for calculating document similarity with Gensim. You can adapt it to your own needs by changing the dataset and the query document.