Text Preprocessing in Python with Gensim
Environment setup and preparation:
1. Install Python: Download a suitable Python version from the official Python website (https://www.python.org/downloads/) and follow the installer's instructions.
2. Install Gensim and NLTK: Open a terminal and run the following command to install both libraries used in this example:
pip install gensim nltk
3. Download the dataset: This example uses the Text8 dataset. You can download it from the following address:
http://mattmahoney.net/dc/text8.zip
After downloading, extract it and place the 'text8' file in the project directory.
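The download and extraction step can also be scripted. The following is a minimal sketch using only the standard library; the function names `download_text8` and `extract_archive` are hypothetical helpers introduced here for illustration, and the local file name `text8.zip` is an assumption.

```python
import urllib.request
import zipfile
from pathlib import Path

TEXT8_URL = "http://mattmahoney.net/dc/text8.zip"


def download_text8(url=TEXT8_URL, zip_path="text8.zip"):
    """Download the archive unless it is already on disk; return its path."""
    zip_path = Path(zip_path)
    if not zip_path.exists():
        urllib.request.urlretrieve(url, str(zip_path))
    return zip_path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of the archive into dest_dir and return their paths."""
    dest_dir = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return [dest_dir / name for name in archive.namelist()]
```

Calling `extract_archive(download_text8())` from the project directory should leave the 'text8' file next to your script. The download is skipped when the archive already exists, so the call is safe to repeat.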
Dependencies:
This example uses the following libraries:
- Gensim: used for text preprocessing, building the bag-of-words model, and vector space modeling.
- NLTK: used to supply the English stop word list for filtering.
Sample data:
We will use the Text8 dataset for text preprocessing. Text8 is a cleaned excerpt of English Wikipedia (the first 100 MB of text) that has been lowercased and stripped of punctuation, leaving a single long string of words separated only by spaces, with no sentence boundaries.
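Before relying on that description, it can help to inspect the text itself. The sketch below summarizes the characters and words in a string; `describe_text` is a hypothetical helper, and the sample string is a made-up stand-in in the same format, not real Text8 data.

```python
from collections import Counter


def describe_text(text, top_n=1):
    """Return the distinct characters, the word count, and the most common words."""
    words = text.split()
    return {
        "characters": sorted(set(text)),
        "word_count": len(words),
        "most_common": Counter(words).most_common(top_n),
    }


# Made-up sample in the Text8 format: lowercase words separated by single spaces
sample = "the cat sat on the mat and the dog sat on the rug"
info = describe_text(sample)
print(info["characters"])
print(info["word_count"], info["most_common"])
```

To apply it to the real file, read only a prefix (for example `f.read(1_000_000)`) so the check stays fast. If the character list contains only lowercase letters and the space, the punctuation-removal step has nothing left to remove.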
The example code is as follows:
python
import string

import nltk
from gensim import corpora
from gensim.models import Word2Vec
from nltk.corpus import stopwords

# Download the NLTK stop word list
nltk.download('stopwords')

# Text8 contains no periods, so splitting on '.' would yield one giant sentence.
# Instead, the word stream is cut into fixed-size chunks (the size is arbitrary).
CHUNK_SIZE = 1000


def preprocess_text(text):
    """Lowercase the text, strip punctuation, and drop English stop words."""
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text


def main():
    # Read the dataset
    with open('text8', encoding='utf-8') as f:
        text = f.read()
    # Preprocess the text
    processed_text = preprocess_text(text)
    # Split the text into a flat list of words
    words = processed_text.split()
    # Group the words into fixed-size word lists that act as sentences
    word_lists = [words[i:i + CHUNK_SIZE] for i in range(0, len(words), CHUNK_SIZE)]
    # Build the dictionary that maps each unique word to an integer ID
    dictionary = corpora.Dictionary(word_lists)
    # Convert each word list into a bag-of-words vector of (word_id, count) pairs
    corpus = [dictionary.doc2bow(word_list) for word_list in word_lists]
    # Train the Word2Vec model (min_count=1 keeps even words that occur once)
    model = Word2Vec(word_lists, min_count=1)


if __name__ == '__main__':
    main()
Note that this example covers only the basic steps of text preprocessing and Word2Vec model training. Adapt it to your own needs.