Discovering topics in text with Python using NLTK and machine learning algorithms
Preparation work:
1. Install Python: NLTK is a Python library, so Python itself must be installed first. You can download and install the appropriate version from the official Python website.
2. Install NLTK: Use the pip command to install the NLTK library. Run the following command from the terminal or command prompt:
pip install nltk
3. Download NLTK's additional resources: Some features (such as the stopword list) require extra data. Run the following code in a Python interactive session:
python
import nltk
nltk.download()
This will open a download window where you can select and download the required resources. Alternatively, you can download only the resources this tutorial needs from a script, as shown in the sketch after this list.
4. Install machine learning libraries: To use machine learning algorithms, we also need scikit-learn. Install it with pip:
pip install scikit-learn
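Referring back to step 3: instead of using the interactive download window, you can fetch just the resources this tutorial actually uses (the stopword list and the WordNet data for lemmatization). A minimal sketch; the exact set of packages may vary with your NLTK version:
python
import nltk

# Download only the resources used later in this tutorial:
# the English stopword list and the WordNet data needed for lemmatization.
for resource in ['stopwords', 'wordnet', 'omw-1.4']:
    nltk.download(resource)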
Required libraries:
NLTK, scikit-learn
Dataset:
We will use the 20 Newsgroups dataset as sample data. It contains news posts on a range of topics such as sports, technology, and politics. You can download it manually from http://qwone.com/~Jason/20Newsgroups/20news-bydate.tar.gz or fetch it from a script as sketched below.
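If you prefer to fetch the archive from code rather than through a browser, the standard library is enough. A minimal sketch, assuming the URL above is still reachable and the archive should be saved into the current working directory (the same place the sample code below expects it):
python
import os
import urllib.request

# URL of the 20 Newsgroups archive (same link as above)
DATASET_URL = 'http://qwone.com/~Jason/20Newsgroups/20news-bydate.tar.gz'
ARCHIVE_PATH = '20news-bydate.tar.gz'

# Download the archive only if it is not already present
if not os.path.exists(ARCHIVE_PATH):
    urllib.request.urlretrieve(DATASET_URL, ARCHIVE_PATH)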
The complete sample code is as follows:
python
import os
import tarfile
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Decompress the dataset archive
def untar_archive(archive_path, extract_dir):
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(path=extract_dir)

# Read the dataset; ignore undecodable bytes, since some posts are not valid UTF-8
def load_dataset(dataset_dir):
    dataset = load_files(dataset_dir, encoding='utf-8', decode_error='ignore', shuffle=False)
    return dataset

# Lowercase, tokenize, remove stopwords and non-alphabetic tokens, and lemmatize
def preprocess_text(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # Convert the text to lowercase and split it into words
    words = text.lower().split()
    # Remove stopwords and non-alphabetic tokens, then lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word.isalpha() and word not in stop_words]
    return ' '.join(words)

# Extract the archive if needed and load the text data
data_dir = '20news-bydate'
if not os.path.exists(data_dir):
    tar_path = '20news-bydate.tar.gz'
    untar_archive(tar_path, data_dir)
# The archive contains "20news-bydate-train" and "20news-bydate-test" subdirectories;
# load the training split, whose subdirectories are the newsgroup categories
dataset = load_dataset(os.path.join(data_dir, '20news-bydate-train'))

# Preprocess the text data
preprocessed_data = [preprocess_text(text) for text in dataset.data]

# Feature extraction using a bag-of-words model
vectorizer = CountVectorizer(max_features=1000)
feature_matrix = vectorizer.fit_transform(preprocessed_data)

# Topic modeling with LDA
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(feature_matrix)

# Print the top keywords for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[idx] for idx in topic.argsort()[:-11:-1]]
    print(f"Topic #{topic_idx+1}: {', '.join(top_words)}")
This sample code first decompresses the 20 Newsgroups archive and preprocesses the texts with the NLTK library: converting them to lowercase, splitting them into words, removing stopwords, and lemmatizing. It then uses scikit-learn's CountVectorizer to convert the text data into a feature matrix. Finally, it fits a LatentDirichletAllocation model for topic modeling and prints the top keywords of each topic.
Note: To run the code above, you need to download the 20 Newsgroups archive into the working directory; the code extracts it and reads it from the directory given by the `data_dir` variable.
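After the model has been fitted, it can also be applied to previously unseen text. A minimal sketch that reuses the vectorizer, lda, and preprocess_text objects defined above; the example sentence is a made-up placeholder:
python
# Infer the topic mixture of a new, unseen document
new_text = "The team won the championship game last night."  # hypothetical example document
new_features = vectorizer.transform([preprocess_text(new_text)])
topic_distribution = lda.transform(new_features)[0]

# Print the probability of each topic for the new document
for topic_idx, prob in enumerate(topic_distribution):
    print(f"Topic #{topic_idx+1}: {prob:.3f}")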