Discovering topics in text with Python using NLTK and machine learning algorithms
Preparation work:
1. Install Python: NLTK is a Python library, so Python itself must be installed first. You can download and install the appropriate version from the official Python website.
2. Install NLTK: Use the pip command to install the NLTK library. Run the following command from the terminal or command prompt:
pip install nltk
3. Download NLTK's additional resources: Some features (such as the stopword list) require extra data. Run the following code in a Python interactive session:
python
import nltk
nltk.download()
This will open a download window where you can select and download the required resources. Alternatively, you can download only the resources this tutorial needs from a script, as shown in the sketch after this list.
4. Install machine learning libraries: To use machine learning algorithms, we also need scikit-learn. Install it with pip:
pip install scikit-learn
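Referring back to step 3: instead of using the interactive download window, you can fetch just the resources this tutorial actually uses (the stopword list and the WordNet data for lemmatization). A minimal sketch; the exact set of packages may vary with your NLTK version:
python
import nltk

# Download only the resources used later in this tutorial:
# the English stopword list and the WordNet data needed for lemmatization.
for resource in ['stopwords', 'wordnet', 'omw-1.4']:
    nltk.download(resource)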
Required libraries:
NLTK, scikit-learn
Dataset:
We will use the 20 Newsgroups dataset as sample data. It contains news posts on a range of topics such as sports, technology, and politics. You can download it manually from http://qwone.com/~Jason/20Newsgroups/20news-bydate.tar.gz or fetch it from a script as sketched below.
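If you prefer to fetch the archive from code rather than through a browser, the standard library is enough. A minimal sketch, assuming the URL above is still reachable and the archive should be saved into the current working directory (the same place the sample code below expects it):
python
import os
import urllib.request

# URL of the 20 Newsgroups archive (same link as above)
DATASET_URL = 'http://qwone.com/~Jason/20Newsgroups/20news-bydate.tar.gz'
ARCHIVE_PATH = '20news-bydate.tar.gz'

# Download the archive only if it is not already present
if not os.path.exists(ARCHIVE_PATH):
    urllib.request.urlretrieve(DATASET_URL, ARCHIVE_PATH)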
The complete sample code is as follows:
python
import os
import tarfile
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Decompress the dataset archive
def untar_archive(archive_path, extract_dir):
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(path=extract_dir)

# Read the dataset; ignore undecodable bytes, since some posts are not valid UTF-8
def load_dataset(dataset_dir):
    dataset = load_files(dataset_dir, encoding='utf-8', decode_error='ignore', shuffle=False)
    return dataset

# Lowercase, tokenize, remove stopwords and non-alphabetic tokens, and lemmatize
def preprocess_text(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # Convert the text to lowercase and split it into words
    words = text.lower().split()
    # Remove stopwords and non-alphabetic tokens, then lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word.isalpha() and word not in stop_words]
    return ' '.join(words)

# Extract the archive if needed and load the text data
data_dir = '20news-bydate'
if not os.path.exists(data_dir):
    tar_path = '20news-bydate.tar.gz'
    untar_archive(tar_path, data_dir)
# The archive contains "20news-bydate-train" and "20news-bydate-test" subdirectories;
# load the training split, whose subdirectories are the newsgroup categories
dataset = load_dataset(os.path.join(data_dir, '20news-bydate-train'))

# Preprocess the text data
preprocessed_data = [preprocess_text(text) for text in dataset.data]

# Feature extraction using a bag-of-words model
vectorizer = CountVectorizer(max_features=1000)
feature_matrix = vectorizer.fit_transform(preprocessed_data)

# Topic modeling with LDA
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(feature_matrix)

# Print the top keywords for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[idx] for idx in topic.argsort()[:-11:-1]]
    print(f"Topic #{topic_idx+1}: {', '.join(top_words)}")
This sample code first decompresses the 20 Newsgroups archive and preprocesses the texts with the NLTK library: converting them to lowercase, splitting them into words, removing stopwords, and lemmatizing. It then uses scikit-learn's CountVectorizer to convert the text data into a feature matrix. Finally, it fits a LatentDirichletAllocation model for topic modeling and prints the top keywords of each topic.
Note: To run the code above, you need to download the 20 Newsgroups archive into the working directory; the code extracts it and reads it from the directory given by the `data_dir` variable.
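After the model has been fitted, it can also be applied to previously unseen text. A minimal sketch that reuses the vectorizer, lda, and preprocess_text objects defined above; the example sentence is a made-up placeholder:
python
# Infer the topic mixture of a new, unseen document
new_text = "The team won the championship game last night."  # hypothetical example document
new_features = vectorizer.transform([preprocess_text(new_text)])
topic_distribution = lda.transform(new_features)[0]

# Print the probability of each topic for the new document
for topic_idx, prob in enumerate(topic_distribution):
    print(f"Topic #{topic_idx+1}: {prob:.3f}")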