Implementing Text Classification Using spaCy in Python

To implement a text classification task with spaCy, we need to do the following preparatory work:

1. Install the spaCy library: spaCy can be installed with pip by running `pip install spacy`.
2. Download spaCy's English model: spaCy provides pretrained models, and we choose a suitable one for text classification. Run `python -m spacy download en_core_web_sm` to download the small English model.
3. Download the dataset: For a text classification task we need a dataset that is already labeled with categories. In this example we use the 20 Newsgroups dataset, which you can download here: [Dataset download address](http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups). After downloading, extract the archive into a folder containing the text files for the training and test sets. The code below assumes the data has been converted into two files, `20news_train.txt` and `20news_test.txt`, with one `label<TAB>text` pair per line.

Now we can implement the text classification with the following complete example:

```python
import os

import spacy
from sklearn.svm import LinearSVC

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Define the dataset paths
data_dir = "<path_to_data>"
train_file = os.path.join(data_dir, "20news_train.txt")
test_file = os.path.join(data_dir, "20news_test.txt")

# Read a dataset file: one "label<TAB>text" pair per line
def load_data(file_path):
    texts = []
    labels = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            label, text = line.strip().split("\t", 1)
            texts.append(text)
            labels.append(label)
    return texts, labels

# Load the training and test datasets
train_texts, train_labels = load_data(train_file)
test_texts, test_labels = load_data(test_file)

# Preprocess a text: lowercase lemmas, with the parser and NER disabled for speed
def preprocess_text(text):
    doc = nlp(text, disable=["parser", "ner"])
    return " ".join(token.lemma_.lower() for token in doc)

# Preprocess the training and test texts
train_texts_processed = [preprocess_text(text) for text in train_texts]
test_texts_processed = [preprocess_text(text) for text in test_texts]

# Convert each text into its document vector representation
X_train = [nlp(text).vector for text in train_texts_processed]
X_test = [nlp(text).vector for text in test_texts_processed]

# Define the classifier model
classifier = LinearSVC()

# Train the classifier
classifier.fit(X_train, train_labels)

# Make predictions on the test set
predicted_labels = classifier.predict(X_test)

# Print the prediction results
for text, true_label, predicted_label in zip(test_texts, test_labels, predicted_labels):
    print(f"Text: {text}")
    print(f"True Label: {true_label}")
    print(f"Predicted Label: {predicted_label}")
    print()

# Print the classification accuracy
correct = sum(p == t for p, t in zip(predicted_labels, test_labels))
accuracy = correct / len(test_labels)
print(f"Accuracy: {accuracy}")
```

Please replace `<path_to_data>` with the folder path where you extracted the dataset.

The main steps in the code are:

1. Load the spaCy English model.
2. Define the dataset paths and read the training and test data files.
3. Preprocess the texts and convert them into feature vector representations.
4. Define a classifier model and train it.
5. Make predictions on the test set and print the predictions and the classification accuracy.

Two optional refinements (faster batch vectorization and a per-class evaluation report) are sketched below. I hope this can help you!
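
First, a minimal sketch of batch vectorization. Calling `nlp()` once per text is slow; `nlp.pipe` processes texts in batches and is usually much faster. Note also that `en_core_web_sm` does not ship with static word vectors, so `doc.vector` carries little information with it; this sketch assumes the medium model `en_core_web_md` has been installed (`python -m spacy download en_core_web_md`). The `vectorize_texts` helper is an illustrative name, not part of spaCy, and the sketch reuses `train_texts_processed` and `test_texts_processed` from the example above.

```python
import numpy as np
import spacy

# Assumption: en_core_web_md has been downloaded; unlike en_core_web_sm it
# includes static word vectors, so doc.vector averages real word vectors.
nlp = spacy.load("en_core_web_md")

# Illustrative helper (not part of spaCy): vectorize a list of texts in batches.
def vectorize_texts(texts, batch_size=64):
    vectors = []
    # The parser and NER are not needed to compute doc.vector, so disable them.
    for doc in nlp.pipe(texts, batch_size=batch_size, disable=["parser", "ner"]):
        vectors.append(doc.vector)
    return np.vstack(vectors)

# Reuses the preprocessed texts from the main example above.
X_train = vectorize_texts(train_texts_processed)
X_test = vectorize_texts(test_texts_processed)
```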
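
Second, a fuller evaluation than a single accuracy number. scikit-learn's metrics can be applied to the same predictions; this small sketch reuses `classifier`, `X_test`, and `test_labels` from the main example.

```python
from sklearn.metrics import accuracy_score, classification_report

# Reuses classifier, X_test and test_labels from the main example above.
predicted_labels = classifier.predict(X_test)

print(f"Accuracy: {accuracy_score(test_labels, predicted_labels):.3f}")

# Per-class precision, recall and F1 are usually more informative than
# overall accuracy for a 20-class problem with uneven class sizes.
print(classification_report(test_labels, predicted_labels))
```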