Implementing Text Classification Using spaCy in Python
To implement a text classification task with spaCy, we first need to complete the following preparatory work:
1. Install the spaCy library: spaCy can be installed with pip by running the following command:
pip install spacy
2. Download spaCy's English model: spaCy provides pre-trained pipelines, and we choose an appropriate one for text classification. Run the following command to download the small English model:
python -m spacy download en_core_web_sm
3. Download the dataset: For text classification we need a dataset that is already labeled with categories. In this example we use the 20 Newsgroups dataset, which you can download at the following link:
[Dataset download address](http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups)
After downloading, extract the archive; it contains folders with the text files for the training and test sets (see the conversion sketch after this list for turning those folders into the files the example reads).
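Note that the 20 Newsgroups archive is usually distributed as one directory per category with one message per file, while the example below reads a single tab-separated file per split (one `label<TAB>text` line per document). The following sketch shows one way to flatten the extracted folders into the expected files; the directory names `20news-bydate-train` and `20news-bydate-test` are assumptions, so adjust them to match your download:
```python
import os

# A sketch that flattens a "<split_dir>/<category>/<message_file>" tree into a single
# tab-separated file ("label<TAB>text" per line), which is the format the example expects.
# The directory names used below are assumptions; adjust them to your extracted archive.
def convert_split(split_dir, output_file):
    with open(output_file, "w", encoding="utf-8") as out:
        for category in sorted(os.listdir(split_dir)):
            category_dir = os.path.join(split_dir, category)
            if not os.path.isdir(category_dir):
                continue
            for name in os.listdir(category_dir):
                path = os.path.join(category_dir, name)
                with open(path, "r", encoding="latin-1") as f:
                    # Collapse newlines and tabs so each document fits on one line
                    text = " ".join(f.read().split())
                out.write(f"{category}\t{text}\n")

convert_split("20news-bydate-train", "20news_train.txt")
convert_split("20news-bydate-test", "20news_test.txt")
```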
Now we can implement text classification with the following complete example:
```python
import os

import spacy
from sklearn.svm import LinearSVC

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Define the dataset paths
data_dir = "<path_to_data>"
train_file = os.path.join(data_dir, "20news_train.txt")
test_file = os.path.join(data_dir, "20news_test.txt")

# Read a dataset file: one "label<TAB>text" pair per line
def load_data(file_path):
    texts = []
    labels = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            label, text = line.strip().split("\t", 1)
            texts.append(text)
            labels.append(label)
    return texts, labels

# Load the training and test datasets
train_texts, train_labels = load_data(train_file)
test_texts, test_labels = load_data(test_file)

# Preprocess a text: lemmatize and lowercase each token
# (the parser and NER components are disabled because they are not needed here)
def preprocess_text(text):
    doc = nlp(text, disable=["parser", "ner"])
    return " ".join(token.lemma_.lower() for token in doc)

# Preprocess the training and test texts
train_texts_processed = [preprocess_text(text) for text in train_texts]
test_texts_processed = [preprocess_text(text) for text in test_texts]

# Convert each text into a feature vector (the averaged document vector; note that
# en_core_web_sm ships without static word vectors, so en_core_web_md or _lg
# usually produce better document vectors)
X_train = [nlp(text).vector for text in train_texts_processed]
X_test = [nlp(text).vector for text in test_texts_processed]

# Define the classifier model
classifier = LinearSVC()

# Train the classifier model
classifier.fit(X_train, train_labels)

# Make predictions on the test set
predicted_labels = classifier.predict(X_test)

# Print the prediction for each test text
for text, true_label, predicted_label in zip(test_texts, test_labels, predicted_labels):
    print(f"Text: {text}")
    print(f"True Label: {true_label}")
    print(f"Predicted Label: {predicted_label}")
    print()

# Print the classification accuracy
accuracy = sum(p == t for p, t in zip(predicted_labels, test_labels)) / len(test_labels)
print(f"Accuracy: {accuracy}")
```
Please add '<path'_ To_ Replace with the folder path where you downloaded the dataset. The main steps in the source code include:
1. Load the spaCy English model.
2. Define the dataset path and read training and testing data files.
3. Preprocess the text and convert it into feature vector representation.
4. Define a classifier model and train the model.
5. Make predictions on the test set and output the prediction results and classification accuracy.
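Once trained, the classifier can also label new, unseen text by running it through the same preprocessing and vectorization steps. Here is a minimal sketch that reuses `nlp`, `preprocess_text`, and `classifier` from the example above (the sample sentence is only an illustration):
```python
# Classify a new piece of text with the trained model
new_text = "The team won the championship game last night."  # illustrative example input
new_vector = nlp(preprocess_text(new_text)).vector
predicted_label = classifier.predict([new_vector])[0]
print(f"Predicted category: {predicted_label}")
```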
I hope this helps!