
How to implement text classification using scikit-learn

Updated: August 22, 2023

To implement text classification with scikit-learn, you can combine a bag-of-words representation of the text with a standard classifier such as logistic regression or a support vector machine (SVM). The following example shows this approach using logistic regression:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the data
data = pd.read_csv('data.csv')

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

# Convert the text data into feature vectors using a bag-of-words representation
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

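# Train a logistic regression classifier on the training data
# (max_iter is raised from the default of 100, which can stop before converging on sparse text features)
clf = LogisticRegression(max_iter=1000)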
clf.fit(X_train, y_train)

# Evaluate the classifier on the test data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

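In this code, we load the text data from a CSV file (assumed to have 'text' and 'label' columns) and split it into training and test sets, holding out 20% of the rows for testing. We then convert the text into feature vectors with a CountVectorizer, which builds a bag-of-words representation: each document becomes a row of word counts over the vocabulary. The vectorizer is fitted on the training set only and then applied to the test set, so the vocabulary is learned without looking at the test data. Finally, we train a logistic regression classifier on the training data and evaluate it on the test data using accuracy.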

Note that this is just one approach to text classification with scikit-learn. Common variations include weighting the word counts with TF-IDF instead of using raw counts, or swapping logistic regression for another classifier such as an SVM. Which combination works best depends on your task and data, so it is worth comparing a few of them on a held-out set.
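As one possible variation, the sketch below swaps in TF-IDF features and a linear SVM, and chains them with a Pipeline so that vectorizing and classifying happen in a single fit/predict call. This is a substitute technique, not the method shown earlier; the toy sentences and the 'pos'/'neg' labels are made up for illustration, and a real dataset would be far larger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy training data, for illustration only
train_texts = [
    "great movie, loved it",
    "wonderful acting and a great plot",
    "loved the wonderful soundtrack",
    "terrible movie, hated it",
    "awful acting and a boring plot",
    "hated the boring soundtrack",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# The pipeline fits the vectorizer and the classifier together,
# and applies the same fitted vocabulary at prediction time
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predictions = model.predict(["a great and wonderful film", "boring and awful"])
print(predictions)
```

Because the pipeline takes raw strings as input, it can be passed directly to utilities such as cross_val_score without transforming the text by hand first.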
