News Category Classifier

what you build

A text classification system that automatically categorises news articles with a trained ML model. The project covers the full ML pipeline, from data to deployment.

Train on BBC News or AG News dataset (free, public)
TF-IDF feature extraction with sklearn Pipeline
Logistic Regression + Naive Bayes + Random Forest comparison
Confusion matrix and classification report visualisation
Live prediction: paste any news article, get category + confidence
Deployed Streamlit app with model loaded from pickle

prerequisites

Before You Start

Python basics and pandas
ML course: Classification topic
Basic understanding of text vectorisation

architecture

How It Works

📰 News Article (raw text) → 🧹 Text Cleaning (NLTK preprocessing) → 📊 TF-IDF Vectorisation → 🤖 Classifier (LogReg/NB/RF) → 📂 Category Prediction

3-week plan

Milestone Breakdown

Week 1

Data + Feature Engineering

Load the BBC News dataset from Kaggle (it is not bundled with sklearn.datasets), or use AG News
Text preprocessing: lowercase, remove stopwords, stemming (NLTK)
TF-IDF vectorisation, starting with max_features=5000 (the core implementation raises this to 10000 once bigrams are added)
Train-test split and baseline accuracy check
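
The Week 1 steps can be sketched as follows. This is a minimal illustration on a tiny made-up corpus, not the BBC data. The clean helper, the short STOPWORDS set, and the crude suffix stripper are hypothetical stand-ins for NLTK's stopword list and stemmer, used here so the snippet runs without any downloads.

```python
import re

from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for NLTK's stopword list and PorterStemmer.
STOPWORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "is", "for"}


def crude_stem(word):
    """Strip a few common suffixes; a real project would use an NLTK stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(crude_stem(t) for t in tokens if t not in STOPWORDS)


# Toy corpus, made up for illustration.
texts = [
    "The striker scored in the final minutes of the match",
    "Midfielder injured during training before the cup final",
    "Shares jumped as the bank reported rising profits",
    "Markets dropped on fears of higher interest rates",
    "The team celebrated winning the league title",
    "Investors are watching the central bank decision",
    "Coach praised the players after the match",
    "Oil prices climbed and the stocks followed",
]
labels = ["sport", "sport", "business", "business",
          "sport", "business", "sport", "business"]

cleaned = [clean(t) for t in texts]

# Split before fitting TF-IDF so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)  # learn vocabulary on train only
X_test_vec = vectorizer.transform(X_test)        # reuse that vocabulary

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_vec, y_train)
baseline_accuracy = baseline.score(X_test_vec, y_test)

print(cleaned[0])  # striker scor final minute match
print(X_train_vec.shape, X_test_vec.shape, baseline_accuracy)
```

Any real model should beat the baseline accuracy printed on the last line; if it does not, the features or labels need another look.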

Week 2

Model Training + Evaluation

Train Logistic Regression, Naive Bayes, and Random Forest
Compare accuracy, F1-score, and training time
Plot confusion matrix with seaborn heatmap
Save best model with pickle or joblib
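
One possible shape for the Week 2 comparison loop is sketched below. The corpus is synthetic (random words drawn from two made-up vocabularies), so the scores it prints say nothing about real news data; the candidate names and the results dictionary are illustrative choices.

```python
import random
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Synthetic two-class corpus (made up; replace with the real articles).
random.seed(0)
SPORT = ["match", "goal", "team", "coach", "league", "striker", "cup", "injury"]
BUSINESS = ["shares", "bank", "profit", "market", "investor", "stocks", "rates", "oil"]


def make_docs(words, n):
    """Build n pseudo-documents of six random words each."""
    return [" ".join(random.choices(words, k=6)) for _ in range(n)]


texts = make_docs(SPORT, 40) + make_docs(BUSINESS, 40)
labels = ["sport"] * 40 + ["business"] * 40

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

candidates = {
    "LogReg": LogisticRegression(max_iter=1000),
    "NaiveBayes": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, clf in candidates.items():
    # Same TF-IDF step for every classifier, so only the model differs.
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
        "train_seconds": train_seconds,
        "confusion": confusion_matrix(y_test, y_pred, labels=["business", "sport"]),
    }

for name, r in results.items():
    print(f"{name:12s} acc={r['accuracy']:.2f} f1={r['macro_f1']:.2f} "
          f"time={r['train_seconds']:.3f}s")
```

Each confusion array can be passed to seaborn.heatmap(cm, annot=True, fmt="d") to draw the plot named in the plan, and the best-scoring pipeline can then be saved with joblib.dump.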

Week 3

Streamlit App + Deploy

Load pickled model in Streamlit app
Build text input + prediction UI with confidence scores
Deploy to Streamlit Cloud
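
The "category + confidence" part of the UI needs a probability as well as a label. A minimal sketch, assuming the saved model is a sklearn Pipeline whose final step supports predict_proba (Logistic Regression, Naive Bayes, and Random Forest all do); predict_with_confidence is a hypothetical helper name, and the four training sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def predict_with_confidence(model, text):
    """Return (label, probability) for the most likely class."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])


# Tiny stand-in model; the real app would use joblib.load('model.pkl').
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(
    [
        "the team won the match with a late goal",
        "the coach praised the striker after the cup final",
        "bank shares rose as profit beat forecasts",
        "investors sold stocks as rates climbed",
    ],
    ["sport", "sport", "business", "business"],
)

label, confidence = predict_with_confidence(model, "the team scored a late goal")
print(f"{label} ({confidence:.0%})")
```

In the Streamlit app, the text would come from st.text_area and the formatted result would be shown with st.write.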

key code

Core Implementation

classifier.py

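from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumed inputs: texts (list of article strings) and labels (list of categories) from Week 1.
# test_size=0.2 is an illustrative choice, not a requirement.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectoriser and classifier in one object, so TF-IDF is fitted on training data only
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, C=1.0)),
])
pipeline.fit(X_train, y_train)

# Per-class precision, recall, and F1 on the held-out test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))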

deployment

Deploy to Streamlit Cloud (Free)

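train.py / app.py

# train.py: save the trained pipeline (vectoriser + classifier in one file)
import joblib
joblib.dump(pipeline, 'model.pkl')

# app.py: load the model and predict on pasted text
import joblib
import streamlit as st

model = joblib.load('model.pkl')
article_text = st.text_area('Paste a news article')
if article_text:
    prediction = model.predict([article_text])[0]
    confidence = model.predict_proba([article_text]).max()
    st.write(f'{prediction} ({confidence:.0%})')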

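Create a free account at share.streamlit.io, connect your GitHub repo, and deploy the app from there. The repo needs the app script, model.pkl, and a requirements.txt listing streamlit, scikit-learn, and joblib. The app gets a public URL once the build finishes.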

viva prep

10 Viva Questions with Answers

Q1. What is TF-IDF and why is it used for text classification?

TF-IDF = Term Frequency × Inverse Document Frequency. Words that appear often in a document but rarely in others get high scores. This identifies discriminative words for classification.
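
A worked example of the textbook formula, tf × log(N / df), on three made-up documents. Note that sklearn's TfidfVectorizer uses a smoothed IDF and L2-normalises each row by default, so its numbers differ from these while the ranking idea stays the same.

```python
import math

docs = [
    "goal goal match team",
    "bank profit market team",
    "market shares bank rates",
]
tokenised = [d.split() for d in docs]


def tf(term, tokens):
    """Share of the document's tokens that are this term."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """log(N / df): rarer across documents means a higher value."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)


def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


goal_score = tfidf("goal", tokenised[0], tokenised)
team_score = tfidf("team", tokenised[0], tokenised)
print(round(goal_score, 3))  # 0.549
print(round(team_score, 3))  # 0.101
```

"goal" is frequent in the first document and absent elsewhere, so it scores about five times higher than "team", which also appears in a business document.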

Q2. Why do you use ngram_range=(1,2) in TF-IDF?

Unigrams (single words) + bigrams (word pairs) capture more context. "machine learning" as a bigram is more informative than "machine" and "learning" separately.
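
The effect of ngram_range on the vocabulary can be checked directly. A small sketch on a single made-up sentence:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["machine learning beats manual rules"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(text)
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(text)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

The second vocabulary keeps the five single words and adds four word pairs, including "machine learning" as its own feature. The cost is a larger feature space, which is why max_features is usually set alongside bigrams.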

Q3. What is the difference between Naive Bayes and Logistic Regression for text classification?

Naive Bayes: generative, models how each class produces words, assumes features are independent given the class, fast, works well with small data. Logistic Regression: discriminative, learns the decision boundary directly, usually more accurate with enough data.

Q4. Why use a sklearn Pipeline?

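A Pipeline chains transformers and a final estimator into one object. It prevents data leakage (TF-IDF is fitted only on training data, including inside each cross-validation fold), enables grid search over the parameters of every step, and simplifies deployment (one object handles both vectorisation and prediction).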
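
Grid search over a Pipeline addresses each step's parameters with the step__parameter naming convention. A minimal sketch on a made-up twelve-sentence corpus; the grid values are illustrative, not tuned recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "team wins the cup final",
    "striker scores late goal",
    "coach names squad for the match",
    "league title race goes on",
    "injury rules out the captain",
    "fans cheer the home team goal",
    "bank profit rises again",
    "shares fall as rates climb",
    "market rallies on oil news",
    "investors buy bank shares",
    "profit warning hits the stocks",
    "oil prices lift the market",
]
labels = ["sport"] * 6 + ["business"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys are <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Within each fold the whole pipeline is refitted, so TF-IDF never sees the
# held-out part of the data.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
```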

Q5. What preprocessing steps improve classification accuracy?

Lowercasing, removing stopwords, stemming/lemmatisation, removing HTML tags, handling negations (not_good as one token), removing rare words.
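
Two of these steps, HTML removal and negation handling, are easy to get wrong, so here is a minimal sketch. The NEGATIONS set and both helper names are hypothetical, and the regex-based tag stripping is only adequate for simple markup.

```python
import re

NEGATIONS = {"not", "no", "never"}


def strip_html(text):
    """Replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def join_negations(text):
    """Glue a negation word to the following token: 'not good' -> 'not_good'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


raw = "<p>The results were <b>not good</b> for investors</p>"
processed = join_negations(strip_html(raw))
print(processed)  # the results were not_good for investors
```

TfidfVectorizer's default token pattern treats the underscore as a word character, so not_good survives vectorisation as a single feature.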

Q6. How would you handle class imbalance if Sports has 1000 articles but Politics has 100?

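Set class_weight='balanced' in LogisticRegression so that errors on minority classes are weighted more heavily. Alternatives: upsample the minority classes (SMOTE interpolates numeric feature vectors, so for text it is applied after vectorisation, not to the raw text), downsample the majority class, or use ensemble methods that are robust to imbalance.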
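
The weights that the 'balanced' setting would assign in this scenario can be computed directly. sklearn's formula is n_samples / (n_classes * class_count); the sketch below applies it to the 1000 versus 100 split from the question.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(["sports"] * 1000 + ["politics"] * 100)
classes = np.unique(y)  # sorted: politics, sports

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = {str(c): round(float(w), 2) for c, w in zip(classes, weights)}

print(class_weights)  # {'politics': 5.5, 'sports': 0.55}
```

Each Politics article therefore counts ten times as much as a Sports article in the loss, which offsets the ten-to-one imbalance.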

Q7. What is the difference between macro and weighted F1-score?

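Macro F1: unweighted mean of the per-class F1 scores, so every class counts equally. Weighted F1: mean of the per-class F1 scores weighted by class frequency, so large classes dominate. On imbalanced datasets, weighted F1 tracks overall performance while macro F1 exposes weak minority classes, so report both.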
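
To see how far the two averages can diverge, consider a made-up test set of nine Sports articles and one Politics article, and a model that predicts Sports every time.

```python
from sklearn.metrics import f1_score

y_true = ["sports"] * 9 + ["politics"]
y_pred = ["sports"] * 10  # the minority class is never predicted

# zero_division=0 scores the never-predicted class as 0 without a warning.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(round(macro, 3))     # 0.474
print(round(weighted, 3))  # 0.853
```

Sports has an F1 of about 0.947 and Politics has 0. The macro average lands halfway between them, while the weighted average stays high because Sports makes up 90% of the samples.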

Q8. How would you improve accuracy beyond TF-IDF + LogReg?

Use sentence embeddings (HuggingFace sentence-transformers) instead of TF-IDF, or fine-tune a BERT model on the dataset. TF-IDF ignores meaning and any word order beyond the n-grams you include; transformers capture both.

Q9. What are the limitations of your approach?

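TF-IDF is a bag-of-words representation: it ignores context and any word order beyond the n-grams included. Words outside the training vocabulary (OOV words) are dropped at prediction time. The approach needs labelled training data and cannot classify articles into categories not seen during training.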
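
The first two limitations are easy to demonstrate with a unigram vectoriser fitted on two made-up sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()  # unigrams only
vectorizer.fit(["dog bites man", "man walks dog"])

# Word order is lost: opposite meanings, identical vectors.
a = vectorizer.transform(["dog bites man"]).toarray()
b = vectorizer.transform(["man bites dog"]).toarray()
same_vector = bool((a == b).all())

# Unseen words are dropped: the row has no non-zero entries.
oov = vectorizer.transform(["cryptocurrency blockchain"])
oov_nonzero = oov.nnz

print(same_vector, oov_nonzero)  # True 0
```

An article made up mostly of unseen vocabulary therefore reaches the classifier as a near-empty vector, and the prediction falls back on the model's intercept.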

Q10. How would you add a new category to the classifier?

Collect 200+ labelled examples of the new category and retrain the model from scratch with the new class included. Then re-evaluate all classes, because the new class may reduce accuracy on the existing ones.
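
A minimal sketch of the retraining step, using a handful of made-up sentences in place of the 200+ real examples. Calling fit again refits every step, so the TF-IDF vocabulary is relearned along with the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "team wins the cup final",
    "striker scores late goal",
    "bank profit rises again",
    "shares fall as rates climb",
]
labels = ["sport", "sport", "business", "business"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)
classes_before = list(pipeline.classes_)

# New category: append its labelled examples and refit on everything.
tech_texts = ["new phone chip launched", "software update fixes browser bug"]
pipeline.fit(texts + tech_texts, labels + ["tech"] * len(tech_texts))
classes_after = list(pipeline.classes_)

print(classes_before)  # ['business', 'sport']
print(classes_after)   # ['business', 'sport', 'tech']
```

After refitting, rerun classification_report on a test set that includes the new class to check whether the existing categories lost precision or recall.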