Mini Project · ~2–3 Weeks · Beginner Friendly

News Category Classifier

Train a multi-class text classifier to categorise news articles into topics (Sports, Tech, Business, Politics, Entertainment) using TF-IDF and scikit-learn.

Python · scikit-learn · TF-IDF · Streamlit · pandas · NLTK
Includes: Source code + README · Milestone breakdown · Deployment guide · 10 viva Q&A

A text classification system that automatically categorises news articles with a trained ML model. It demonstrates the full ML pipeline, from data to deployment.

  • Train on BBC News or AG News dataset (free, public)
  • TF-IDF feature extraction with sklearn Pipeline
  • Logistic Regression + Naive Bayes + Random Forest comparison
  • Confusion matrix and classification report visualisation
  • Live prediction: paste any news article, get category + confidence
  • Deployed Streamlit app with model loaded from pickle

Before You Start

  • Python basics and pandas
  • ML course: Classification topic
  • Basic understanding of text vectorisation

How It Works

📰 News Article (raw text) → 🧹 Text Cleaning (NLTK preprocessing) → 📊 TF-IDF Vectorisation → 🤖 Classifier (LogReg/NB/RF) → 📂 Category Prediction

Milestone Breakdown

Week 1: Data + Feature Engineering
  • Load the BBC News dataset from Kaggle (it is not bundled with sklearn.datasets)
  • Text preprocessing: lowercase, remove stopwords, stemming (NLTK)
  • TF-IDF vectorisation, starting with max_features=5000 (the Core Implementation below uses 10000 because it also includes bigrams)
  • Train-test split and baseline accuracy check
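The Week 1 steps can be sketched roughly as follows. This is a minimal illustration, not the project's reference code: the eight-sentence corpus is made up, `clean_text` is a hypothetical helper name, and scikit-learn's built-in `ENGLISH_STOP_WORDS` is swapped in for NLTK's stopword list so that the sketch runs without downloading NLTK data. The baseline here is simply "always predict the most common training class".

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()


def clean_text(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]
    return " ".join(kept)


# Tiny made-up corpus, only so the sketch runs end to end.
texts = [
    "The striker scored twice in the final",
    "The team won the championship match",
    "The coach praised the players after the game",
    "The goalkeeper saved a late penalty",
    "The new phone ships with a faster chip",
    "The startup released an update for its app",
    "Engineers are testing the new software",
    "The laptop has a brighter screen",
]
labels = ["sport"] * 4 + ["tech"] * 4

cleaned = [clean_text(t) for t in texts]

# Stratify so both classes appear in the train and test portions.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, stratify=labels, random_state=42
)

# Baseline: always predict the most frequent training class.
majority_class = Counter(y_train).most_common(1)[0][0]
baseline_accuracy = sum(y == majority_class for y in y_test) / len(y_test)

print(cleaned[0])
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

Any model trained in Week 2 should beat this baseline comfortably; if it does not, the features or labels are probably at fault.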
Week 2: Model Training + Evaluation
  • Train Logistic Regression, Naive Bayes, and Random Forest
  • Compare accuracy, F1-score, and training time
  • Plot confusion matrix with seaborn heatmap
  • Save best model with pickle or joblib
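One way the Week 2 comparison could look is sketched below. The twelve-sentence corpus and the `results` dictionary layout are assumptions made for illustration, and scores on data this small are meaningless. The point is the loop structure: the same TF-IDF step, three classifiers, and the same metrics for each.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 3 categories, 4 articles each.
texts = [
    "the striker scored a late goal", "the team won the league match",
    "the coach named the squad for the cup", "fans cheered the winning goal",
    "the company reported higher profit", "shares fell after the earnings report",
    "the bank raised interest rates", "investors bought shares in the company",
    "the new phone has a faster chip", "the app update fixes a software bug",
    "engineers released the new software", "the laptop chip runs the app faster",
]
labels = ["sport"] * 4 + ["business"] * 4 + ["tech"] * 4
class_names = sorted(set(labels))

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, clf in candidates.items():
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro", zero_division=0),
        "train_seconds": train_seconds,
        "confusion": confusion_matrix(y_test, y_pred, labels=class_names),
    }

for name, r in results.items():
    print(f"{name}: acc={r['accuracy']:.2f} f1={r['macro_f1']:.2f} "
          f"time={r['train_seconds']:.3f}s")
```

The `confusion` array for the chosen model can then be passed to `seaborn.heatmap(cm, annot=True)` for the plot mentioned in the milestone.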
Week 3: Streamlit App + Deploy
  • Load pickled model in Streamlit app
  • Build text input + prediction UI with confidence scores
  • Deploy to Streamlit Cloud
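The "category + confidence" output could be produced along these lines. This is a sketch under assumptions: `predict_with_confidence` is a hypothetical helper, the training sentences are invented, and "confidence" is taken to mean the highest class probability returned by `predict_proba`, which is available for Logistic Regression, Naive Bayes, and Random Forest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def predict_with_confidence(model, article_text):
    """Return (label, probability) for the most likely class."""
    probabilities = model.predict_proba([article_text])[0]
    best = int(probabilities.argmax())
    return str(model.classes_[best]), float(probabilities[best])


# Stand-in for the pickled pipeline: a tiny model trained on made-up text.
texts = [
    "the striker scored a late goal", "the team won the league match",
    "the new phone has a faster chip", "the app update fixes a software bug",
]
labels = ["sport", "sport", "tech", "tech"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

label, confidence = predict_with_confidence(model, "the striker scored again")
print(f"{label} ({confidence:.0%})")
```

In the Streamlit app, the article text would typically come from `st.text_area` and the result would be shown with `st.write`.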

Core Implementation

classifier.py
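# Assumes X_train, X_test (article texts) and y_train, y_test (category labels)
# already exist from the Week 1 train-test split.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

pipeline = Pipeline([
    # Unigrams + bigrams, capped at 10,000 features, English stopwords removed
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=10000, stop_words='english')),
    # max_iter raised so the solver converges on high-dimensional TF-IDF features
    ('clf', LogisticRegression(max_iter=1000, C=1.0)),
])
pipeline.fit(X_train, y_train)  # fits TF-IDF and the classifier on training data only
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1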

Deploy to Streamlit Cloud (Free)

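app.py
# At the end of training (e.g. in classifier.py): save the fitted pipeline
import joblib
joblib.dump(pipeline, 'model.pkl')

# In the Streamlit app: load the pipeline and predict on user input
import streamlit as st
model = joblib.load('model.pkl')
article_text = st.text_area('Paste a news article')
if article_text:
    prediction = model.predict([article_text])[0]
    st.write(prediction)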

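Create a free account at share.streamlit.io, connect your GitHub repo, and deploy from the dashboard. The repo should contain the app script, model.pkl, and a requirements.txt listing the dependencies (ideally the same scikit-learn version used for training, since pickled models are not guaranteed to load across versions). Once the build finishes, the app gets a public URL.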

10 Viva Questions with Answers

Q1. What is TF-IDF and why is it used for text classification?
TF-IDF = Term Frequency × Inverse Document Frequency. Words that appear often in a document but rarely in others get high scores. This identifies discriminative words for classification.
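A small worked example may make the scoring concrete. The sketch below uses the textbook definition (tf = term count / document length, idf = log(N / df)) on three invented sentences; the function name `tf_idf` is illustrative. Note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalisation by default, so its numbers differ from these, although the ranking intuition is the same.

```python
import math
from collections import Counter

# Three made-up, already tokenised documents.
docs = [
    "the match ended in a draw".split(),
    "the market ended higher".split(),
    "the team won the match".split(),
]


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # term never seen in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf


for term in ("the", "match", "draw"):
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

"the" occurs in every document, so its idf is log(1) = 0 and it scores zero. "draw" occurs in only one document and scores highest, which is what makes it a discriminative word.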
Q2. Why do you use ngram_range=(1,2) in TF-IDF?
Unigrams (single words) + bigrams (word pairs) capture more context. "machine learning" as a bigram is more informative than "machine" and "learning" separately.
Q3. What is the difference between Naive Bayes and Logistic Regression for text classification?
Naive Bayes: probabilistic, assumes feature independence, fast, works well with small data. Logistic Regression: discriminative, learns decision boundaries, usually more accurate with enough data.
Q4. Why use a sklearn Pipeline?
Pipeline chains transformers and classifiers, preventing data leakage (TF-IDF fit only on training data), enabling grid search over all parameters, and simplifying deployment (one object does everything).
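The grid-search point can be illustrated with a short sketch. The corpus and the parameter grid are assumptions chosen to keep the example small. The `step__parameter` naming is how scikit-learn addresses parameters inside a Pipeline, and because the vectoriser sits inside the pipeline it is refit on the training portion of every cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 2 categories, 4 articles each.
texts = [
    "the striker scored a late goal", "the team won the league match",
    "the coach named the squad", "fans cheered the winning goal",
    "the new phone has a faster chip", "the app update fixes a software bug",
    "engineers released new software", "the laptop chip runs apps faster",
]
labels = ["sport"] * 4 + ["tech"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys are "<step name>__<parameter name>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is itself a fitted pipeline, so it can be saved and deployed as a single object.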
Q5. What preprocessing steps improve classification accuracy?
Lowercasing, removing stopwords, stemming/lemmatisation, removing HTML tags, handling negations (not_good as one token), removing rare words.
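The negation-handling step is the least obvious of these, so here is a minimal sketch. The helper name `mark_negations` and the three-word negation list are assumptions for illustration; a real project would likely use a longer list.

```python
import re

# Illustrative negation words; extend as needed.
NEGATIONS = {"not", "no", "never"}


def mark_negations(text):
    """Join each negation word to the following token, e.g. 'not good' -> 'not_good'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the token that was merged
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)


print(mark_negations("The sequel was not good"))
```

This step has to run before stopword removal, because common stopword lists (including scikit-learn's English list) contain "not" and would delete it first. The merged token survives `TfidfVectorizer`'s default tokenisation because the underscore counts as a word character.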
Q6. How would you handle class imbalance if Sports has 1000 articles but Politics has 100?
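Set class_weight='balanced' in LogisticRegression so that errors on rare classes are penalised more heavily. Alternatives: oversample the minority classes (SMOTE interpolates feature vectors, so for text it is applied after vectorisation rather than to raw articles), downsample the majority class, or use ensemble methods that are robust to imbalance.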
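For the numbers in the question, the "balanced" heuristic documented by scikit-learn, n_samples / (n_classes * class_count), works out as sketched below. The function `balanced_class_weights` is a hand-written illustration of that formula, not a library call.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class = n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = ["Sports"] * 1000 + ["Politics"] * 100
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(cls, round(weight, 2))
```

Each Politics article ends up weighted ten times as heavily as each Sports article, so both classes contribute equally to the total loss.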
Q7. What is the difference between macro and weighted F1-score?
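Macro F1: the unweighted mean of the per-class F1 scores, so every class counts equally. Weighted F1: the mean of the per-class F1 scores weighted by class frequency (support). On imbalanced datasets, report both: weighted F1 reflects overall performance but is dominated by the large classes, while macro F1 exposes poor performance on rare classes.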
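A tiny hand-computed example shows how far the two averages can diverge. The labels below are invented: nine Sports articles, one Politics article, and a model that predicts Sports for everything. The helper names are illustrative; in practice `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"` computes the same quantities.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per class, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[cls] * support[cls] for cls in scores) / len(y_true)


y_true = ["Sports"] * 9 + ["Politics"]
y_pred = ["Sports"] * 10  # the model never predicts Politics

print("macro   :", round(macro_f1(y_true, y_pred), 3))
print("weighted:", round(weighted_f1(y_true, y_pred), 3))
```

The weighted score stays high because Sports dominates the average, while the macro score is roughly halved by the zero F1 on Politics.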
Q8. How would you improve accuracy beyond TF-IDF + LogReg?
Use sentence embeddings (HuggingFace sentence-transformers) instead of TF-IDF. Fine-tune a BERT model on the dataset. TF-IDF ignores word order and meaning; transformers capture both.
Q9. What are the limitations of your approach?
TF-IDF is a bag-of-words representation: it ignores word order and context (bigrams recover only local order). Words not seen during training (OOV words) are silently dropped at prediction time. The model needs labelled training data and cannot assign categories it was not trained on.
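The out-of-vocabulary limitation is easy to demonstrate. In this sketch (invented sentences), a fitted vectoriser turns an article made entirely of unseen words into an all-zero vector, so the classifier has nothing to work with.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit([
    "the team won the match",
    "the market closed higher",
])

seen = vectorizer.transform(["the team won"])
unseen = vectorizer.transform(["cryptocurrency blockchain"])

# nnz = number of non-zero entries in the sparse vector
print("known words  :", seen.nnz)
print("unknown words:", unseen.nnz)
```

A model given the second vector can only fall back on its learned class priors, which is why periodic retraining on recent articles matters for news data.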
Q10. How would you add a new category to the classifier?
Collect 200+ labelled examples of the new category. Retrain the model with the new class. Re-evaluate all classes — new class may affect existing accuracy.