Q1. What is TF-IDF and why is it used for text classification?
TF-IDF = Term Frequency × Inverse Document Frequency. A word scores high when it appears often in one document but in few other documents in the corpus. Such words tend to be discriminative, which makes TF-IDF weights useful features for classification.
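As a rough illustration of the formula, the sketch below computes the textbook form tf × log(N / df) by hand on a made-up three-document corpus. The documents and the helper name `tf_idf` are assumptions for the example. sklearn's TfidfVectorizer uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents), tokenised by whitespace.
docs = [
    "the match ended in a draw",
    "the election ended in a recount",
    "the striker scored in the match",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenised for term in set(tokens))

def tf_idf(term, tokens):
    """Relative term frequency times log inverse document frequency."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf

# Score every term of the second document, highest first.
scores = {term: tf_idf(term, tokenised[1]) for term in set(tokenised[1])}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.3f}")
```

Words such as "the" and "in" occur in every document, so their IDF is log(1) = 0, while "election" and "recount" occur in only one document and rank highest.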
Q2. Why do you use ngram_range=(1,2) in TF-IDF?
Unigrams (single words) + bigrams (word pairs) capture more context. "machine learning" as a bigram is more informative than "machine" and "learning" separately.
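To make the feature set concrete, here is a small pure-Python sketch of which tokens a (1, 2) range yields for one sentence. The helper `ngrams` is hypothetical and only mimics the idea; it is not sklearn's implementation.

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous n-grams for n in the inclusive range."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

tokens = "machine learning beats manual rules".split()
features = ngrams(tokens)
print(features)
```

The five words yield five unigrams plus four bigrams, including "machine learning". The cost is a larger, sparser vocabulary, which is why ngram_range is often tuned rather than fixed.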
Q3. What is the difference between Naive Bayes and Logistic Regression for text classification?
Naive Bayes: generative, assumes features are conditionally independent given the class, fast to train, often works well with small data. Logistic Regression: discriminative, learns the decision boundary directly, usually more accurate with enough data. Both output class probabilities.
Q4. Why use a sklearn Pipeline?
A Pipeline chains transformers and a final classifier into one estimator. This prevents data leakage (TF-IDF is fit only on training data, including inside each cross-validation fold), enables grid search over the parameters of every step, and simplifies deployment (one object does everything).
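A minimal sketch of such a pipeline with a grid search over both steps. The toy texts, labels, step names ("tfidf", "clf") and parameter values are assumptions chosen for illustration, not the settings of any particular project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; a real project would load its own articles.
texts = [
    "striker scores late goal in cup final",
    "team wins the league title after penalty shootout",
    "coach praises defence after away match",
    "tennis champion wins the final set",
    "parliament votes on the new budget bill",
    "minister resigns after election defeat",
    "senate debates the tax reform bill",
    "president signs the trade agreement",
]
labels = ["sports"] * 4 + ["politics"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step>__<param>" addresses a parameter of a named pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# In each CV split the vectoriser is refit on the training fold only,
# so validation text never influences the vocabulary or IDF weights.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["the goal decided the match"]))
```

After fitting, `search` accepts raw strings directly, which is the "one object" convenience mentioned above.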
Q5. What preprocessing steps improve classification accuracy?
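Common candidates: lowercasing, removing stopwords, stemming/lemmatisation, removing HTML tags, handling negations (not_good as one token), removing rare words. The gain from each step depends on the dataset, so check each one on validation data.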
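Several of these steps can be sketched with the standard library alone. The stopword and negation lists below are tiny, made-up examples, and the function `preprocess` is a hypothetical helper; stemming/lemmatisation and rare-word removal are left out to keep the sketch short.

```python
import html
import re

STOPWORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "in"}  # illustrative only
NEGATIONS = {"not", "no", "never"}

def preprocess(text):
    """Strip HTML, lowercase, drop stopwords, fuse negations into the next kept token."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))  # remove tags, decode entities
    tokens = re.findall(r"[a-z']+", text.lower())
    out = []
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True  # remember the negation for the next content word
            continue
        if tok in STOPWORDS:
            continue
        out.append(f"not_{tok}" if negate else tok)
        negate = False
    return out

print(preprocess("<p>The film was NOT good &amp; the plot is thin</p>"))
# -> ['film', 'not_good', 'plot', 'thin']
```

A function like this could be passed to a vectoriser as a custom tokeniser, or applied to the texts before vectorising.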
Q6. How would you handle class imbalance if Sports has 1000 articles but Politics has 100?
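class_weight="balanced" in LogisticRegression weights each class inversely to its frequency. Alternatives: upsample the minority class (SMOTE cannot be applied to raw text; it has to operate on the vectorised features), downsample the majority class, or use ensemble methods that are robust to imbalance.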
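For the Sports/Politics numbers in the question, the weights that the "balanced" setting corresponds to can be worked out by hand. The sketch below follows the formula given in the scikit-learn documentation, n_samples / (n_classes * class_count); the helper name `balanced_class_weights` is made up for the example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = ["Sports"] * 1000 + ["Politics"] * 100
weights = balanced_class_weights(labels)
print(weights)
```

Each Politics article ends up with ten times the weight of a Sports article (5.5 versus 0.55), so both classes contribute equally to the loss in total.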
Q7. What is the difference between macro and weighted F1-score?
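Macro F1: unweighted mean of the per-class F1 scores (every class counts the same). Weighted F1: mean of the per-class F1 scores weighted by class frequency (support). On imbalanced data, weighted F1 is dominated by the majority class, so report macro F1 when minority-class performance matters and weighted F1 when the score should reflect the overall class mix.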
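A small worked example, using made-up predictions for 10 sports and 2 politics articles, shows how the two averages diverge. The helper `f1_per_class` is a hypothetical hand-rolled version for illustration; in practice sklearn.metrics.f1_score with average="macro" or average="weighted" does the same job.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores = {}
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores

# Hypothetical results: one of the two politics articles is misclassified.
y_true = ["sports"] * 10 + ["politics"] * 2
y_pred = ["sports"] * 10 + ["sports", "politics"]

per_class = f1_per_class(y_true, y_pred)
support = Counter(y_true)

macro = sum(per_class.values()) / len(per_class)
weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)

print(f"macro F1:    {macro:.3f}")
print(f"weighted F1: {weighted:.3f}")
```

Here the politics F1 is 2/3 and the sports F1 is 20/21. Macro F1 averages them equally, while weighted F1 leans towards the sports score because sports has five times the support.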
Q8. How would you improve accuracy beyond TF-IDF + LogReg?
Use sentence embeddings (HuggingFace sentence-transformers) instead of TF-IDF, or fine-tune a BERT model on the dataset. TF-IDF captures at most local word order (through n-grams) and no semantic similarity between words; transformers model both.
Q9. What are the limitations of your approach?
TF-IDF is a bag-of-words representation: it ignores word order and context beyond the chosen n-grams. It does not handle new vocabulary (OOV words are dropped at prediction time). It needs labelled training data. It cannot classify articles into categories not seen during training.
Q10. How would you add a new category to the classifier?
Collect 200+ labelled examples of the new category. Retrain the whole model, including the TF-IDF vocabulary, with the new class. Re-evaluate all classes, since the new class may change accuracy on the existing ones.