Mini Project · ~2–3 Weeks · Beginner Friendly

Image Caption Generator

Build a Streamlit app that generates natural-language captions for uploaded images using the HuggingFace BLIP model. No GPU is needed.

Python · Streamlit · HuggingFace · BLIP · PIL · transformers
Source code + README · Milestone breakdown · Deployment guide · 10 viva Q&A

An image captioning tool that takes any photo and generates a descriptive caption using the BLIP (Bootstrapping Language-Image Pre-training) model from Salesforce.

  • Upload any image (JPG, PNG, WebP) and get an automatic caption
  • Conditional captioning: provide a hint like "a photo of a" to guide the caption
  • Batch mode: upload multiple images, get a CSV of captions
  • Download captions as a text file
  • Display image alongside generated caption in a clean UI

Before You Start

  • Python basics and file handling
  • Basic understanding of neural networks
  • DL course: Multimodal Models topic recommended

How It Works

🖼 Image Upload (Streamlit) → 🔄 PIL Image Processing → 🤗 BLIP Model (HuggingFace) → 📝 Generated Caption + Display

Milestone Breakdown

Week 1: Core Captioning
  • Load BLIP model from HuggingFace (Salesforce/blip-image-captioning-base)
  • Process image with BlipProcessor
  • Generate unconditional and conditional captions
  • Test with 20 diverse images
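
The two captioning modes in Week 1 differ only in whether a text prefix is passed to the processor. The following is a minimal sketch, not the project's reference code: `generate_caption` is a hypothetical helper name, and the processor and model are passed in as arguments so the function can be exercised without downloading weights.

```python
def generate_caption(image, processor, model, prompt=None, max_new_tokens=50):
    """Return a caption for `image`.

    Without `prompt`, the model captions from scratch (unconditional).
    With `prompt`, the model continues the given prefix (conditional).
    `processor` and `model` are expected to behave like BlipProcessor and
    BlipForConditionalGeneration; this helper itself is hypothetical.
    """
    if prompt:
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    else:
        inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


# Usage with the real model (downloads weights on first run):
#   from transformers import BlipProcessor, BlipForConditionalGeneration
#   from PIL import Image
#   name = "Salesforce/blip-image-captioning-base"
#   processor = BlipProcessor.from_pretrained(name)
#   model = BlipForConditionalGeneration.from_pretrained(name)
#   image = Image.open("photo.jpg").convert("RGB")
#   print(generate_caption(image, processor, model))
#   print(generate_caption(image, processor, model, prompt="a photo of a"))
```

In conditional mode the decoded text normally begins with the prefix itself, so the app can show it as one complete sentence.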
Week 2: Streamlit UI + Batch Mode
  • Build image upload UI with st.image() display
  • Add conditional caption text input
  • Implement batch upload with progress bar
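
For batch mode, the captions have to end up in a CSV that the user can download. One possible way to build that file in memory is sketched below using only the standard library; `captions_to_csv` is a hypothetical helper name, and the two-column layout (filename, caption) is an assumption.

```python
import csv
import io


def captions_to_csv(rows):
    """Build CSV text from (filename, caption) pairs, with a header row.

    Hypothetical helper: the column names are an assumption. The csv
    module quotes captions that contain commas or quote characters.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["filename", "caption"])
    for filename, caption in rows:
        writer.writerow([filename, caption])
    return buffer.getvalue()


if __name__ == "__main__":
    print(captions_to_csv([("beach.jpg", "a dog running on the beach")]))
```

In the Streamlit app, the rows would come from looping over the files returned by `st.file_uploader(..., accept_multiple_files=True)` while updating `st.progress`, and the resulting string can be passed as `data` to `st.download_button`.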
Week 3: Polish + Deploy
  • Add caption download button
  • Optimise: cache model with @st.cache_resource
  • Deploy to Hugging Face Spaces (the free tier runs on CPU; GPU hardware is a paid upgrade)

Core Implementation

caption.py
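import streamlit as st
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"

@st.cache_resource  # load once per server process and reuse across reruns
def load_model():
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)
    return processor, model

processor, model = load_model()

# st.file_uploader returns None until the user picks a file
uploaded_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png", "webp"])
if uploaded_file is not None:
    # Convert to 3-channel RGB so RGBA and grayscale inputs also work
    image = Image.open(uploaded_file).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50)
    caption = processor.decode(out[0], skip_special_tokens=True)
    st.image(image, caption=caption)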

Deploy to Streamlit Cloud (Free)

deploy.sh
# requirements.txt: transformers torch streamlit Pillow
# Option 1: Streamlit Cloud (CPU, slow first load)
# Option 2: HuggingFace Spaces (free tier is CPU; a T4 GPU is a paid hardware upgrade)
#   huggingface.co/new-space → Gradio/Streamlit SDK

Create a free account at share.streamlit.io, connect your GitHub repo, and deploy in one click. Your app gets a public URL once the first build finishes.

10 Viva Questions with Answers

Q1. What is the BLIP model and what was it trained on?
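BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model from Salesforce. It was pre-trained on up to roughly 129M web image-text pairs, with a captioner and a filter used to clean noisy web captions (the "bootstrapping" in its name). The captioning checkpoint used here is additionally fine-tuned on COCO. It uses a vision encoder (ViT) + text decoder for caption generation.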
Q2. What is the difference between unconditional and conditional captioning?
Unconditional: the model generates the caption from scratch. Conditional: you provide a text prefix ("a photo of a dog...") that the model completes. The prefix steers the wording and focus of the caption, but it does not guarantee a more accurate one.
Q3. Why use @st.cache_resource for the model?
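Loading the model takes several seconds, and longer on the first run when the weights are downloaded. Streamlit reruns the script on every user interaction, so without caching the model would be reloaded each time. cache_resource caches the model object globally: it is loaded once and shared across all sessions.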
Q4. What is PIL and why convert to RGB?
PIL (Pillow) is Python's image library. Some images are RGBA (with alpha channel) or grayscale. BLIP expects 3-channel RGB. Converting to RGB ensures compatibility regardless of input format.
Q5. How does the BLIP model work at a high level?
Vision encoder (ViT) converts the image to patch embeddings. Text decoder (transformer) generates caption tokens autoregressively, attending to both the image embeddings and previous caption tokens.
Q6. What is ViT (Vision Transformer) and how does it differ from CNNs?
ViT splits the image into fixed-size patches, embeds each patch, and processes them with a standard Transformer. CNNs use convolutional filters with local receptive fields. ViT has global attention from the first layer.
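
The patch-splitting step in the answer above can be illustrated on a toy grid. This is a sketch in plain Python, not how any library implements it: `split_into_patches` and `patch_count` are hypothetical names, and the sizes in the usage comments (224 or 384 pixel inputs with 16 pixel patches) are assumed typical ViT configurations rather than values stated in this guide.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order, which is the order a ViT
    would see them as a token sequence. Hypothetical helper for illustration.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def patch_count(height, width, patch_size):
    """Number of patch tokens for an image of the given size."""
    return (height // patch_size) * (width // patch_size)


if __name__ == "__main__":
    toy = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 "image"
    print(split_into_patches(toy, 2)[0])   # top-left 2x2 patch
    print(patch_count(224, 224, 16))       # assumed 224x224 input, 16x16 patches
    print(patch_count(384, 384, 16))       # assumed 384x384 input, 16x16 patches
```

Each patch is then flattened and linearly projected to an embedding, and self-attention lets every patch attend to every other patch in the first layer, which is the "global attention" contrast with a CNN's local filters.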
Q7. How would you improve caption quality for domain-specific images?
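Fine-tune BLIP on domain-specific image-caption pairs. Collect 1000+ domain images with manually written captions, then fine-tune using HuggingFace Trainer on a GPU. As a rough guide, 1-2 hours of training is often enough for a dataset of this size, though the exact time depends on the hardware and the number of epochs.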
Q8. What are the limitations of BLIP-base?
It may miss small details, counts objects unreliably (says "a few birds" rather than "three birds"), struggles with text in images, fails on images very different from its training distribution, and has weak spatial reasoning.
Q9. How would you evaluate caption quality automatically?
BLEU score (n-gram overlap with reference caption), ROUGE (recall-based), CIDEr (consensus-based), METEOR (synonym-aware). Compare against human-written reference captions on a test set.
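
The n-gram overlap idea behind BLEU can be shown with a simplified sketch. The code below computes only the clipped ("modified") n-gram precision against a single reference, with whitespace tokenisation; it is not full BLEU, which also combines several n-gram orders and applies a brevity penalty. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate caption against one reference.

    Each candidate n-gram is credited at most as many times as it appears
    in the reference, so repeating a correct word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


if __name__ == "__main__":
    generated = "a dog runs on the grass"
    human = "a dog is running on the grass"
    print(modified_precision(generated, human, 1))  # unigram precision
    print(modified_precision(generated, human, 2))  # bigram precision
```

For a real evaluation, an established metric implementation and several human references per image would be preferable to this toy version.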
Q10. What is HuggingFace Spaces and why is it better for this project than Streamlit Cloud?
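HuggingFace Spaces integrates natively with the HuggingFace model hub, which makes it a natural host for vision models. BLIP inference on CPU is slow (roughly 5s/image), while on a GPU it can run in under a second. Note that the free Spaces tier is CPU-only: GPU hardware such as a T4 is a paid upgrade or needs a community grant, so the speed advantage applies only when a GPU is attached.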