Image Caption Generator

what you build

An image captioning tool that takes any photo and generates a descriptive caption using the BLIP (Bootstrapping Language-Image Pre-training) model from Salesforce.

Upload any image (JPG, PNG, WebP) and get an automatic caption
Conditional captioning: provide a hint like "a photo of a" to guide the caption
Batch mode: upload multiple images, get a CSV of captions
Download captions as a text file
Display image alongside generated caption in a clean UI

prerequisites

Before You Start

Python basics and file handling
Basic understanding of neural networks
DL course: Multimodal Models topic recommended

architecture

How It Works

🖼 Image Upload (Streamlit) → 🔄 PIL Image Processing → 🤗 BLIP Model (HuggingFace) → 📝 Generated Caption + Display

3-week plan

Milestone Breakdown

Week 1

Core Captioning

Load BLIP model from HuggingFace (Salesforce/blip-image-captioning-base)
Process image with BlipProcessor
Generate unconditional and conditional captions
Test with 20 diverse images

Week 2

Streamlit UI + Batch Mode

Build image upload UI with st.image() display
Add conditional caption text input
Implement batch upload with progress bar
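
The CSV half of batch mode can be sketched independently of the UI. This is a minimal sketch, assuming captions have already been generated for each file; the function name `captions_to_csv` and the column names are illustrative choices, not part of the original project.

```python
import csv
import io


def captions_to_csv(rows):
    """Serialize (filename, caption) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    # csv.writer quotes fields that contain commas or quotes, so captions stay intact.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["filename", "caption"])
    for filename, caption in rows:
        writer.writerow([filename, caption])
    return buffer.getvalue()
```

In the Streamlit app, the rows would typically come from looping over `st.file_uploader(..., accept_multiple_files=True)` while updating `st.progress`, and the returned text could be passed to `st.download_button` as the file contents.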

Week 3

Polish + Deploy

Add caption download button
Optimise: cache model with @st.cache_resource
Deploy to Hugging Face Spaces (free CPU tier; GPU hardware is a paid upgrade or a community grant)

key code

Core Implementation

caption.py

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import streamlit as st

@st.cache_resource
def load_model():
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    return processor, model

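processor, model = load_model()

# Ask for an image; file_uploader returns None until the user picks a file.
uploaded_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png", "webp"])
if uploaded_file is not None:
    # Convert to 3-channel RGB so RGBA and grayscale uploads also work.
    image = Image.open(uploaded_file).convert("RGB")
    # Preprocess the image, generate token IDs, then decode them to text.
    inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50)
    caption = processor.decode(out[0], skip_special_tokens=True)
    st.image(image, caption=caption)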

deployment

Deploy to Streamlit Cloud (Free)

deploy.sh

# requirements.txt: transformers torch streamlit Pillow
# Option 1: Streamlit Cloud (CPU, slow first load)
# Option 2: HuggingFace Spaces (free CPU tier; a T4 GPU is a paid upgrade or community grant)
#   huggingface.co/new-space → Gradio/Streamlit SDK

Create a free account at share.streamlit.io, connect your GitHub repo, and deploy in one click. Your app gets a public URL once the first build finishes.

viva prep

10 Viva Questions with Answers

Q1. What is the BLIP model and what was it trained on?

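BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model from Salesforce. It was pre-trained on about 129M image-text pairs from the web; the "bootstrapping" refers to filtering noisy web captions and replacing some of them with synthetic ones. For captioning it pairs a vision encoder (ViT) with a text decoder.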

Q2. What is the difference between unconditional and conditional captioning?

Unconditional: the model generates the caption from the image alone. Conditional: you provide a text prefix ("a photo of a dog...") that the model completes. Conditional captions tend to be more specific because the prefix steers what the model describes.
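
The two modes differ only in whether a text prefix is passed to the processor. The sketch below assumes the `Salesforce/blip-image-captioning-base` checkpoint named in the milestones; `load_blip` and `caption_image` are hypothetical helper names introduced for illustration.

```python
MODEL_NAME = "Salesforce/blip-image-captioning-base"


def load_blip(model_name=MODEL_NAME):
    """Load the BLIP processor and model (downloads weights on first use)."""
    # Imported lazily so caption_image can be exercised without the heavy dependencies.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)
    return processor, model


def caption_image(image, processor, model, prompt=None, max_new_tokens=50):
    """Caption a PIL image; with a prompt, the model continues that text prefix."""
    if prompt:
        # Conditional: the prefix is tokenized and the decoder completes it.
        inputs = processor(image, prompt, return_tensors="pt")
    else:
        # Unconditional: only the image is passed in.
        inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

A typical call would be `processor, model = load_blip()` followed by `caption_image(image, processor, model, prompt="a photo of a")`, where `image` is an RGB PIL image. Leaving out `prompt` gives the unconditional caption.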

Q3. Why use @st.cache_resource for the model?

Streamlit reruns the whole script on every user interaction, so without caching the model (roughly 5-10 seconds to load) would be reloaded each time. @st.cache_resource stores the returned model object globally: it is loaded once and shared across all reruns and sessions.
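
The load-once behaviour can be illustrated outside Streamlit. The sketch below swaps in `functools.lru_cache` from the standard library as a stand-in for `@st.cache_resource`; it is an analogy only, not how Streamlit implements its cache, and `load_model_once` and `load_calls` are hypothetical names.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=1)
def load_model_once():
    """Pretend to load a model; the body runs only on the first call."""
    global load_calls
    load_calls += 1
    # Placeholder object standing in for the (processor, model) pair.
    return {"name": "stand-in model"}


first = load_model_once()
second = load_model_once()  # served from the cache, no second load
```

Both calls return the same object, which is the property that matters for a large model: every rerun reuses the instance already in memory.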

Q4. What is PIL and why convert to RGB?

PIL (Pillow) is Python's image library. Some images are RGBA (with alpha channel) or grayscale. BLIP expects 3-channel RGB. Converting to RGB ensures compatibility regardless of input format.
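
A small check of what the conversion does to different input modes, assuming Pillow is installed. The helper name `ensure_rgb` is illustrative.

```python
from PIL import Image


def ensure_rgb(image):
    """Return the image unchanged if it is already RGB, else a 3-channel RGB copy."""
    if image.mode == "RGB":
        return image
    return image.convert("RGB")


# Synthetic inputs in modes an upload might have.
rgba = Image.new("RGBA", (4, 4), (255, 0, 0, 128))  # semi-transparent red
gray = Image.new("L", (4, 4), 200)                  # single-channel grayscale

rgba_as_rgb = ensure_rgb(rgba)  # alpha channel is dropped
gray_as_rgb = ensure_rgb(gray)  # gray value is copied into all three channels
```

Note that a plain `convert("RGB")` discards transparency rather than blending it onto a background, which is usually acceptable for captioning.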

Q5. How does the BLIP model work at a high level?

Vision encoder (ViT) converts the image to patch embeddings. Text decoder (transformer) generates caption tokens autoregressively, attending to both the image embeddings and previous caption tokens.
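
Autoregressive generation can be sketched as a loop that repeatedly asks for scores over the next token and appends the best one. This is a toy illustration of greedy decoding only: `toy_next_token_scores` is a hypothetical stand-in for the text decoder, and the real `generate()` call may use other strategies such as beam search.

```python
def greedy_decode(next_token_scores, image_features, bos="<bos>", eos="<eos>", max_new_tokens=10):
    """Build a caption one token at a time, always taking the highest-scoring token."""
    tokens = [bos]
    for _ in range(max_new_tokens):
        # The scorer sees both the image features and all tokens generated so far.
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token


def toy_next_token_scores(image_features, tokens):
    """Hypothetical scorer: follows a fixed script that depends on the image."""
    script = ["a", image_features["subject"], "outdoors", "<eos>"]
    position = len(tokens) - 1  # number of tokens generated so far
    target = script[min(position, len(script) - 1)]
    return {word: (1.0 if word == target else 0.0) for word in script}


caption_tokens = greedy_decode(toy_next_token_scores, {"subject": "dog"})
```

In the real model the scores come from the decoder's output distribution over the full vocabulary, and the image features are the ViT patch embeddings.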

Q6. What is ViT (Vision Transformer) and how does it differ from CNNs?

ViT splits the image into fixed-size patches, embeds each patch, and processes them with a standard Transformer. CNNs use convolutional filters with local receptive fields. ViT has global attention from the first layer.
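
The patch-splitting step can be shown on a tiny image stored as nested lists. This is a sketch of the idea only; the function name `split_into_patches` is illustrative, and the sizes are chosen for the example rather than taken from BLIP's configuration.

```python
def split_into_patches(pixels, patch_size):
    """Split a 2D grid of pixel values into flattened, non-overlapping square patches."""
    height, width = len(pixels), len(pixels[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the grid in row-major order, one patch-sized tile at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                pixels[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 image with values 0..15 split into 2x2 patches gives 4 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)
```

By the same arithmetic, a 384×384 input with 16×16 patches would yield 24 × 24 = 576 patches, each of which a ViT embeds as one token.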

Q7. How would you improve caption quality for domain-specific images?

Fine-tune BLIP on domain-specific image-caption pairs. Collect 1000+ domain images with manually written captions, then fine-tune with the HuggingFace Trainer on a GPU. At this dataset size, 1-2 hours of training is often sufficient, though the exact time depends on the hardware and the data.

Q8. What are the limitations of BLIP-base?

May miss small details, counting is inaccurate (says "a few birds" not "three birds"), struggles with text in images, fails on images very different from training distribution, no spatial reasoning.

Q9. How would you evaluate caption quality automatically?

BLEU score (n-gram overlap with reference caption), ROUGE (recall-based), CIDEr (consensus-based), METEOR (synonym-aware). Compare against human-written reference captions on a test set.
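
The n-gram overlap behind BLEU can be sketched in a few lines. This is a simplified illustration, not a full BLEU implementation: it computes clipped n-gram precision against a single reference and omits the brevity penalty and the averaging over n-gram orders. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clipping: an n-gram is credited at most as often as it appears in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


generated = "a dog runs on the beach"
reference = "a dog is running on the beach"
unigram_precision = clipped_precision(generated, reference, n=1)  # 5 of 6 words match
bigram_precision = clipped_precision(generated, reference, n=2)   # 3 of 5 bigrams match
```

For actual evaluation, an established metric implementation is preferable to this sketch, since details such as tokenization and multiple references affect the scores.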

Q10. What is HuggingFace Spaces and why is it better for this project than Streamlit Cloud?

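HuggingFace Spaces hosts Streamlit and Gradio apps and integrates natively with the HuggingFace model hub. Its free tier runs on CPU; GPU hardware such as a T4 is a paid upgrade or available through community grants for eligible apps. BLIP inference on CPU is slow (~5s/image), while on a GPU it runs in <1s, which makes a GPU-backed Space the better fit for vision models.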