Q1. What is the BLIP model and what was it trained on?
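BLIP (Bootstrapping Language-Image Pre-training) is a vision-language model from Salesforce. It was pre-trained on up to 129M image-text pairs, most of them collected from the web, with noisy web captions filtered out and replaced by synthetic ones (the "bootstrapping" in the name). For caption generation it uses a vision encoder (ViT) plus a text decoder.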
Q2. What is the difference between unconditional and conditional captioning?
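Unconditional: the model generates the caption from scratch, given only the image. Conditional: you also provide a text prefix ("a photo of a dog...") that the model completes. The prefix steers the caption toward a particular phrasing or subject, so conditional captions tend to be more specific.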
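The two modes can be sketched as one helper. This is a minimal sketch, assuming the transformers BLIP interface (a processor called with images= and an optional text=, model.generate, processor.decode). The names caption_image and load_blip are hypothetical, and the checkpoint name is an assumption rather than something stated above.

```python
def caption_image(model, processor, image, prompt=None):
    """Return a caption for `image`.

    prompt=None  -> unconditional: the model writes the whole caption.
    prompt="..." -> conditional: the model continues the given prefix.
    """
    if prompt is None:
        inputs = processor(images=image, return_tensors="pt")
    else:
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def load_blip(name="Salesforce/blip-image-captioning-base"):
    """Load model and processor on demand (assumed checkpoint name).

    Requires the `transformers` and `torch` packages; imported lazily so
    that defining this helper does not trigger a download.
    """
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)
    return model, processor
```

Typical use would be model, processor = load_blip(), then caption_image(model, processor, image) for the unconditional case and caption_image(model, processor, image, prompt="a photo of") for the conditional one.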
Q3. Why use @st.cache_resource for the model?
Loading the model takes roughly 5-10 seconds. Streamlit reruns the script on every user interaction, so without caching the model would be reloaded each time. @st.cache_resource caches the model object globally: it is loaded once and then shared across all reruns and sessions.
Q4. What is PIL and why convert to RGB?
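PIL (the Python Imaging Library, today maintained as the Pillow fork) is the standard Python library for opening and manipulating images. Some images are RGBA (with an alpha channel), grayscale, or palette-based. BLIP expects 3-channel RGB, so calling image.convert("RGB") after loading ensures compatibility regardless of input format.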
Q5. How does the BLIP model work at a high level?
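The vision encoder (ViT) converts the image into a sequence of patch embeddings. The text decoder (a transformer) then generates caption tokens autoregressively, one at a time, attending to the image embeddings through cross-attention and to the previously generated caption tokens through causal self-attention.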
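The autoregressive loop can be sketched with a toy decoder. This is an illustration of greedy token-by-token generation only: toy_next_token is a hypothetical stand-in that follows a fixed script, the "[BOS]" and "[EOS]" markers are illustrative names, and a real decoder would instead score the whole vocabulary at each step.

```python
def greedy_decode(next_token, image_features, bos="[BOS]", eos="[EOS]", max_len=10):
    """Generate tokens one at a time until EOS or the length limit."""
    tokens = [bos]
    while len(tokens) < max_len:
        # Each step sees the image features and everything generated so far.
        token = next_token(image_features, tokens)
        if token == eos:
            break
        tokens.append(token)
    return tokens[1:]  # drop the start marker


def toy_next_token(image_features, tokens):
    """Toy stand-in for the decoder: a fixed script chosen by the image label."""
    script = {"dog": ["a", "dog", "on", "grass"], "cat": ["a", "cat"]}[image_features]
    position = len(tokens) - 1  # number of caption tokens produced so far
    return script[position] if position < len(script) else "[EOS]"


caption = " ".join(greedy_decode(toy_next_token, "dog"))
```

The point of the sketch is the control flow: the next token depends on both the image and the tokens already emitted, and generation stops at an end marker or a length cap.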
Q6. What is ViT (Vision Transformer) and how does it differ from CNNs?
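ViT splits the image into fixed-size patches, embeds each patch, and processes the resulting sequence with a standard Transformer. CNNs use convolutional filters with local receptive fields, so global context only emerges as layers are stacked. In ViT, self-attention lets every patch attend to every other patch from the first layer.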
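The patch-splitting step can be sketched in plain Python on a toy grid. The helper names split_into_patches and num_patches are hypothetical, and a real ViT would follow this step with a learned linear projection and position embeddings, which are omitted here.

```python
def split_into_patches(image, patch):
    """Split a 2-D grid (list of rows) into non-overlapping patch x patch
    blocks, each flattened row by row, in left-to-right, top-to-bottom order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)
    return patches


def num_patches(height, width, patch):
    """Length of the patch sequence the Transformer would see."""
    return (height // patch) * (width // patch)


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
patches = split_into_patches(grid, 2)  # four 2x2 patches
```

As an illustration of scale, a 224x224 image with 16x16 patches gives num_patches(224, 224, 16) = 196 tokens. These sizes are common ViT settings chosen for the example, not a claim about a specific checkpoint.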
Q7. How would you improve caption quality for domain-specific images?
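Fine-tune BLIP on domain-specific image-caption pairs. Collect 1000+ domain images with manually written captions, then fine-tune on a GPU, for example with the HuggingFace Trainer. For a dataset of this size, 1-2 hours of training is often sufficient, though the exact time depends on the hardware, image resolution, and number of epochs.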
Q8. What are the limitations of BLIP-base?
It may miss small details, counts objects unreliably (says "a few birds" rather than "three birds"), struggles with text in images, degrades on images very different from its training distribution, and has weak spatial reasoning.
Q9. How would you evaluate caption quality automatically?
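Use reference-based metrics: BLEU (n-gram precision against reference captions), ROUGE (recall-oriented overlap), CIDEr (consensus across multiple references, with TF-IDF weighting), and METEOR (also matches stems and synonyms). Compute them against human-written reference captions on a held-out test set.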
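The n-gram overlap idea behind BLEU can be sketched as a clipped n-gram precision. This is only one component: full BLEU combines several n-gram orders with a geometric mean and a brevity penalty, which are left out here. The function names and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference,
    counting each n-gram at most as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


reference = "a dog runs on the grass"
candidate = "a dog is on the grass"
unigram = clipped_precision(candidate, reference, n=1)  # 5 of 6 words match
bigram = clipped_precision(candidate, reference, n=2)   # 3 of 5 bigrams match
```

The clipping matters for degenerate captions: a candidate that repeats "the" three times against a reference containing it once is credited for only one match.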
Q10. What is HuggingFace Spaces and why is it better for this project than Streamlit Cloud?
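HuggingFace Spaces can run apps on a GPU (T4): the default free hardware is CPU-only, but GPU upgrades and community grants for eligible apps are available. BLIP inference on CPU is slow (roughly 5 s per image), while on a GPU it runs in under 1 s. Spaces also integrates natively with the HuggingFace model hub, which makes it a better fit for vision models.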