Q1. Why use SQLite for storage instead of a CSV file?
SQLite supports concurrent reads, ACID transactions, SQL queries for aggregation, and handles large datasets efficiently. CSV requires loading everything into memory for every query.
Q2. How do you prevent the LLM from returning an invalid category?
Validate: if response not in CATEGORIES, default to "Other". Add few-shot examples in the prompt showing correct category selection. Use response_format=json_object for structured output.
Q3. What is the cost of categorising 100 expenses with gpt-4o-mini?
Each categorisation: ~50 input tokens + 5 output tokens. 100 expenses: ~5500 tokens total. At $0.15/1M input: approximately $0.001. Very cheap for personal use.
Q4. How would you make this work without an OpenAI API key?
Use a free local model via Ollama (llama3.2:1b). Or use a rule-based fallback: keyword matching (coffee→Food, uber→Transport). Or HuggingFace zero-shot classification.
Q5. How does your app handle multi-user scenarios?
SQLite is single-writer — fine for personal use. For multi-user: PostgreSQL (Supabase), add user_id column, Supabase auth for per-user data isolation.
Q6. What data visualisations did you implement and why?
Pie chart for category breakdown (proportion), bar chart for monthly comparison (trend), line chart for daily spending (pattern). Each tells a different story about spending behavior.
Q7. How would you add budget tracking?
Budget table in SQLite: category, monthly_limit. Compare actual spending per category to limit. Show traffic light indicator (green/amber/red) based on % of budget used.
Q8. What are the security considerations for storing financial data?
Encrypt the SQLite database (sqlcipher). Never log transaction descriptions to console. Use Supabase RLS if multi-user. Don't include API keys in source code (use environment variables/secrets).
Q9. How would you test this application?
Unit tests for categorise_expense (mock OpenAI API), integration tests for SQLite operations, UI tests with Streamlit's testing framework. Test edge cases: duplicate entries, very large amounts, special characters.
Q10. What ML techniques could replace the LLM for categorisation?
Train a text classifier (TF-IDF + LogReg) on a labelled expense dataset. Rule-based with keyword matching + regex. Embedding similarity to category descriptions. LLM is easiest but costliest.