Mini Project · ~2–3 Weeks · Beginner Friendly

Resume Parser & Scorer

Extract structured data from PDF resumes using PyMuPDF and regex. Score resumes for specific skills and visualise candidate profiles.

Python · PyMuPDF · spaCy · Streamlit · pandas · re
Includes: source code + README · milestone breakdown · deployment guide · 10 viva Q&A

A tool that automatically extracts contact info, skills, experience, and education from PDF resumes and scores them for a given role.

  • Extract text from PDF resumes using PyMuPDF
  • Parse sections: contact info, skills, work experience, education
  • Score against a customisable skills list (0-100)
  • Visual profile card per candidate with skill match percentage
  • Batch processing: upload multiple resumes, get ranked leaderboard

Before You Start

  • Python basics (regex, file handling)
  • Basic NLP understanding (tokenisation, entity extraction)
  • Streamlit familiarity helpful

How It Works

📄 PDF Upload → 🔍 Text Extraction (PyMuPDF) → 🧠 NLP Parsing (regex + spaCy NER) → ⭐ Skill Scoring → 📊 Profile Cards

Milestone Breakdown

Week 1: PDF Extraction + Section Parsing
  • Extract text from PDF using fitz.open()
  • Use regex to find email, phone number, LinkedIn URL
  • Identify section headers (SKILLS, EXPERIENCE, EDUCATION)
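
The Week 1 steps can be sketched with the standard `re` module alone. This is a minimal sketch, not a definitive parser: the patterns, the helper names (`first_match`, `extract_contact`, `split_sections`), and the assumption that each section header sits on its own line are all illustrative and will need tuning for real resumes.

```python
import re

# Hypothetical patterns; adjust them to the resumes you actually receive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{8,}\d")
LINKEDIN_RE = re.compile(r"(?:https?://)?(?:www\.)?linkedin\.com/in/[\w-]+", re.I)
# Assumption: a header is a line holding only one of these words (optional colon).
HEADER_RE = re.compile(r"^\s*(skills|experience|education)\s*:?\s*$", re.I)


def first_match(pattern, text):
    """Return the first match of pattern in text, or an empty string."""
    m = pattern.search(text)
    return m.group(0).strip() if m else ""


def extract_contact(text):
    """Pull the first email, phone number, and LinkedIn URL out of the text."""
    return {
        "email": first_match(EMAIL_RE, text),
        "phone": first_match(PHONE_RE, text),
        "linkedin": first_match(LINKEDIN_RE, text),
    }


def split_sections(text):
    """Group non-empty lines under the most recent section header."""
    sections = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1).lower()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections
```

Lines that appear before the first recognised header (typically the name and contact block) are ignored by `split_sections`; `extract_contact` covers them by searching the full text.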
Week 2: Skill Matching + Scoring
  • Build a skills vocabulary list (Python, SQL, etc.)
  • Match skills using keyword search + fuzzy matching
  • Calculate score: matched_skills / required_skills * 100
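
The Week 2 matching and scoring steps might look like the sketch below. To stay dependency-free it swaps in `difflib.SequenceMatcher` from the standard library in place of a dedicated fuzzy-matching package; the `REQUIRED_SKILLS` list, the function names, and the 0.8 similarity threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical role profile; replace with the skills the role requires.
REQUIRED_SKILLS = ["python", "sql", "tensorflow", "docker"]


def tokens_and_bigrams(text):
    """Lowercased words plus adjacent word pairs, so "tensor flow" is a candidate."""
    words = re.findall(r"[a-z0-9+#]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def is_match(skill, candidates, threshold=0.8):
    """True if any candidate is at least `threshold` similar to the skill."""
    return any(
        SequenceMatcher(None, skill, cand).ratio() >= threshold
        for cand in candidates
    )


def score_resume(text, required=REQUIRED_SKILLS):
    """Return (matched skills, score) with score = matched / required * 100."""
    candidates = tokens_and_bigrams(text)
    matched = [s for s in required if is_match(s, candidates)]
    score = round(len(matched) / len(required) * 100)
    return matched, score
```

A lower threshold catches more typos but also more false positives, especially for short skill names such as "sql", so it is worth checking the threshold against a handful of real resumes.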
Week 3: Streamlit UI + Deploy
  • Build profile card UI for each parsed resume
  • Add batch upload with ranked leaderboard
  • Deploy to Streamlit Cloud
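
The ranked leaderboard reduces to sorting the parsed results by score. The sketch below is plain Python so it can be tested without the UI; the `file` key (the uploaded file name) and the function name `build_leaderboard` are hypothetical additions, while `skills` and `score` follow the dictionary returned by the parser. In a Streamlit app the returned rows could be shown with `st.dataframe`.

```python
def build_leaderboard(parsed_resumes):
    """Rank parsed resumes by score, highest first; ties are broken by file name."""
    ordered = sorted(parsed_resumes, key=lambda r: (-r["score"], r["file"]))
    return [
        {
            "rank": rank,
            "file": r["file"],
            "score": r["score"],
            "skills": ", ".join(r["skills"]),
        }
        for rank, r in enumerate(ordered, start=1)
    ]
```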

Core Implementation

parser.py
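import re

import fitz  # PyMuPDF

SKILLS_LIST = ["python", "sql", "machine learning", "tensorflow",
               "pandas", "git", "docker", "javascript"]

def parse_resume(pdf_path: str) -> dict:
    # Concatenate the text of every page; the context manager closes the file.
    with fitz.open(pdf_path) as doc:
        text = "".join(page.get_text() for page in doc)
    email = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
    # Match on word boundaries so "git" is not found inside "digital".
    lowered = text.lower()
    skills = [s for s in SKILLS_LIST
              if re.search(rf'\b{re.escape(s)}\b', lowered)]
    # Score = percentage of SKILLS_LIST found in the resume (0-100).
    score = round(len(skills) / len(SKILLS_LIST) * 100)
    return {"email": email[0] if email else "",
            "skills": skills, "score": score}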

Deploy to Streamlit Cloud (Free)

deploy.sh
# requirements.txt, one package per line: pymupdf, streamlit, pandas (plus spacy and rapidfuzz if your code imports them)
# share.streamlit.io → New app → your repo → Deploy

Create a free account at share.streamlit.io, connect your GitHub repo, and deploy in one click. Once the build finishes, your app gets a public URL.

10 Viva Questions with Answers

Q1. What is information extraction in NLP?
Automatically identifying specific pieces of information in unstructured text and turning them into structured data, for example names, dates, skills, and organisations from resume text.
Q2. Why use regex for parsing rather than an LLM?
Regex is deterministic, fast, and free. Email and phone patterns are highly predictable. Use LLMs for complex/ambiguous extraction where rules are insufficient.
Q3. What is Named Entity Recognition (NER)?
NER is the NLP task of identifying named entities in text, such as PERSON (names), ORG (companies), DATE (years), and GPE/LOC (locations). spaCy provides pre-trained NER models, which are useful for extracting employer names and job durations.
Q4. How would you handle resumes in different formats (tables, columns)?
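For multi-column or table layouts, extract text together with its position, for example with PyMuPDF's page.get_text("blocks") or pdfplumber's table extraction, and rebuild the reading order from the coordinates. For scanned, image-only PDFs, run OCR with Tesseract. For consistent corporate formats, use template matching.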
Q5. What is fuzzy matching and when would you use it?
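Fuzzy matching (rapidfuzz, or thefuzz, formerly fuzzywuzzy) scores string similarity, so it catches typos and spelling variants: "Javascrpt" matches "JavaScript", "Tensor Flow" matches "TensorFlow". It does not reliably resolve abbreviations such as "ML" or "JS"; map those with an alias dictionary ("ml" → "machine learning"). Together, the two reduce missed skills.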
Q6. How do you handle false positive skill matches?
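Use word-boundary matching (\bgit\b) so a skill is not matched inside a longer word such as "digital". Also maintain a stop-list of phrases that contain a skill name without indicating the skill, such as "python-like syntax"; a word boundary alone does not exclude these, because the hyphen still counts as a boundary.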
Q7. What privacy considerations apply to resume parsing?
Resumes contain PII (names, addresses, dates of birth). Under GDPR/PDPA: inform candidates, don't store longer than necessary, implement deletion rights, encrypt stored data.
Q8. How would you improve the scoring algorithm?
Weight skills by importance (senior skills score more), consider years of experience, factor in education relevance, use TF-IDF to weight skill frequency.
Q9. How would you handle multi-page resumes?
PyMuPDF iterates all pages with for page in doc: text += page.get_text(). Multi-page resumes are handled automatically — no special case needed.
Q10. What extension would you add next?
Integrate with a job board API (LinkedIn, Naukri), where access is available, to automatically match parsed resumes against live job postings. Add LLM-based scoring for semantic matching beyond keyword matching.
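
The word-boundary idea from Q6 and the weighting idea from Q8 can be combined in a short sketch. The weights in `SKILL_WEIGHTS` and the function names are hypothetical; the lookarounds `(?<!\w)` and `(?!\w)` are used in place of `\b` so that skills ending in symbols, such as "c++", can still be matched.

```python
import re

# Hypothetical weights: a higher number means the skill matters more for the role.
SKILL_WEIGHTS = {"python": 3, "sql": 2, "git": 1, "docker": 4}


def has_skill(skill, text):
    """True if the skill appears as a whole word or phrase, ignoring case."""
    pattern = rf"(?<!\w){re.escape(skill)}(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def weighted_score(text, weights=SKILL_WEIGHTS):
    """Return (matched skills, score), where score is the share of total weight earned."""
    matched = [s for s in weights if has_skill(s, text)]
    earned = sum(weights[s] for s in matched)
    return matched, round(earned / sum(weights.values()) * 100)
```

With the text "Led digital marketing analytics using Python and SQL.", a plain substring check would count "git" because it appears inside "digital"; the whole-word check does not, so only "python" and "sql" contribute to the score.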