Resume Parser & Scorer

what you build

A tool that automatically extracts contact info, skills, experience, and education from PDF resumes and scores them for a given role.

Extract text from PDF resumes using PyMuPDF
Parse sections: contact info, skills, work experience, education
Score against a customisable skills list (0-100)
Visual profile card per candidate with skill match percentage
Batch processing: upload multiple resumes, get ranked leaderboard

prerequisites

Before You Start

Python basics (regex, file handling)
Basic NLP understanding (tokenisation, entity extraction)
Streamlit familiarity helpful

architecture

How It Works

📄 PDF Upload → 🔍 Text Extraction (PyMuPDF) → 🧠 NLP Parsing (regex + spaCy NER) → ⭐ Skill Scoring → 📊 Profile Cards

3-week plan

Milestone Breakdown

Week 1

PDF Extraction + Section Parsing

Extract text from PDF using fitz.open()
Use regex to find email, phone number, LinkedIn URL
Identify section headers (SKILLS, EXPERIENCE, EDUCATION)
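The Week 1 steps after text extraction could be sketched roughly as follows. The patterns and the helper names (extract_contact, split_sections) are illustrative assumptions, not a fixed design. Real resumes vary a lot, so expect to tune the regexes against your own samples.

```python
import re

# Illustrative patterns only; treat them as starting points.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")
LINKEDIN_RE = re.compile(r"(?:https?://)?(?:www\.)?linkedin\.com/in/[\w-]+", re.I)
# A header is a line that contains only one of the known section names.
HEADER_RE = re.compile(
    r"^[ \t]*(SKILLS|EXPERIENCE|EDUCATION)[ \t]*:?[ \t]*$", re.I | re.M
)


def first_match(pattern, text):
    m = pattern.search(text)
    return m.group(0).strip() if m else ""


def extract_phone(text):
    # Keep only candidates with 10-15 digits so year ranges are skipped.
    for m in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group(0))
        if 10 <= len(digits) <= 15:
            return m.group(0).strip()
    return ""


def extract_contact(text):
    return {
        "email": first_match(EMAIL_RE, text),
        "phone": extract_phone(text),
        "linkedin": first_match(LINKEDIN_RE, text),
    }


def split_sections(text):
    """Map each recognised header to the text between it and the next header."""
    matches = list(HEADER_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).upper()] = text[m.end():end].strip()
    return sections
```

Resumes that use other header wording ("Work History", "Technical Skills") would need those names added to HEADER_RE.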

Week 2

Skill Matching + Scoring

Build a skills vocabulary list (Python, SQL, etc.)
Match skills using keyword search + fuzzy matching
Calculate score: matched_skills / required_skills * 100
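A minimal sketch of the matching and scoring steps above. To stay dependency-free it uses difflib from the standard library as a stand-in for a dedicated fuzzy-matching package. The 0.8 threshold, the example skills list, and the function names are assumptions you would tune.

```python
import re
from difflib import SequenceMatcher

REQUIRED_SKILLS = ["python", "sql", "tensorflow", "machine learning"]  # example list


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_skills(text, required, threshold=0.8):
    """Return the required skills found exactly or as a close, typo-level variant."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    matched = []
    for skill in required:
        # Compare a two-word skill against two-word windows, and so on.
        candidates = ngrams(tokens, len(skill.split()))
        if any(SequenceMatcher(None, skill, c).ratio() >= threshold
               for c in candidates):
            matched.append(skill)
    return matched


def score(matched, required):
    """matched_skills / required_skills * 100, rounded to an integer."""
    return round(len(matched) / len(required) * 100) if required else 0
```

A lower threshold catches more typos but also more false matches, so it is worth checking against a handful of real resumes.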

Week 3

Streamlit UI + Deploy

Build profile card UI for each parsed resume
Add batch upload with ranked leaderboard
Deploy to Streamlit Cloud
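The ranking behind the leaderboard can be kept separate from the UI. This is a sketch that assumes each parsed resume is a dict with a "score" plus a hypothetical "name" key (for example the uploaded file name).

```python
def rank_candidates(results):
    """Sort parsed resumes by score (highest first) and attach a 1-based rank.

    Ties are broken alphabetically by name so the order is stable.
    """
    ordered = sorted(results, key=lambda r: (-r["score"], r["name"]))
    return [{"rank": i, **r} for i, r in enumerate(ordered, start=1)]
```

In the Streamlit app, the files returned by st.file_uploader with accept_multiple_files=True could each be parsed, and the ranked list passed to st.dataframe to render the leaderboard.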

key code

Core Implementation

parser.py

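import re

import fitz  # PyMuPDF

SKILLS_LIST = ["python", "sql", "machine learning", "tensorflow",
               "pandas", "git", "docker", "javascript"]

def parse_resume(pdf_path: str) -> dict:
    # Read every page; the context manager closes the document afterwards
    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    email = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
    lowered = text.lower()
    # Whole-word match so "git" is not found inside "digital"
    skills = [s for s in SKILLS_LIST
              if re.search(rf'\b{re.escape(s)}\b', lowered)]
    # Score = share of the skills list present in the resume (0-100)
    score = round(len(skills) / len(SKILLS_LIST) * 100)
    return {"email": email[0] if email else "",
            "skills": skills, "score": score}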

deployment

Deploy to Streamlit Cloud (Free)

deploy.sh

# requirements.txt: pymupdf streamlit pandas
# share.streamlit.io → New app → your repo → Deploy

Create a free account at share.streamlit.io, connect your GitHub repo, and deploy from the dashboard. Once the build finishes, the app is available at a public URL.

viva prep

10 Viva Questions with Answers

Q1. What is information extraction in NLP?

Automatically identifying and structuring specific information from unstructured text — names, dates, skills, organisations from resume text.

Q2. Why use regex for parsing rather than an LLM?

Regex is deterministic, fast, and free. Email and phone patterns are highly predictable. Use LLMs for complex/ambiguous extraction where rules are insufficient.

Q3. What is Named Entity Recognition (NER)?

NER is the NLP task of identifying named entities such as PERSON (names), ORG (companies), DATE (dates and durations), and locations (labelled GPE and LOC in spaCy's English models). spaCy provides pre-trained NER models, which are useful for extracting employer names and job durations.

Q4. How would you handle resumes in different formats (tables, columns)?

For tables and multi-column layouts, pdfplumber can give better results than plain PyMuPDF text extraction. For scanned, image-only resumes, run OCR with Tesseract. For consistent corporate formats, use template matching.

Q5. What is fuzzy matching and when would you use it?

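Fuzzy matching (rapidfuzz, or fuzzywuzzy, now maintained as thefuzz) scores how similar two strings are, so a typo like "Pyhton" or a spacing variant like "Tensor Flow" still matches the intended skill. Use it to reduce missed skills from typos and spelling variants. Abbreviations such as "ML" for "Machine Learning" or "JS" for "JavaScript" share too few characters with their expansions to match reliably, so pair fuzzy matching with an alias table.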
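Abbreviations tend to score poorly on character similarity against their full forms, so one option is to expand them with an explicit lookup before any matching runs. The alias table and the function name below are hypothetical and only show the idea.

```python
import re

# Hypothetical alias table; extend it with abbreviations seen in your own data.
ALIASES = {"ml": "machine learning", "js": "javascript", "tf": "tensorflow"}


def expand_aliases(text):
    """Lower-case the text and replace whole-token abbreviations with full skill names."""
    def replace(match):
        word = match.group(0)
        return ALIASES.get(word, word)

    # Each token is looked up as a whole, so "html" is not touched by the "ml" alias.
    return re.sub(r"[a-z0-9+#]+", replace, text.lower())
```

The expanded text can then go through keyword or fuzzy matching as usual.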

Q6. How do you handle false positive skill matches?

Use word-boundary matching (e.g. \bgit\b) so that "git" does not match inside "digital". Also maintain a stop-list of phrases that contain a skill name but do not indicate the skill, such as "python-like syntax".
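To see why plain substring checks over-match, here is a small sketch of whole-word skill matching. The function name is an assumption. It uses lookarounds rather than \b so that skills ending in symbols, such as "c++", also work.

```python
import re


def find_skills(text, skills):
    """Return the skills that appear in text as whole words, ignoring case."""
    found = []
    for skill in skills:
        # No word character may sit directly before or after the skill name.
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(skill)
    return found


text = "Digital marketing lead who wrote Python scripts."
naive = [s for s in ["python", "git"] if s in text.lower()]  # "git" hits "digital"
strict = find_skills(text, ["python", "git"])                # only "python"
```

This does not filter phrases like "python-like syntax", which still contain the whole word; that case needs the stop-list.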

Q7. What privacy considerations apply to resume parsing?

Resumes contain PII (names, addresses, dates of birth). Under GDPR/PDPA: inform candidates, don't store longer than necessary, implement deletion rights, encrypt stored data.

Q8. How would you improve the scoring algorithm?

Weight skills by importance (senior skills score more), consider years of experience, factor in education relevance, use TF-IDF to weight skill frequency.
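The first idea, weighting skills by importance, could look like this. The weights are made-up values for illustration; in practice they would come from the role description.

```python
# Hypothetical weights: higher numbers mark skills that matter more for the role.
SKILL_WEIGHTS = {"python": 3, "sql": 2, "docker": 1}


def weighted_score(matched, weights):
    """Score 0-100 as the share of total weight covered by the matched skills."""
    total = sum(weights.values())
    if total == 0:
        return 0
    earned = sum(w for skill, w in weights.items() if skill in matched)
    return round(earned / total * 100)
```

With these weights, a resume that matches only "python" scores 50, while one that matches only "docker" scores 17.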

Q9. How would you handle multi-page resumes?

PyMuPDF iterates all pages with for page in doc: text += page.get_text(). Multi-page resumes are handled automatically — no special case needed.

Q10. What extension would you add next?

Integrate with a job board API (LinkedIn, Naukri) to automatically match parsed resumes against live job postings. Add LLM-based scoring for semantic matching that goes beyond keywords.