Q1. What is information extraction in NLP?
Automatically identifying and structuring specific information from unstructured text — names, dates, skills, organisations from resume text.
Q2. Why use regex for parsing rather than an LLM?
Regex is deterministic, fast, and free. Email and phone patterns are highly predictable. Use LLMs for complex/ambiguous extraction where rules are insufficient.
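As a rough illustration of the regex approach, the sketch below pulls emails and phone numbers out of a line of resume text. The patterns, the `extract_contacts` name, and the sample string are assumptions made for this example; they are deliberately simple and will not cover every real-world format.

```python
import re

# Simplified patterns for illustration only; real emails and phone numbers vary more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional leading "+", then at least nine characters of digits and common separators,
# starting and ending on a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text: str) -> dict:
    """Return every email and phone-like string found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [m.strip() for m in PHONE_RE.findall(text)],
    }


sample = "Jane Doe | jane.doe@example.com | +1 (555) 010-2030 | Python developer since 2019"
contacts = extract_contacts(sample)
print(contacts["emails"])
print(contacts["phones"])
```

The same input always gives the same output, which is the determinism the answer refers to. The year "2019" is too short to satisfy the phone pattern, so it is not picked up.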
Q3. What is Named Entity Recognition (NER)?
An NLP task that finds and labels named entities in text: PERSON (names), ORG (companies), DATE (dates and years), GPE/LOC (places). spaCy provides pre-trained NER models. Useful for extracting employer names and the dates from which job durations can be worked out.
Q4. How would you handle resumes in different formats (tables, columns)?
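Try pdfplumber instead of PyMuPDF when layout matters: it exposes word positions and table extraction, which helps with tables and multi-column pages. For scanned or image-based resumes, or pages whose embedded text order is unusable, fall back to OCR with Tesseract. For consistent corporate formats, use template matching.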
Q5. What is fuzzy matching and when would you use it?
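Fuzzy matching (fuzzywuzzy/rapidfuzz) scores how similar two strings are, so near-misses still match: "Javascrpt" matches "JavaScript", "Tensor Flow" matches "TensorFlow". Reduces missed skills from typos and spelling variants. Abbreviations such as "ML" or "JS" share too few characters with "Machine Learning" or "JavaScript" to score well, so handle those with an alias dictionary.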
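A minimal sketch of similarity-based skill matching. It uses the standard library's `difflib` as a stand-in for fuzzywuzzy/rapidfuzz so that it runs without extra installs. The skill list, the alias table, the `normalise_skill` name, and the 0.8 cutoff are all assumptions for this example. Short abbreviations are looked up in the alias table first, because similarity scores between "ML" and "machine learning" are low.

```python
from difflib import get_close_matches

# Hypothetical reference data for the example.
KNOWN_SKILLS = ["javascript", "tensorflow", "machine learning", "postgresql"]
ALIASES = {"ml": "machine learning", "js": "javascript"}


def normalise_skill(raw: str, cutoff: float = 0.8):
    """Map a raw token to a known skill, or return None if nothing is close enough."""
    token = raw.strip().lower()
    # Abbreviations: exact lookup, since string similarity does not help here.
    if token in ALIASES:
        return ALIASES[token]
    # Typos and spelling variants: closest known skill at or above the cutoff.
    matches = get_close_matches(token, KNOWN_SKILLS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["Javascrpt", "ML", "Tensorflow ", "cooking"]:
    print(repr(raw), "->", normalise_skill(raw))
```

A lower cutoff catches more misspellings but also raises the risk of false matches, so the threshold is worth tuning on real resumes.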
Q6. How do you handle false positive skill matches?
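Use word boundary matching (\bjava\b) so that "java" does not match inside "javascript". Maintain a stop-list for phrases that contain a skill name as a whole word but aren't skill claims ("python-like syntax"); word boundaries alone do not filter those, because the hyphen counts as a boundary.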
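The two techniques can be combined as in the sketch below. The skill list, the stop phrases, and the `find_skills` name are assumptions for this example, not part of any existing parser.

```python
import re

# Hypothetical data for the example.
SKILLS = ["python", "java"]
STOP_PHRASES = ["python-like"]


def find_skills(text: str) -> set:
    """Return the skills that appear as whole words once stop phrases are removed."""
    cleaned = text.lower()
    # Blank out phrases that mention a skill without claiming it.
    for phrase in STOP_PHRASES:
        cleaned = cleaned.replace(phrase, " ")
    # \b requires a word boundary on both sides, so "java" will not match "javascript".
    return {s for s in SKILLS if re.search(rf"\b{re.escape(s)}\b", cleaned)}


print(find_skills("Wrote JavaScript with a Python-like syntax"))
print(find_skills("Built Java and Python services"))
```

One caveat: \b only works between a word character and a non-word character, so skills that end in symbols, such as "C++" or "C#", need a different pattern.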
Q7. What privacy considerations apply to resume parsing?
Resumes contain PII (names, addresses, dates of birth). Under GDPR/PDPA: inform candidates, don't store longer than necessary, implement deletion rights, encrypt stored data.
Q8. How would you improve the scoring algorithm?
Weight skills by importance (core or senior-level skills score more), consider years of experience, factor in education relevance, and use TF-IDF so that rare, distinctive skills count for more than ones that appear on almost every resume.
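A minimal sketch of the first idea only, weighting skills by importance. The `score_resume` name and the weights are assumptions for this example; experience, education, and TF-IDF would be extra terms on top of this.

```python
def score_resume(resume_skills: set, job_weights: dict) -> float:
    """Return the share of the job's total skill weight that the resume covers (0.0 to 1.0)."""
    total = sum(job_weights.values())
    if total == 0:
        return 0.0
    matched = sum(weight for skill, weight in job_weights.items() if skill in resume_skills)
    return matched / total


# Hypothetical weights: a higher number means the skill matters more for the role.
job_weights = {"python": 3.0, "sql": 2.0, "docker": 1.0}

print(score_resume({"python", "docker"}, job_weights))
print(score_resume({"sql", "docker"}, job_weights))
```

Both candidates match two of three skills, but the one with the heavily weighted skill scores higher, which a plain count of matches would not show.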
Q9. How would you handle multi-page resumes?
PyMuPDF iterates all pages with for page in doc: text += page.get_text(). Multi-page resumes are handled automatically — no special case needed.
Q10. What extension would you add next?
Integrate with a job board API (LinkedIn, Naukri) to automatically match parsed resumes against live job postings. Add LLM-based scoring so that matching is semantic and not limited to keywords.