Hi, I am Gaurav Pendharkar!
a data scientist driven by a passion for building interpretable, reliable, and deployable ML systems using domain-specific data.
Recent Posts
About Me
My name is Gaurav Pendharkar. I am a data scientist with 1.5 years of experience developing machine learning pipelines for practical applications across various domains, including law, healthcare, earth sciences, and aviation. I have expertise in managing diverse data sources, including structured data (tables), semi-structured data (JSON, XML), and unstructured data (text, images, and PDFs). My focus is on building explainable, reliable machine learning systems through transparent modeling choices and rigorous evaluation in real-world environments.
Experience
- Collaborated with a cross-functional team of five data and soil scientists to design and implement an interpretable ML pipeline for estimating soil pH and soil organic matter, enabling data-driven soil health assessment.
- Developed a binary rocky-terrain classifier based on topographic and vegetation indices to gate downstream regression models; tree-based models improved macro-average recall over the baseline by 30% (from 0.557 to 0.723).
- Orchestrated a GenAI workflow for soil pH regression using chain-of-thought prompting with Gemini 2.5 Flash on Vertex AI Batch Inference; raised the R² score to 0.129, about 4x that of a tree-based baseline.
- Built a multilingual rich text editor to quantify human–AI coauthorship via AI suggestion acceptance rates.
- Engineered FastAPI-based model-serving APIs for three NLP models (GPT-2, IndicTrans, and IndicXlit), powering a web app used by study participants during controlled writing experiments.
- Collected keystroke-level interaction logs; observed a 38% acceptance rate, indicating limited reliance on GPT-2.
- Developed an information extraction pipeline converting 1300+ unstructured Indian court records into structured data, enabling predictive modeling for faculty and researchers.
- Expanded a legal document repository by 205% (from 455 to 1388 documents) through web automation on the Manupatra database.
- Fine-tuned the LAW entity in a generic spaCy NER model, more than doubling its F1-score from 0.40 to 0.83, and combined pattern-based rules with ML-based NER for Indian legal PDFs.
- Reduced manual work by 99.8% (from 3 months to 12 hours) while maintaining approximately 94% accuracy.
Recent Projects
Explainability Driven Chain-of-Thought Prompting
Automated reasoning for CoT prompting using explainability attributes from tree-based models for binary classification on tabular datasets.
Daily Sales Forecasting
Forecasting daily total sales of different gifting items using holiday data, promotional sales data, and other time-series features.
On-time Performance Analysis of NYC domestic flights
On-time performance analysis of domestic flights from NYC airports for the year 2023.
Arrival Delay Prediction for US domestic flights
Multiclass classification of arrival delays for NYC domestic flights using tree-based models.
Illumination Invariant Tiger Detection
Automating tiger detection in the wild by handling illumination issues with EnlightenGAN.
Imbalanced Malware Byteplot Image Classification
Assessing the impact of class imbalance on model performance and convergence for malware byteplot image classification.
My Skills
Education