Data Engineer with a Master's in Data Science. I build production ETL pipelines, and I have improved Spark job throughput by 55% through Delta Lake optimization and reduced query latency by 40% in a production ML pipeline, with hands-on experience at Databricks and Fractal Analytics.
End-to-end ETL and ML pipeline across MySQL, Snowflake, and AWS (S3, EC2). Reduced query latency by 40% through star-schema design and warehouse optimization. Improved prediction accuracy by 25% by catching bad records before model training. Containerized with Docker for one-command deployment.
Deep learning inference pipeline using ResNet50 and CLIP for medical image classification. Added Grad-CAM explainability and automated structured report generation for clinical workflows. Deployed via Streamlit for real-time inference, replacing a manual review process.
Large-scale data engineering pipeline built on NYC's TLC taxi trip dataset. Ingested and processed millions of trip records using PySpark, applied geospatial and temporal feature engineering, and built an analytics layer to surface insights on trip patterns, demand hotspots, and fare trends across NYC boroughs.
Engineered an NLP pipeline processing 10,000+ clinical records using TF-IDF feature engineering and BiLSTM architectures, achieving a 0.998 F1-score. Benchmarked multiple deep learning models with rigorous hyperparameter tuning to maximize classification reliability and robustness. Findings validated through peer review and published at IEEE ICICIT 2024.