Available for opportunities

Dhinakar Yalla

Data Engineer with a Master's in Data Science and hands-on experience at Databricks and Fractal Analytics. I build production ETL pipelines, and I have improved Spark job throughput by 55% through Delta Lake optimization and reduced query latency by 40% in a production ML pipeline.

40% · Query latency reduced
55% · Spark throughput gained
2TB+ · Daily data processed
IEEE · Published research

Technical skills

⭐ AWS Certified Data Engineer
Languages
Python, SQL, PySpark, C++, R
Data Engineering
ETL/ELT Pipelines, Delta Lake, Data Modeling, Star Schema, Snowflake Schema, Query Optimization, dbt
Big Data & Orchestration
Apache Spark, Databricks, Apache Airflow, Kafka, Hadoop, Apache Flink
Cloud & Databases
AWS S3, EC2, Redshift, Glue, Snowflake, PostgreSQL, MongoDB, Redis
Analytics & BI
Power BI, Tableau, Pandas, NumPy, Matplotlib
DevOps & ML
Docker, Kubernetes, Git, Terraform, scikit-learn, PyTorch, TensorFlow, MLflow, NLP

Experience

Databricks
Data Engineering Intern
Jan 2025 – Jun 2025
Remote, USA
  • Built Delta Lake-based ELT pipelines processing 2TB+ of daily event data using PySpark and Databricks Workflows, tightening the data freshness SLA from 4 hours to under 45 minutes.
  • Optimized Spark job performance by applying Z-ordering and liquid clustering to Delta tables, reducing BI dashboard query scan time by 55%.
  • Designed a data quality monitoring framework using Great Expectations, integrated into CI/CD pipelines, that caught schema drift and null violations before they reached production.
  • Containerized pipeline jobs with Docker and deployed them to Kubernetes clusters, cutting deployment setup time by 65% and enabling full environment parity across dev, staging, and production.
Fractal Analytics
Data Engineering Intern
Aug 2023 – Feb 2024
Chennai, India
  • Built and maintained ELT pipelines ingesting structured and semi-structured data from 10+ client sources into Snowflake, reducing manual handoff time by 45%.
  • Designed Spark-based batch processing jobs for 500GB+ datasets, improving job completion time by 30% through partition pruning and broadcast join optimization.
  • Engineered 15+ features from raw transactional data, reducing feature computation latency by 20% and improving downstream model quality.
  • Automated pipeline monitoring with Apache Airflow DAGs, cutting mean time to resolution by 40%.
1Stop.ai
Data Science Intern
Feb 2023 – Jul 2023
Remote, India
  • Built Python and SQL ETL pipelines to automate ingestion from multiple data sources, cutting manual processing time by 35%.
  • Integrated AWS S3 and EC2 into pipeline workflows, reducing data transfer overhead and improving end-to-end throughput.
  • Delivered Power BI and Tableau dashboards tracking 5+ KPIs in real time, adopted by business teams for weekly stakeholder reporting.

Projects

01
City Bike Price Prediction

End-to-end ETL and ML pipeline across MySQL, Snowflake, and AWS (S3, EC2). Reduced query latency by 40% through star-schema design and warehouse optimization. Improved prediction accuracy by 25% by catching bad records before model training. Containerized with Docker for one-command deployment.

MySQL, Snowflake, AWS, Docker, ML
View on GitHub
02
AI-Powered Pneumonia Detection

Deep learning inference pipeline using ResNet50 and CLIP for medical image classification. Added Grad-CAM explainability and automated structured report generation for clinical workflows. Deployed via Streamlit for real-time inference, replacing a manual review process.

ResNet50, CLIP, Grad-CAM, Streamlit, PyTorch
View on GitHub
03
NYC Taxi Trip Analytics

Large-scale data engineering pipeline built on NYC's TLC taxi trip dataset. Ingested and processed millions of trip records using PySpark, applied geospatial and temporal feature engineering, and built an analytics layer to surface insights on trip patterns, demand hotspots, and fare trends across NYC boroughs.

PySpark, Python, SQL, AWS, Analytics
View on GitHub

Publications

Cancer Category Classification Using BiLSTM
IEEE ICICIT 2024 — Peer Reviewed & Published
Read on IEEE Xplore

Engineered an NLP pipeline processing 10,000+ clinical records using TF-IDF feature engineering and BiLSTM architectures, achieving an F1-score of 0.998. Benchmarked multiple deep learning models with hyperparameter tuning to maximize classification reliability and robustness. Findings were validated through peer review and published at IEEE ICICIT 2024.

BiLSTM, TF-IDF, NLP, TensorFlow, Clinical NLP, IEEE

Education

MS, Engineering Science — Data Science
University at Buffalo, SUNY
Aug 2024 – Dec 2025 · Buffalo, NY
BTech, Computer Science and Engineering
Karunya Institute of Technology and Sciences
Aug 2020 – May 2024 · India

Get in touch

yalladhinakar@gmail.com