Hi, I'm Gaurav Thorat.

A self-driven, quick-starting, passionate programmer with a curious mind who enjoys solving complex and challenging real-world problems.

About

I am a Data Scientist with a Master of Science in Data Science from the University of Texas at Arlington and a Bachelor of Engineering in Computer Engineering from Mumbai University.
With over 5 years of experience in the IT industry, I have honed my skills in data analysis, machine learning, natural language processing, and cybersecurity. I have a proven track record of leading and executing complex projects across various domains, including healthcare, finance, and cybersecurity.
My technical expertise spans a wide range of tools and technologies, including Python, R, Java, SQL, TensorFlow, Hadoop, Tableau, Power BI, and cloud platforms like AWS, Azure, and GCP. I am passionate about leveraging data to drive decision-making and innovation.
My strong analytical skills, coupled with my ability to work effectively in team settings or independently, enable me to deliver impactful solutions to complex problems.

  • Languages: Python, R, SQL, Java, PHP, SAS, JavaScript, C++, Go, Swift
  • Databases: MySQL, Oracle, MongoDB, MS Access, MS SQL Server, PostgreSQL, NoSQL, DynamoDB
  • Cloud and Big Data: Amazon Web Services (SageMaker, Lambda, EC2, Glue, Redshift), Microsoft Azure (Databricks, ML Service, HDInsight), Google Cloud Platform (Colab, TensorFlow, BigQuery, AutoML, Vertex AI, Dataflow, Dataproc), Spark, Scala, Snowflake, LangChain, Hadoop, Kafka, MapReduce, Blockchain
  • Frameworks: Generative AI, Large Language Models (LLM), Deep Learning, Neural Networks, Flask, Django, Tableau, Keras, TensorFlow, PyTorch, Bootstrap, Power BI, ELK Stack, MLflow, Prometheus, Grafana, Git/GitHub, Docker, Kubernetes, Terraform, Jenkins, CI/CD Pipelines
  • Applications: Agile Methodologies, Exploratory Data Analysis, Web Scraping, Predictive Modelling, Quantitative Analysis, Deep Learning, AI (Artificial Intelligence), Data Mining, Big Data Analysis, Project Management, Predictive Analytics, GIS

I am looking for a challenging position that combines my skills in Machine Learning and Data Science and offers professional development, interesting experiences, and personal growth.

Experience

Data Scientist - Machine Learning Engineer
  • Building on my previous work with Bank of America and UTA, I was given the opportunity to work on two further projects in the cybersecurity domain, Analyzing CTI Reports and Deepfake Technology, where I applied Python, TensorFlow, and Natural Language Processing techniques to analyze open-source LLM applications, resulting in increased model accuracy and efficiency.
  • For the first project, Analyzing CTI Reports, we used NLP, Data Analysis, Python, TRAM, FACTOR, Model Evaluation, and Quality and Risk Assessment to determine the health and authenticity of these reports and to develop a better detector for them.
  • The main purpose of this project is to determine whether LLM text generators are feeding information into these reports and distorting the truth, since many such cases have been reported in the news over the last few months. To this end, I led the analysis of various CTI reports using Natural Language Processing (NLP) techniques in Python, along with frameworks like TRAM by MITRE, to assess report authenticity.
  • The TRAM model, last updated last year, performed at an accuracy of 40% and appears to underperform on current CTI reports containing LLM-generated text. I then employed the FACTOR framework for improved accuracy, achieving a current rate of 55%, and am actively fine-tuning the model to enhance its performance further. Increased model accuracy by 15% through proxy-tuning and token synthesizing, contributing to more secure and informed decision-making processes.
  • As an upgrade, I implemented the Mistral AI model and the TinyLlama model with the 7B-parameter configuration and trained them on CTI reports. They performed better at detecting and authenticating CTI reports, with an accuracy of 68%, which is a bit lower than that of FACTOR but with very few false positives and negatives.
  • For the second project, Deepfake Technology, we used Machine Learning, Facial Recognition, Deep Learning, Python, Video Analysis, Model Fine-Tuning, Prototyping, and Risk Mitigation to analyze face detection models over video feeds.
  • Collaborated with a team to develop a software framework for detecting impostors in video meetings using Deep Learning, Machine Learning, and Facial Recognition techniques.
  • Built a model that identifies impostors from meeting screenshots using Python, OpenCV, and FaceNet with an accuracy of 95%, laying the groundwork for real-time video feed analysis.
  • Planning to apply the detection mechanism to live video feeds using the MTCNN and InsightFace libraries, which support 2-D and 3-D face analysis and recognition, aiming for a more comprehensive security solution. The goal is to enhance security in video meetings by identifying deepfake impostors, protecting sensitive information and increasing user trust in video communication platforms.
  • Skills: Large Language Models (LLM), Cyber Security, CTI Reports, Deepfake Technology, Python, TensorFlow, Natural Language Processing (NLP), Data Analysis, Model Evaluation, Quality and Risk Assessment, Machine Learning, Facial Recognition, Deep Learning, Video Analysis, Model Fine-Tuning, Prototyping, Risk Mitigation, TRAM framework, FACTOR framework, OpenCV library, FaceNet library, MTCNN library, InsightFace library.
Jan 2024 - Present | Remote, United States
Data Scientist
  • Worked on an innovative research project on Large Language Models (LLM) at the University of Texas at Arlington, in collaboration with Bank of America. Utilized Python, TensorFlow, and Natural Language Processing techniques to develop and analyze LLM detectors and text generators, resulting in increased model accuracy and efficiency.
  • Created a comprehensive dataset from scratch using GANs and VAEs algorithms, integrating human and machine-generated text using data collection and advanced preprocessing methods. This foundational work led to a 25% improvement in the training efficiency of Machine Learning models and a 20% enhancement in data quality.
  • Developed a new LLM text generator and detector from scratch with an exceptional accuracy of 86% for the beta version, using LangChain and leveraging machine learning algorithms and Model Optimization techniques. This achievement outperformed existing market benchmarks by 15% and significantly advanced the field of AI-text analysis.
  • Conducted thorough A/B testing analyses between LLM text generators and detectors and available LLMs like Llama and Dolly, employing Statistical Analysis and machine learning metrics. This rigorous evaluation provided a deeper understanding of their performance, influencing the development of more effective LLM applications.
  • Focused on the details of human vs. AI-generated content, utilizing Sentiment Analysis and content classification methods. This comprehensive analysis shed light on subtle differences, enhancing the interpretability of LLM outputs by 35% and contributing to more authentic content generation.
  • As a result of this capstone project, I was able to contribute to the research in LLM Development and Generative AI. This work contributed to my expertise in the field and played a role in encouraging wider adoption of LLM technologies for text analysis applications.
  • Currently implementing Perplexity Analysis to identify the false positives of the previous model, building an improved version through fine-tuning and proxy-tuning for higher accuracy, and creating a more comprehensive dataset for the new models.
  • Skills: Large Language Models (LLM), Python, TensorFlow, Natural Language Processing (NLP), Data Collection, Data Preprocessing, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Machine Learning, Model Optimization, LangChain, A/B Testing, Statistical Analysis, Sentiment Analysis, Content Classification, LLM Development, Generative AI, Perplexity Analysis, Fine-tuning, Proxy-tuning, Retrieval Augmented Generation (RAG).
Sept 2023 - Present | Remote, United States
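The Perplexity Analysis mentioned above can be illustrated with a minimal sketch. It assumes per-token log-probabilities have already been obtained from some scoring language model (not shown here). The function names and the threshold of 20.0 are hypothetical and are not taken from the project itself. The underlying idea, commonly used in AI-text detection, is that text a language model finds very predictable (low perplexity) is more likely to be machine-generated.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def flag_as_machine_generated(token_log_probs, threshold=20.0):
    # Hypothetical rule: low perplexity means the text was highly predictable
    # to the scoring model. The threshold is an assumption for illustration.
    return perplexity(token_log_probs) < threshold


# Made-up log-probabilities, for illustration only
predictable = [math.log(0.5)] * 8    # every token had probability 0.5
surprising = [math.log(0.01)] * 8    # every token had probability 0.01

print(flag_as_machine_generated(predictable))
print(flag_as_machine_generated(surprising))
```

Inspecting the perplexity of texts that a detector wrongly flagged is one way to look for the false positives described above.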
Data Scientist
  • Developed and maintained a Java and Android Studio application with Keras integration for a wearable smartwatch, increasing user engagement through enhanced health data tracking.
  • Implemented dynamic Python scripts for data processing, and utilized R and Tableau for visualization, creating dynamic dashboards that improved health monitoring efficiency by 20%.
  • Automated data integration using AutoML and synchronization between the wearable device and Azure dashboard, ensuring seamless access to health metrics and reducing manual data handling by 43%.
  • Leveraged Google Cloud for additional data collection and established direct integration with Azure Cloud, centralizing data storage and analysis, leading to a significant reduction in data retrieval times.
  • Conducted Statistical Analyses using Python and R, applying Regression Analysis, Machine Learning, and Predictive Modeling to derive actionable insights, enhancing data-driven decision-making by 35%.
  • Collaborated with a research team to design software applications and tools tailored to specific research needs, focusing on user-centric design, potentially increasing user satisfaction by 20%.
  • Achieved a streamlined and efficient application for comprehensive health data tracking, providing users with a single platform for all health-related information on Azure, resulting in increased overall system reliability.
  • Skills: Java, Android Studio, Kotlin, Keras, Python, Data Processing, R, Tableau, Data Visualization, AutoML, Azure Cloud, Machine Learning, Predictive Modeling, Statistical Analysis, Regression Analysis, Software Development, User-Centric Design.
Sept 2022 - Aug 2023 | Fort Worth, Texas
Lead Data Scientist
  • Over a tenure of more than two years, I contributed to two distinct projects for a Denmark-based client, leveraging a diverse skill set in Machine Learning Technology and Data Analysis.
  • The first project was in the health sector, where we used Python, Scala, R, Data Visualization, SQL, Cloud Infrastructure, and MLOps to develop a secure and efficient sales application.
  • Developed Generative Models and Large Language Models (LLMs) to analyze sales data and generate personalized sales portal content on Azure, leading to a 15% increase in customer conversion rates.
  • Employed Predictive Models of customer behavior to optimize inventory management and targeted sales approaches, resulting in a 10% reduction in inventory costs and dead stock.
  • Developed machine learning models for sales risk mitigation to ensure robust and reliable sales operations, decreasing sales-related risks by 20%, and established metrics and tools for evaluating the safety and robustness of the platform's AI components.
  • Wrote production-quality code in Python, integrating ML/DL frameworks such as TensorFlow for model development and deployment, leading to a 30% improvement in model deployment efficiency.
  • In the second project, we tackled challenges in the finance sector, employing Python, SQL, Data Visualization, Machine Learning, Blockchain, Cyber Security, Cryptography, and Predictive Analytics to build advanced fraud detection algorithms.
  • With the use of Blockchain and Cryptography growing rapidly, we designed a new cloud data structure for data storage, protected by a firewall, which resulted in a 25% reduction in fraudulent activities and a 15% increase in the accuracy of anomaly detection.
  • The implemented cloud architecture reduced the need for on-site equipment and technical support, resulting in a 40% decrease in infrastructure costs. Ensured 99.9% uptime and maximized revenue and customer satisfaction by implementing an LLM-based AI bot, and ensured its consistency through rigorous quality control and compliance with security standards.
  • Developed numerous predictive models and provided analytical insights that improved product performance by 35%, resulting in a 20% increase in the company's revenue.
  • Enhanced decision-making processes with real-time business analytics and visualizations for internal customers, monitoring growth to help identify future improvements to the application. Led a team of 15 Python developers and 10 Data Analysts across 5 different projects during my tenure, resulting in a 15% rise in the company's stock price and positioning the team at the forefront of ML development.
  • Skills: Machine Learning, Data Analysis, Python, Scala, R, Data Visualization, SQL, Cloud Infrastructure, MLOps, Generative Modeling, Large Language Modeling (LLMs), Predictive Modeling, Sales Optimization, Inventory Management, Risk Mitigation, TensorFlow, Blockchain, Cyber Security, Cryptography, Fraud Detection, Cloud Architecture, AI Chatbot Development, Quality Control, Revenue Growth, Business Analytics
Aug 2019 - Nov 2021 | Mumbai, India
Data Scientist
  • Efficiently orchestrated 2-3 Jenkins pipelines to support Continuous Integration and Continuous Delivery (CI/CD) processes, optimizing business process automation and streamlining software development workflows. Contributed to the seamless integration of CI/CD practices, resulting in enhanced development efficiency and faster delivery of software solutions.
  • Executed 10-20 weekly SQL database tasks, including the creation and execution of real-time data report queries, ensuring accurate data retrieval and analysis. Played a key role in data management, providing valuable insights to support decision-making processes within the organization.
  • Provided leadership in managing 4-6 diverse deployment requests every week, encompassing event handling, Informatica workflows, Autosys jobs, Oracle Forms and Reports, and OpenShift deployments. Ensured efficient and timely execution of projects, maintaining a high standard of project delivery and client satisfaction.
  • Collaborated within an Agile framework, working closely with leads and managers as an operations engineer for 10-30 weekly procedural changes. Adapted to evolving project requirements and contributed to the agility and responsiveness of the development process.
  • Implemented automation processes using Blue Prism and Robotic Process Automation (RPA) technologies, resulting in a substantial saving of 10 weekly manual reporting hours. Notably improved client relations by ensuring accurate and timely reporting, enhancing overall client satisfaction.
  • Successfully orchestrated the migration of a vast database, encompassing over 1,000,000 records, showcasing technical expertise in managing large-scale data transitions. Executed the migration seamlessly, ensuring data integrity and minimal disruption to operations.
  • Proficiently resolved 3-5 daily client-IT issues through systematic root cause analysis (RCA) and swift troubleshooting, ensuring operational stability and minimizing downtime. Demonstrated exceptional problem-solving skills and a commitment to delivering effective solutions to client challenges.
  • Developed comprehensive support documentation, facilitating training for 5 teams across the organization. Contributed to a 20% reduction in resolution time by equipping teams with the necessary resources and knowledge to address issues efficiently.
  • Skills: Python, Agile, Automation, Autosys, Blue Prism, CI/CD, Client Satisfaction, Cloud Migration, Data Analysis, Data Management, Database Management, DevOps, Documentation, Event Handling, Informatica, Jenkins, Leadership, OpenShift, Oracle Forms and Reports, Problem Solving, Process Improvement, Project Management, Reporting, RPA, Root Cause Analysis (RCA), SQL, Troubleshooting, Software Development, System Administration, Team Work.
Feb 2018 - May 2019 | Mumbai, India
Data Analyst
  • Conducted in-depth data analysis and consistently fulfilled ad-hoc requests from various stakeholders.
  • Collaborated effectively with cross-functional teams, optimizing data-driven decision-making processes and boosting overall productivity by a notable 20%.
  • Executed rigorous unit testing procedures, comprising over 100 test cases, to validate data precision and software functionality.
  • Achieved a remarkable 50% reduction in production bugs by implementing comprehensive testing methodologies and proactive debugging strategies.
  • Proactively monitored system logs to ensure the seamless operation of scheduled automated interfaces, including Autosys-driven processes, SQL queries, and ETL (Extract, Transform, Load) operations.
  • Implemented optimization measures to enhance system efficiency and minimize downtime.
  • Demonstrated expertise in UNIX shell scripting, enhancing 2-5 shell scripts per week to automate critical tasks.
  • Scripted processes for triggering Oracle packages, facilitating efficient data import and export operations, and automating email reporting, streamlining essential workflows.
  • Skills: Data Analysis, Ad-hoc Reporting, Cross-functional Collaboration, Data-driven Decision Making, Unit Testing, Software Testing, Bug Reduction, System Monitoring, System Optimization, UNIX Shell Scripting, Oracle Packages, Data Import/Export, Email Automation, ETL (Extract, Transform, Load)
Nov 2016 - Dec 2017 | Thane, India

Projects

Twitter Sentiment Analysis

Sentiment Analysis application developed in Python.

Accomplishments
  • Tools: Python, HTML/CSS, Bootstrap, SQLite, AWS S3
  • Microblogging has become a very popular communication tool among Internet users. Millions of messages appear daily on popular websites that provide microblogging services, such as Twitter, Tumblr, and Facebook.
  • Authors of those messages write about their lives, share opinions on a variety of topics, and discuss current issues.
  • Because of the free format of messages and the easy accessibility of microblogging platforms, Internet users tend to shift from traditional communication tools (such as traditional blogs or mailing lists) to microblogging services.
  • As more and more users post about products and services they use, or express their political and religious views, microblogging websites become valuable sources of people’s opinions and sentiments. Such data can be used efficiently for marketing or social studies.
Wine Quality Prediction

Predicting the Quality of Red Wine using Machine Learning Algorithms for Regression Analysis, Data Visualizations and Data Analysis.

Accomplishments
  • Tools: Python, ML Framework, Tableau, Jupyter, R Studio
  • Datasets are selected based on the past quarter's manufacturing activity.
  • The predicted quality ranges from 1 (highest) to 10 (lowest).
Women Safety App

An Android-based application with GPS support for women's safety in emergency situations.

Accomplishments
  • An Android app that instantly alerts the user's guardians (along with the user's location) whenever the user is in an emergency situation. It is triggered simply by shaking the Android device on which the app is installed.
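The shake trigger described above can be sketched roughly as follows. This is only an illustration in Python. The actual app would read the device accelerometer through the Android sensor APIs, and the source does not describe its detection logic. The function name, the 2.5 g threshold, and the minimum of three peaks are all assumptions.

```python
import math

GRAVITY = 9.81  # standard gravity in m/s^2


def is_shake(samples, threshold_g=2.5, min_peaks=3):
    """Return True if enough accelerometer samples exceed the g-force threshold.

    samples: iterable of (x, y, z) accelerations in m/s^2.
    threshold_g and min_peaks are hypothetical tuning values.
    """
    peaks = 0
    for x, y, z in samples:
        # Magnitude of the acceleration vector, expressed in multiples of g
        g_force = math.sqrt(x * x + y * y + z * z) / GRAVITY
        if g_force > threshold_g:
            peaks += 1
    return peaks >= min_peaks


# Made-up readings: a phone lying flat vs. one being shaken
resting = [(0.0, 0.0, 9.81)] * 10
shaking = [(30.0, 5.0, 9.81), (0.0, 0.0, 9.81)] * 4

print(is_shake(resting))
print(is_shake(shaking))
```

Requiring several peaks rather than a single one is one way to avoid sending an alert when the phone is merely bumped or dropped.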
Smart Parking System

The Smart Parking App is a mobile application developed for the Android platform.

Accomplishments
  • The SmartPark system is made up of interchangeable components and fully integrates parking, guidance, payment, and analytics, as well as a host of other complementary services and options.
Leaf-Analysis-And-Prediction

An image detector and analyzer based on the concept of RNN developed in Keras

Accomplishments
  • The Leaf Analysis and Prediction system analyzes all categories of leaves and predicts their possible health conditions, determining each leaf's type, age, habitat, and the environment in which it grows.
Personal Virtual Assistant

The project worked as a chatbot on a Raspberry Pi, operated entirely through speech synchronization with online support to reduce errors.

Accomplishments
  • The product was handy, smaller than a mobile phone, and similar to the Google Home Mini or Amazon Echo, but it worked on lower bandwidth and could also be used while travelling.

Skills

Languages and Databases


Python

R

JavaScript

HTML/CSS

MySQL

PostgreSQL

Oracle

Libraries


NumPy

Pandas

OpenCV

Scikit-Learn

Matplotlib

ETL

Frameworks


Tableau

Power BI

Hadoop

Bootstrap

Keras

TensorFlow

PyTorch

Applications


Git

AWS

Azure

GCP

Docker

Certificates

Certified Data Scientist (IBM)
Certified Data Science Professional (Johns Hopkins University)
Certified Data Analyst (Google)
Certified Cloud Practitioner (AWS)

Education

University of Texas at Arlington

Arlington, TX, USA

Degree: Master of Science in Data Science
CGPA: 3.2/4.0

    Relevant Coursework:

    • Advanced DBMS
    • Cloud Computing
    • Data Science
    • Machine Learning
    • Big Data Analysis
    • Foundations of Computing
    • Probability and Statistics
    • Advanced Engineering Economy
    • Data Mining
    • Applied Linear Regression

Shivajirao S Jondhale College of Engineering, Dombivli, Mumbai University

Mumbai, India

Degree: Bachelor of Engineering in Computer Engineering
CGPA: 3.7/4.0

    Relevant Coursework:

    • Data Structures and Algorithms
    • Database Management Systems
    • Operating Systems
    • Machine Learning
    • Computer Vision

Contact

Blogs

Titanic Disaster Survival Prediction System

Machine Learning from Disaster

Information
  • About: The Titanic sank on April 14–15, 1912
  • Challenge: The objective of this competition is to learn from a given training data set to predict who would survive or perish in the Titanic disaster
  • Submission: Original score was 77%
  • Contribution: My improved score is 78%
Facial Emotion Expressions

Image Classifier using CNN for facial emotion expressions dataset

Information
  • About: An image classification task for facial expressions using CNN to gain maximum accuracy
  • Challenge: The objective of this assignment is to train a model on a given data set so that it can classify the images by the emotion they express.
  • Basic: The baseline code ran with an accuracy of 80% but was overfitting.
  • Contribution: Raised the accuracy to 90% and above by reducing the training dataset and adding regularization.
Rotten Tomatoes Reviews

Text classifier using a Naive Bayes Classifier built from scratch

Information
  • About: A text classification task labeling Rotten Tomatoes reviews as fresh or rotten using NBC to gain maximum accuracy
  • Challenge: The objective of this assignment is to train a model on a given data set so that it can classify the texts by their freshness.
  • Basic: Implementing a Naive Bayes Classifier from scratch and using it to classify the given dataset
  • Contribution: Raised the accuracy to 85% and above with hyperparameter tuning and additive smoothing.
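As a rough illustration of the approach named above, here is a minimal from-scratch multinomial Naive Bayes classifier with additive (Laplace) smoothing. It is a sketch under assumptions, not the code from the blog post. The class name, the whitespace tokenizer, and the toy reviews are invented for illustration, and `alpha` stands in for the kind of hyperparameter that could be tuned.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength, a tunable hyperparameter

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            # Smoothed denominator: class word total + alpha * vocabulary size
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy reviews invented for illustration only
train_docs = ["a brilliant moving film", "dull and lifeless plot",
              "brilliant acting", "a dull mess"]
train_labels = ["fresh", "rotten", "fresh", "rotten"]

model = NaiveBayesText(alpha=1.0).fit(train_docs, train_labels)
print(model.predict("brilliant film"))
```

Without the `alpha` term, a word that never appeared in one class during training would give that class a probability of zero; smoothing keeps every known word's probability strictly positive.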
Leaf Disease Classification

Image classifiers (CNN, SVM, Decision Trees, Naive Bayes Classifier, Random Forest Classifier, and kNN) applied to the dataset and compared against one another.

Information
  • About: An image classification task that labels leaves as Healthy or Diseased and identifies which disease has occurred, using NBC, kNN, SVM, CNN, and others to find the most accurate among them.
  • Challenge: The objective of this assignment is to train a model on a given data set so that it can classify the images by their disease category.
  • Basic: Implementing a classification task from the beginning and using it to classify the given dataset
  • Contribution: Raised the accuracy to 93% with the CNN, comparing it against various other classifiers with hyperparameter tuning.