what is a DATA SCIENTIST™ and what do they do?
The origins of AI Engineers, Machine Learning Engineers, and more
TL;DR they look at data real good
Intro
Long ago, in the age before LLMs (before everyone’s grandma’s second cousin’s dog was an AI expert), there were statisticians, engineers, and business analysts who lived in harmony (kind of). Then, everything changed when big data attacked and only the DATA SCIENTIST™, master of all three areas, could handle it.
I remember the exact moment I started to get into data science, machine learning (ML)/AI, etc. I remember exactly where I was and what I was doing; Trump was winning the 2016 election and I was studying for a linear algebra test. I remember a girl in the study room at the time crying because he got elected. My first thought was “Ain’t no way this is real…”
At the time, the one statistician I knew, Nate Silver, had nailed the 2012 election (correctly predicting the outcome in all 50 states) and then somehow got the 2016 election dead wrong. I remember thinking “how in the world did he fuck up this bad” and my next thought being “…I think I could do this better”. I was going to become what Harvard Business Review called “the sexiest job of the 21st century”: a Data Scientist.
Needless to say, 9 years, a few degrees and jobs later, I understand what my boy Nate Silver was going through.
What’s a DATA SCIENTIST™?
So what the hell is a data scientist? At its core, it’s simply someone who uses data to help people/businesses solve problems. The way the role was marketed to me back in the day was that it sat at the intersection of Statistics, Math, and Computer Science, and my nerdy, indecisive self loved that.
People used to call DATA SCIENTIST™ the “sexiest job of the 21st century” (because everyone knows the sexiest job is working with dirty, dirty data all day long), but back in the ancient days (pre-2012ish) there were other titles that were rough equivalents of that neat little title: Statistician, Data/Research/Quantitative/Business Analyst, Actuary, etc. So definitely a bit of extra marketing there.
But even in those days, as now, most of “data science” was just cleaning and preparing the data before letting the ML do its thing. For those of you unfamiliar, that meant:
Collecting data from different sources
Merging inconsistent datasets
Handling missing values, outliers, and duplicates
Fixing schema or formatting errors
Transforming raw data into usable features
(e.g. calculating a “length of membership” feature using customer_end_date - customer_start_date)
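To make that list a bit more concrete, here is a minimal pandas sketch of a few of those steps. Everything in it is made up for illustration: the two tiny tables, the column names other than customer_start_date and customer_end_date, the clean() helper, the median fill, and the as_of cutoff date are all assumptions, not anyone's real pipeline.

```python
import pandas as pd

# Hypothetical raw data from two sources; all names and values are invented.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],  # customer 2 appears twice (duplicate row)
    "customer_start_date": ["2021-01-01", "2021-06-15", "2021-06-15", "2022-03-01"],
    "customer_end_date": ["2021-12-31", "2022-06-15", "2022-06-15", None],
})
spend = pd.DataFrame({
    "cust_id": [1, 2, 3],  # same key as above, but named differently
    "monthly_spend": [20.0, None, 35.0],
})


def clean(customers, spend, as_of="2023-01-01"):
    """Deduplicate, fix formats, merge, fill gaps, and build one feature."""
    df = customers.drop_duplicates().copy()

    # Fix formatting: the dates arrive as strings, so parse them.
    for col in ["customer_start_date", "customer_end_date"]:
        df[col] = pd.to_datetime(df[col])

    # Merge inconsistent datasets: align the key name, then left-join.
    spend = spend.rename(columns={"cust_id": "customer_id"})
    df = df.merge(spend, on="customer_id", how="left")

    # Handle missing values: fill missing spend with the median (one of many options).
    df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

    # Feature: length of membership. Customers with no end date are treated
    # as still active and measured up to the as_of date (an assumption).
    end = df["customer_end_date"].fillna(pd.Timestamp(as_of))
    df["length_of_membership_days"] = (end - df["customer_start_date"]).dt.days
    return df


cleaned = clean(customers, spend)
print(cleaned)
```

Real datasets are messier than this, and choices like "fill with the median" or "treat a missing end date as still active" are judgment calls that depend on the problem.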
And that is by no means an exhaustive list. (Obviously you could just shove as much data as you want into the ML algo without cleaning it but it goes back to the age old adage: Garbage In, Garbage Out.)
Ok cool, so you have your data all “clean”. That alone is like 80% of the job. Now what are you gonna do with it?
Well it turns out you can do a whole bunch of things through feeding it into the right machine learning algorithm. A small sample of what you can do:
Predict who will churn before they cancel a subscription, and automatically trigger retention offers.
Build a recommendation engine that suggests movies, songs, or products (like Netflix or Amazon).
Detect fraud in real time by spotting unusual transaction patterns in millions of records.
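For a taste of the "spotting unusual patterns" idea, here is a toy sketch using only the Python standard library. It flags transaction amounts with a large modified z-score (based on the median and the median absolute deviation). The flag_unusual() name, the sample amounts, and the 3.5 cutoff are assumptions for illustration; an actual fraud system would look at far more than a single column of amounts.

```python
from statistics import median


def flag_unusual(amounts, threshold=3.5):
    """Return indices of amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread at all, so nothing stands out
    # 0.6745 rescales the MAD so the score is comparable to a regular z-score.
    return [
        i for i, a in enumerate(amounts)
        if 0.6745 * abs(a - med) / mad > threshold
    ]


# Hypothetical transaction amounts; index 5 is the obviously odd one.
transactions = [12.5, 9.99, 14.2, 11.0, 13.75, 950.0, 10.4]
suspicious = flag_unusual(transactions)
print(suspicious)
```

The median-based score is used here instead of a plain mean and standard deviation because a single huge transaction would inflate both of those and partly hide itself.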
The work data scientists do and the problems they solve depend heavily on the data and the company, as each company has its own unique problems to solve. I could spend all day listing them and probably all year explaining them. At that point, just go to college.
(For those of you interested in further reading on the statistical tools and machine learning, I highly recommend The Elements of Statistical Learning, which is completely free online.)
The Current State of Data Scientists
The description for data scientists may be “those who use data to help people/businesses solve problems”, but that’s so very generic. I feel that job description encompasses half the goddamn world of white collar professions now.
Nowadays, I’m not even quite sure myself what data scientists do from job to job or industry to industry. The jobs I’ve come across are far and wide in terms of what they require and, wouldn’t you know it, what they want from you based on these job descriptions is often different from what the job is actually like.
In my opinion, data scientists are still people hired to solve problems using data. That has not changed. But the means and the methods have.
The titles have become so very convoluted: I’ve seen pretty much every variation of (Data/AI/ML/Analytics/etc.) + (Engineer/Analyst/Scientist/etc.). Of course there’s a buttload of other titles that deal with analyzing data and creating solutions in specific industries, like financial/process analyst, but I think the data scientist title is somewhat sought after. (I still remember how an old job simply changed all the Statisticians to Data Scientists without really changing the work.)
Data scientists spend all day reducing variance in their models, but the data scientist job descriptions have the highest variance of them all.
A typical data scientist job nowadays can range from machine learning, to A/B testing, even to linear optimization problems and GenAI.
Here’s just a small sample of job descriptions:



Same title, and yet wildly different responsibilities. But that’s to be expected with the breadth of fields and the unique data challenges each industry faces.
For instance, Netflix might want someone more familiar with recommender systems and A/B testing, while Meta might want someone who is essentially just a software engineer with extra steps. Marketing/growth roles might want experience with churn models, etc., etc. Down the rabbit hole we go. Pretty soon people start asking you questions about biostatistics and geospatial statistics and you can’t help but wonder “how did I get here?” and “did I take a wrong turn somewhere?”.
The technologies used have shifted quite drastically or not so drastically (depending on where you were the past 10 years or so).
The tech stack these days is really simple! You got your core three: Python, R, SQL (or, god forbid, SAS 🤮) for scripting/programming and querying data. Then you got your big three cloud service providers: AWS, GCP, and Azure. Maybe some cloud-native stuff like Snowflake and Databricks. Then you got your data visualization tools like Tableau, Power BI, and Python visualization libraries like matplotlib, seaborn, plotly (or the far superior R libraries like ggplot2). And of course, we need the NumPy and Pandas libraries for data! That almost goes without saying.
Ok got that? Good! Now for the ML frameworks in Python: scikit-learn, PyTorch (and/or TensorFlow, Keras). You want to make interactive apps, you got Streamlit, Dash, Gradio, and Shiny, or just use a Jupyter notebook. But if you wanna get serious about deployment and MLOps, we got Kubernetes, Docker, MLflow, and FastAPI/Flask.
Need a breather? Too bad! Now there’s GenAI stuff: frameworks like LangChain and LlamaIndex, vector databases like Pinecone, Weaviate, and Chroma, and increasingly common AI tools like Cursor and whatever built-in GenAI platform each cloud service provider has.
AND don’t forget we need to use Git for version control too!
SO Easy! SO simple!
Jokes aside, this isn’t to say that every data scientist needs to know all of these. But based on the jobs I’ve seen, I think it really is necessary to know a few from each bucket and have a general idea of when/why you would use them. The tools, tech, and responsibilities have already splintered into several roles. The main ones I see often are data analyst, machine learning engineer, and data engineer.
Data analysts do exactly what you’d think: they analyze data. I think the main difference here is they don’t often use machine learning or AI on the data.
There’s machine learning engineers, who sit as the bastard child of software engineering and data science, with a hint of data engineering. They seem to handle making the prototype machine learning models production ready.
Data engineers…I still don’t quite understand what they do. They get data for you I think?
There’s also the concept of a full stack data scientist. The unicorn who does it all (often an underpaid, overworked unicorn).
I feel “data scientist” has turned into a catch-all term for every single data skill under the sun. On top of LeetCode/coding interviews, there’s system design interviews and statistics interviews, and on top of that I have to have social skills???
It’s a LOT and somehow it just doesn’t seem fair. But then again…
Predictions for the Future
I don’t think data scientists will be around anymore.
Not because “DATA SCIENTIST IS DEAD” or whatever the alarmists on LinkedIn are saying, but because the field is simply starting to mature. There’s better understanding of the role of data and ML/AI solutions (not least because everyone has to hear about it on a daily basis). People are starting to know what they want for their specific data solutions.
I honestly do see this job fading away, or splintering into different jobs, or at least having more clearly defined boundaries for job functions. I think people are starting to pick up on the differences between, say, a data engineer, data analyst, or machine learning engineer. Though it is very clear that there is still significant overlap between them. For instance, I’m still not sure what the difference is between, say, a machine learning engineer and a full stack data scientist (if there even is a difference).
I keep saying DATA SCIENTIST™ in a tongue-in-cheek sort of way because that title is definitely a bit of marketing. That term was coined for a reason. And now, with the evolution of new tech, we have all these new exciting jobs. Like there’s *AI ENGINEERS*, and *AI SCIENTISTS*, and *RESEARCH SCIENTISTS*.
Same shit. Different day.
Behind it all, there’s always the fundamentals of math, statistics, and good coding.
Take it from Claire Longo who’s been in the game a lot longer than I have:
There’s a new wave of titles: AI Engineer, LLMOps Engineer, Prompt Engineer… And it’s giving me déjà vu.
Around the time I graduated, the same thing was happening with “Data Scientists.” Everyone wanted one, and no one knew what one looked like. The job titles Data Scientist, ML Engineer, and Data Analyst were chaos. Companies hired folks under the same title for totally different roles as every company wanted big data but didn’t know what skill sets to hire for or what tools to provide them with.
[…]The same is happening again for AI Engineers. We’re in the messy middle of defining a discipline that doesn’t fully exist yet. There are tons of nuanced “AI roles” that feel more like specialties within a discipline than a discipline itself.
So what should you do?
Focus on fundamentals. Ignore the title. Build AI and build your resume.
And there’s real comfort in knowing that the fundamentals don’t change.
If it’s tabular, XGBoost is all you need, and linear/logistic regression will always be ol’ reliable (except when it’s not). Good coding will always be good coding no matter how many vibe coders there are. Bad data in means bad predictions out.
I’m not trying to be a Luddite or saying that these new skills and tools shouldn’t be learned, I just mean that you shouldn’t needlessly shove in some fancy new tech just for the sake of using it. By all means, learn the new stuff. Just don’t turn every dataset into an excuse to deploy a 70B-parameter LLM when a regression model would’ve done just fine.
Because at the end of the day it always goes back to one fundamental question:
How can we develop value for our beloved shareholders?
So all this to say: what is a data scientist? And what do they do? It’s whatever they need to be to deliver value for the stakeholders.













