In a recent Zarego Session, I set out to answer a very practical question: what does it actually mean to build artificial intelligence into real products? Not demos or buzzword-driven prototypes, but systems that can be reasoned about, measured, and improved over time.
I structured the session in two parts. The first half focused on concepts: how to think about AI as an engineering discipline, what kinds of problems it can realistically solve, and why some approaches make more sense than others. The second half moved into live coding, using a concrete dataset to show how those ideas translate into working software.
This article covers that first part: the mental models, distinctions, and practical criteria I believe matter before you ever write a line of AI-related code.
Not Everything Is an LLM
One of the first points I wanted to make is also one of the most important: large language models are powerful, but they are not universal solutions.
I often meet teams that immediately reach for an external LLM API whenever “AI” comes up. Sometimes that’s the right choice. Many times, it isn’t. LLMs are expensive to run, introduce latency, rely on third-party policies, and are optimized for a very specific class of problems: language understanding and generation.
Using them for problems that simpler models could handle is often overkill. Worse, it can reduce the value of your product. If your “AI” is just a thin layer on top of someone else’s platform, you don’t truly control it. Pricing changes, availability issues, regional restrictions, or shifting terms of use can instantly undermine your solution.
Building your own models where it makes sense gives you control, predictability, and real product differentiation. That doesn’t mean avoiding LLMs. It means choosing the right tool for the problem instead of defaulting to the most fashionable one.
Machine Learning vs Deep Learning
Another source of confusion I see frequently is the way people use machine learning and deep learning as if they were synonyms. They’re not.
Machine learning, in its classical sense, refers to models that can usually be described mathematically in a fairly explicit way. Algorithms like k-nearest neighbors, linear regression, logistic regression, or decision trees fall into this category. These models require human decisions during setup: how features are represented, how parameters are tuned, and how performance is evaluated.
Deep learning relies on large neural networks with many layers. Once defined, these models learn internal representations automatically during training. They can be extremely powerful, but they are also opaque. You generally know what they do, but not why a specific prediction was made.
Neither approach is inherently better. Some problems are well suited to deep learning. Others are solved more effectively, cheaply, and transparently with traditional machine learning. Many real-world projects benefit from trying multiple approaches and choosing based on measured performance rather than assumptions.
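As a concrete illustration of that last point, a few lines of scikit-learn are enough to compare two model families on the same data and let measured performance decide. This is a sketch with synthetic data, not the session's example, and both models here are classical for brevity; the same pattern extends to deep models:

```python
# Compare two model families on the same data using cross-validation,
# and choose based on measured performance rather than assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```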
For the example I used in the session, I focused intentionally on machine learning, precisely because it allows us to see, measure, and reason about each step.
Two Types of Problems: Classification and Regression
Once you move past the tools, the next question is: what kind of problem are you actually solving?
Most applied machine learning tasks fall into one of two categories: classification or regression.
Classification problems are about deciding which category something belongs to. A new email arrives and the system decides whether it’s spam or not. A person enters a building and the system decides whether they’re an employee, a visitor, or a threat. The output is a class, often accompanied by a probability.
Regression problems are about predicting a number. Estimating energy consumption, forecasting demand, or predicting how long a process will take all fall into this category. The model doesn’t choose a box; it produces a value on a continuous scale.
This distinction matters because not every algorithm can solve both types of problems. Logistic regression, for example, is designed for classification. Linear regression is designed for regression. Others, like random forests or gradient boosting methods, can be adapted to either.
Understanding the problem type early narrows your choices and prevents misusing tools that were never designed for the task.
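Here is a minimal sketch of that distinction in scikit-learn, using its synthetic dataset helpers; the data is purely illustrative:

```python
# Classification predicts a category; regression predicts a number.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the target is a discrete class (e.g., spam / not spam).
X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))        # class labels, e.g. [0 1 0]
print(clf.predict_proba(X_cls[:3]))  # a probability for each class

# Regression: the target is a continuous value (e.g., energy consumption).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))        # numbers on a continuous scale
```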
Choosing and Combining Models
With the problem type defined, you can start looking at specific models. In the talk, I briefly surveyed common options: k-nearest neighbors, linear and logistic regression, decision trees, random forests, XGBoost, and related ensemble methods.
Ensemble techniques are particularly interesting because they combine multiple models to improve performance. Bagging, boosting, and stacking all attempt to reduce errors by leveraging different perspectives on the same data.
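As a sketch, here is a voting ensemble in scikit-learn that combines three of the models mentioned above. The data is synthetic and the setup illustrative, not a recommendation:

```python
# A simple ensemble: combine several base models and let them vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",  # average the predicted probabilities across models
)

scores = cross_val_score(ensemble, X, y, cv=5)
print(f"ensemble mean accuracy: {scores.mean():.3f}")
```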
But “better” here needs to be treated carefully. Which brings us to one of the central themes of the session.
What Does “Good” Actually Mean?
Accuracy is an appealing word, but it’s also dangerously vague.
An AI model used to assist with aircraft landing needs to meet a vastly different error threshold than one used to count customers entering a store. Both can be “accurate,” but the acceptable margin of error is completely different.
This is why measuring model performance is not optional. It’s how you decide whether a system is usable, safe, or valuable in its intended context.
Measuring Error: A Mathematical Approach
At the core of most evaluation metrics is a simple idea: compare what the model predicted with what actually happened.
If the prediction perfectly matches reality, the error is zero. In practice, that almost never happens. There’s always some difference, and that difference is what we measure.
Mean Absolute Error (MAE) takes all those differences, ignores whether they’re positive or negative, and averages them. It answers the question: on average, how far off are we?
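In symbols, with $y_i$ the actual value, $\hat{y}_i$ the model’s prediction, and $n$ the number of observations:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$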

Mean Squared Error (MSE) goes a step further by squaring each difference before averaging. This penalizes large mistakes more heavily than small ones, which is useful when big errors are especially costly.
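Using the same notation:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$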

Root Mean Squared Error (RMSE) simply takes the square root of MSE to bring the error back to the original unit of measurement. This makes it easier to interpret while preserving the emphasis on larger errors.
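That is:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$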

All of these metrics are simply different ways of observing the same idea: the gap between prediction and reality. What matters is that these numbers only make sense within the context of the specific model being trained. They are not absolute values, nor are they meant to be compared across unrelated models. Each model defines its own scale of what “good” looks like.
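To make this concrete, here is how the three metrics can be computed with scikit-learn and NumPy. The values below are toy numbers, not results from the session:

```python
# Computing MAE, MSE, and RMSE on a handful of made-up predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # what actually happened
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # what the model predicted

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # bring the error back to the original unit

print(f"MAE:  {mae:.3f}")   # average absolute gap
print(f"MSE:  {mse:.3f}")   # squaring penalizes large errors
print(f"RMSE: {rmse:.3f}")  # interpretable in the original units
```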
Classification Metrics and AUC-ROC
For classification problems, error measurement looks a little different. Instead of focusing only on “right” or “wrong,” we often care about how well the model ranks probabilities.
AUC-ROC, the area under the ROC curve, captures this idea. Rather than asking whether a prediction crossed a fixed threshold, it measures how well the model separates positive cases from negative ones across all possible thresholds.
Intuitively, it answers this question: if you randomly pick one positive example and one negative example, how likely is the model to assign a higher probability to the positive one?
An AUC close to 1 means the model is very good at discrimination. An AUC close to 0.5 means it’s essentially guessing. What matters is not just the final number, but how quickly the model starts making meaningful distinctions as it sees more data.
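A minimal sketch of computing this in practice with scikit-learn; the labels and probabilities below are toy values:

```python
# AUC-ROC from predicted probabilities: how well does the model rank
# positive examples above negative ones?
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])            # real classes
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # positive-class probabilities

auc = roc_auc_score(y_true, y_score)
print(f"AUC-ROC: {auc:.3f}")  # close to 1.0 = strong separation, 0.5 = guessing
```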
Data Encoding: Making Data Make Sense
Before any model can learn, the data itself has to be usable. This often requires encoding.
Machine learning algorithms work with numbers. If you feed them raw categories like city names or labels, they will happily perform arithmetic on them in ways that make no sense. Treating “Paris,” “Madrid,” and “Moscow” as numeric values leads to meaningless relationships.
Encoding transforms these values into representations that preserve meaning without introducing false structure. If you skip this step, your model may still produce outputs, but they won’t be reliable.
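Here is what that looks like in practice with one-hot encoding in pandas; the city column is a made-up example:

```python
# One-hot encoding: turn categories into independent 0/1 columns,
# instead of pretending city names are numbers you can do math on.
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Madrid", "Moscow", "Paris"]})

# Mapping cities to integers would imply an ordering like Paris < Madrid,
# which is meaningless. One-hot encoding gives each city its own column.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)  # columns: city_Madrid, city_Moscow, city_Paris
```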
This is where data preparation becomes just as important as model selection. A simple model with well-prepared data will often outperform a complex model fed with poorly encoded inputs.
From Concepts to a Concrete Example
After laying out these ideas, I moved into a practical exercise: predicting survival on the Titanic.
The dataset includes basic passenger information such as age, sex, and ticket price. The goal is to estimate the likelihood that a given passenger survived. This is a classification problem, and it’s a great teaching example because it’s small, well understood, and still rich enough to demonstrate real trade-offs.
In the second half of the session, these concepts are turned into code using standard Python tools like pandas, NumPy, and scikit-learn. The model is trained, evaluated, and interpreted step by step, showing how abstract ideas become concrete decisions in a real project.
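To give a flavor of that second half, here is a heavily condensed sketch of such a pipeline. It is not the exact code from the session; for convenience it uses the copy of the Titanic data bundled with seaborn:

```python
# A condensed Titanic pipeline: prepare data, encode, train, evaluate.
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load the dataset and keep a few basic passenger features.
df = sns.load_dataset("titanic")[["survived", "age", "sex", "fare"]].dropna()

# Encode the categorical 'sex' column; age and fare are already numeric.
X = pd.get_dummies(df[["age", "sex", "fare"]], columns=["sex"], drop_first=True)
y = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate with AUC-ROC: how well are survivors ranked above non-survivors?
probs = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, probs):.3f}")
```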
Below, I’ve embedded the recording of the live coding portion, where the model is built and evaluated in real time.
You can find the full implementation here:
Notebook with the full example · Download code
Why This Matters for Real Products
The goal of this session wasn’t to turn everyone into a data scientist. It was to demystify AI enough that teams can make informed engineering decisions.
Practical AI isn’t about chasing the newest model or maximizing accuracy at all costs. It’s about understanding your problem, choosing appropriate tools, preparing data carefully, and measuring results against real-world constraints.
That mindset is how AI becomes part of a product rather than a buzzword bolted on top.
At Zarego, this is how we approach AI and machine learning across industries: grounded in product needs, transparent in behavior, and designed to evolve as requirements change. If you’re exploring how AI could fit into your product in a meaningful, sustainable way, we’d be happy to talk.