How AI Learns (Part 1): Machine Learning

Written by
Diana Cheung
16 mins
Aug 29, 2025

TLDR - How AI Learns: Machine Learning Basics, Data Types & Algorithms

  • AI learns via machine learning (ML): training models on data to make predictions.
  • Data types in ML: numerical (discrete, continuous) vs categorical (nominal, ordinal).
  • Learning modes: supervised, unsupervised, semi-supervised, self-supervised, reinforcement.
  • Key algorithms: gradient descent, regression, clustering, association rule mining, dimensionality reduction.
  • Applications: prediction, classification, anomaly detection, recommendation systems, robotics.

Introduction 

Recall that AI is a CS discipline that focuses on creating machines that mimic intelligent human behavior. Humans learn by trial and error, pattern recognition, and extrapolating past experiences. But how does AI learn? Machine learning (ML) is a branch of AI that explores this. In this article, we will dive into some core concepts and algorithms of ML that empower computers to learn from data and make predictions without explicit programming.

An artist’s illustration of artificial intelligence. Source: Photo by Google DeepMind on Unsplash (accessed 07/19/2025)

Data

Data is highly important in machine learning, as it is used to train the models. There are two main types of data based on representation: numerical and categorical.

A diagram of numerical vs. categorical data types. Source: https://www.legac.com.au/blogs/further-mathematics-exam-revision/further-mathematics-unit-3-data-analysis-types-of-data (accessed 07/19/2025)

However, the raw dataset values are rarely used for the actual training. Instead, conversion is performed to turn numerical or categorical data into useful floating-point values. These floating-point values make up the feature vectors (containing the characteristics of data points) that are fed into a machine learning model for training.

Numerical Data (Quantitative)

Numerical data consists of integers or floating-point values that behave like numbers, meaning they are measurable, countable, or additive. Time series data are often considered numerical data when each data point in the series is a number, such as sensor readings.

Google's Machine Learning Crash Course explains that although US postal codes are composed of five-digit numbers, they don't behave like numbers nor represent mathematical relationships. Instead, they represent specific geographic areas and thus are considered categorical data.

There are two types of numerical values:

  • Discrete: Represents numerical values that are countable and distinct, with a finite number of possible outcomes.
    • Examples: Number of cars in a parking lot, number of defects in a product, or number of button clicks.
  • Continuous: Represents numerical values that are measurable, with an infinite number of possible values within a range. It can include values with decimal points.
    • Examples: Temperature readings, stock prices, or race completion times.

Common feature engineering conversion techniques (a brief sketch follows this list):

  • Normalization: Converts numerical values into a standard range, so that features are on a similar scale. Scaling allows the model to learn appropriate weights for each feature, rather than paying too much attention to features with wide spans and not enough attention to those with narrow spans. 
  • Binning (or Bucketing): Converts numerical values into groups or bins of subranges. Binning is appropriate when the feature values are more clustered than linear.
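
As a quick illustration, here is a minimal sketch in Python with NumPy, using made-up temperature readings, of min-max normalization and equal-width binning:

```python
import numpy as np

# Hypothetical continuous feature: temperature readings in degrees Celsius
temps = np.array([12.5, 18.0, 21.3, 24.8, 30.1, 35.6])

# Normalization (min-max): rescale values into the standard range [0, 1]
normalized = (temps - temps.min()) / (temps.max() - temps.min())

# Binning (bucketing): map each value into one of three equal-width subranges
edges = np.linspace(temps.min(), temps.max(), num=4)   # 3 bins -> 4 edges
bins = np.digitize(temps, edges[1:-1])                  # bucket index 0, 1, or 2

print(normalized)  # all values now share a similar scale
print(bins)        # coarse bucket index per reading
```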

Categorical Data (Qualitative)

Categorical data consists of labels or values that classify objects or individuals. There is a specific set of possible values.

There are two types of categorical values:

  • Nominal: Categories without any inherent order.
    • Examples: Gender (Male/Female), car brands (Toyota, Tesla, BMW), or countries (United States, Japan, Spain).
  • Ordinal: Categories with a meaningful order, but without a consistent numerical difference.
    • Examples: Customer satisfaction levels (Poor, Average, Good, Excellent) or education levels (High School, Bachelor’s, Master’s, Ph.D.).

Common encoding techniques (a small sketch follows this list):

  • Low number of possible categories:
    • Vocabulary Encoding: Assigns a unique integer index for each unique categorical value. For example, with a categorical feature named car_color, the assignment could be (Red → 0, Blue → 1, Green → 2). However, the encoded integers aren't directly used for training, as the model would incorrectly imply an ordinal relationship. Vocabulary encoding is often a preprocessing step for further encoding.
    • One-Hot Encoding: Converts each categorical feature into a vector with length equal to the number of possible categorical values. Building upon vocabulary encoding, a "1" at the unique integer index position corresponds to the assigned categorical value (Red → [1.0, 0.0, 0.0], Blue → [0.0, 1.0, 0.0], Green → [0.0, 0.0, 1.0]). Possible categorical values are treated as distinct and unrelated, avoiding misinterpretation of ordinal relationships.
  • High number of possible categories:
    • Embedding: Transforms categorical data into numerical vectors that capture the relationships among various categories or objects. Unlike one-hot encoding, which is a fixed representation, embeddings are learned by projecting the initial data vectors from a high-dimensional space to a lower-dimensional space.
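
Here is a minimal sketch of vocabulary and one-hot encoding in plain Python, using the hypothetical car_color feature from above:

```python
# Vocabulary encoding: assign each unique category a unique integer index
vocab = {"Red": 0, "Blue": 1, "Green": 2}

# One-hot encoding: a vector per value with a 1.0 at the category's index
def one_hot(value, vocab):
    vector = [0.0] * len(vocab)
    vector[vocab[value]] = 1.0
    return vector

print(one_hot("Red", vocab))    # [1.0, 0.0, 0.0]
print(one_hot("Blue", vocab))   # [0.0, 1.0, 0.0]
print(one_hot("Green", vocab))  # [0.0, 0.0, 1.0]
```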

Learning Modes

The main differentiator among learning modes is whether labeled data (consisting of input-output pairs) is used for training. Labeled data allows for high-accuracy training, but it is time- and labor-intensive to produce because the labeling is performed manually by humans.

A table summarizing the five learning modes in machine learning.

Common Machine Learning Algorithms

Gradient Optimization

The concept of gradient stems from mathematics, representing the rate and direction of change of a function. It's a vector pointing in the direction where the function decreases or increases most rapidly. In machine learning, the gradient indicates how to change model parameters to most efficiently decrease or increase a target function, such as the error in supervised learning or the reward in reinforcement learning. Gradient optimization may be used across the learning modes.

Gradient descent is a fundamental optimization algorithm used to minimize the cost function. During training, the algorithm iteratively adjusts the model's parameters (e.g., coefficients in linear regression or weights in a neural network) to find the values that result in the lowest possible error. A loss function is a mathematical formula that measures the difference (or error) between the predicted value and the actual value for a single data point. A cost function aggregates the loss values across the entire training dataset, producing a single overall measure of performance.
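
As a minimal sketch (with made-up data), gradient descent for a simple linear model y ≈ w·x + b, using MSE as the cost function, might look like this:

```python
import numpy as np

# Made-up training data, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

w, b = 0.0, 0.0            # model parameters to learn
learning_rate = 0.01

for step in range(1000):
    y_pred = w * x + b
    error = y_pred - y                  # per-point error (basis of the loss)
    cost = np.mean(error ** 2)          # MSE: aggregates loss over the dataset
    grad_w = 2 * np.mean(error * x)     # gradient of the cost w.r.t. w
    grad_b = 2 * np.mean(error)         # gradient of the cost w.r.t. b
    w -= learning_rate * grad_w         # step opposite the gradient
    b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}, final cost={cost:.4f}")
```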

Gradient ascent is the opposite: an optimization algorithm used to maximize a likelihood function or reward. The algorithm iteratively adjusts the model's parameters toward the values that result in the highest probability or reward.

Supervised Learning

Supervised learning is a core machine learning approach where a model is trained using labeled data. Generally, 80 percent of the labeled data is used for training, and the remaining 20 percent is used for testing. Supervised learning solves two main problem types: regression (numerical continuous output) and classification (categorical output).

A diagram of supervised learning. Source: https://www.enjoyalgorithms.com/blogs/supervised-unsupervised-and-semisupervised-learning (accessed 07/19/2025)

Linear Regression

Linear regression is a simple statistical method that is used to predict a numerical continuous output. It seeks to find the best-fit line, which is a straight line that minimizes the difference (or error) between the provided output values and the predicted values. It assumes a linear relationship between the input and output. The Mean Squared Error (MSE) is commonly used as the cost function.

This method can make future predictions based on historical outcomes, applicable to industries such as sales, finance, and healthcare. For example, provided a dataset of house features (lot size, number of bedrooms, number of bathrooms, etc.) and price, linear regression can be used to learn the relationship between the house features and selling price. Once the relationship is established, it can be used to predict the price of other houses.
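
A minimal sketch with scikit-learn, using hypothetical house features and prices (the numbers below are made up for illustration):

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [lot size (sq ft), bedrooms, bathrooms] -> price ($)
X = [[5000, 3, 2], [7500, 4, 3], [3200, 2, 1], [9000, 5, 4]]
y = [350_000, 520_000, 240_000, 640_000]

model = LinearRegression()
model.fit(X, y)                        # learn the best-fit linear relationship

print(model.predict([[6000, 3, 2]]))   # predicted price for an unseen house
print(model.coef_, model.intercept_)   # learned weight per feature, plus bias
```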

Logistic Regression

Logistic regression is used for classification problems, predicting the probability that an input belongs to a specific class. It transforms a linear regression continuous output into a categorical output by using a sigmoid function. The sigmoid function maps any real number into a probability value ranging from 0 to 1, forming an S-curve. A threshold is set (commonly 0.5) to decide the class label. The parameters of the model are optimized using gradient ascent on the log-likelihood function.
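
For instance, a sketch of the sigmoid transformation and the 0.5 threshold, applied to hypothetical linear scores, looks like this:

```python
import numpy as np

def sigmoid(z):
    """Map any real number into a probability between 0 and 1 (an S-curve)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w·x + b) produced by the model for three inputs
scores = np.array([-2.0, 0.3, 4.1])

probabilities = sigmoid(scores)
labels = (probabilities >= 0.5).astype(int)   # 0.5 threshold decides the class

print(probabilities)   # approximately [0.12, 0.57, 0.98]
print(labels)          # [0, 1, 1]
```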

There are three main types:

  • Binomial Logistic Regression: Used for binary classification problems with only two possible categorical values (e.g., "yes" or "no").
  • Multinomial Logistic Regression: Used for three or more possible categorical values that are unordered (e.g., classifying eye color: "brown," "blue," or "green").
  • Ordinal Logistic Regression: Used for three or more possible categorical values that are ordered or ranked (e.g., classifying ratings: "low," "medium," or "high").

Logistic regression applies to many use cases across multiple industries, such as breast cancer diagnosis (binary classification of "benign" or "malignant"), handwritten digit recognition (multinomial classification of the digits 0-9), and customer service reviews (ordinal classification of "poor," "fair," "good," or "excellent").

Unsupervised Learning

Unsupervised learning uses only unlabeled data to find hidden patterns and relationships.

A diagram of unsupervised learning. Source: https://www.enjoyalgorithms.com/blogs/supervised-unsupervised-and-semisupervised-learning (accessed 07/19/2025)

Clustering

Clustering groups unlabeled input data based on similarities or differences.

There are three types of traditional "hard" clustering methods (a short sketch of K-Means and DBSCAN follows the list):

  • Centroid-Based (partitioning): Separates data points into a predefined number of clusters that are represented by centroids (central vectors). Each data point is assigned to the cluster with the closest centroid, based on the selected similarity measure, such as Euclidean Distance. Eventually, the centroid becomes the mean of all its assigned data points. This type of clustering is simple and efficient, but tends to perform better with spherical clusters. Use cases include customer segmentation, document classification, and image compression.
    • K-Means: The most popular algorithm that assigns data points to a predefined number of k clusters with random centroids. It iteratively updates the centroids to be the mean of their assigned data points until the sum of distances between the assigned data points and centroids is minimized. However, this algorithm isn't robust against outliers.
    • K-Medoids: Similar to the K-Means algorithm, but uses actual data points as cluster centers. A medoid of a cluster is the data point whose dissimilarities with all other points in the cluster are minimized. This algorithm is more robust to outliers.
  • Density-Based: Creates clusters by identifying high-density regions within the data points and marks data points in low-density regions as outliers. This type of clustering can find arbitrarily shaped clusters, is robust against outliers, and doesn't require a predefined number of clusters. A limitation is the difficulty of handling high-dimensional data. Applications cover anomaly detection (e.g., frauds and defects) and geospatial analysis (e.g., hotspots like crime areas and traffic accident zones). 
    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise): This algorithm requires two key parameters to be specified upfront. Eps (ε) specifies the radius defining the neighborhood around a data point (two data points are considered neighbors if the distance between them is less than or equal to ε). MinPts dictates the minimum number of data points required within the ε-radius to qualify a data point as a core point. For each data point, the algorithm identifies all neighbors within the ε-radius to determine whether it's a core point, border point, or noise. Clusters are formed around core points. The algorithm continues to process unvisited data points until all are assigned to clusters or detected as noise. This algorithm is robust to outliers and noise, but has high parameter sensitivity for ε and MinPts.
  • Connectivity-Based (hierarchical): Groups data points into a tree of nested clusters, also known as a dendrogram, based on a measure of distance linkage. Some common Euclidean distance linkage methods are Min linkage (minimum distance between clusters), Max linkage (maximum distance between clusters), and Average linkage (average distance between all pairs of data points in the clusters). This type of clustering provides a clear representation of the relationships between clusters and data points. However, it's computationally intensive. It is widely used in fields such as bioinformatics (e.g., gene expression analysis) and social sciences.
    • Agglomerative (bottom-up): Repeatedly merges clusters into larger ones until a single cluster remains or a predetermined number of clusters is reached.
    • Divisive (top-down): Repeatedly splits clusters into smaller ones until all clusters are singletons or a predetermined number of clusters is reached.
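
A minimal sketch of K-Means and DBSCAN with scikit-learn, on made-up 2-D points containing two loose groups and one outlier:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Made-up 2-D data points: two loose groups plus one far-away outlier
points = np.array([
    [1.0, 1.2], [0.8, 1.1], [1.3, 0.9],   # group A
    [8.0, 8.1], [8.3, 7.8], [7.9, 8.4],   # group B
    [25.0, 25.0],                          # outlier
])

# Centroid-based: K-Means with a predefined number of clusters (k = 2)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)    # every point gets a cluster, including the outlier

# Density-based: DBSCAN marks low-density points as noise (label -1)
dbscan = DBSCAN(eps=1.5, min_samples=2).fit(points)
print(dbscan.labels_)    # e.g. [0, 0, 0, 1, 1, 1, -1]
```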

Association

Association rule mining identifies rules that describe how variables (features) occur together in a dataset. These rules spot frequent connections and patterns among itemsets (different groups of features). A common use case is in healthcare to find strong association rules between symptoms and diseases for better diagnosis. Another popular use case is market basket analysis to better understand the sale of one product in relation to other products based on customer behavior. For example, a rule may point out that the sale of eggs often co-occurs with the sale of bacon or butter. Retailers can leverage that insight to increase sales with product offers and placement.

Some common association rule mining algorithms:

  • Apriori Algorithm: A bottom-up approach that starts with itemsets of size one. It iteratively expands frequent itemsets one item at a time while removing infrequent itemsets based on the minimum support threshold. It's a simple algorithm, but it can be computationally intensive for large datasets. The following are three key metrics (computed by hand in the sketch after this list):
    • Support: The fraction of transactions in the dataset that contain the itemset.
    • Confidence: The likelihood that item Y appears in transactions containing item X.
    • Lift: Measures how much more likely two items are to occur together compared to occurring independently. A lift greater than one indicates a positive association.
  • Frequent Pattern Growth Algorithm (FP-Growth): It first compresses the dataset into a special structure known as the Frequent Pattern Tree (FP-Tree), which stores information about the itemsets and their frequencies without candidate generation. The FP-Tree is then examined for frequency patterns based on the minimum support threshold. Lastly, the rules and frequent itemsets are generated. This algorithm is more efficient and scalable for large datasets.
  • ECLAT Algorithm (Equivalence Class Clustering and bottom-up Lattice Traversal): Unlike the Apriori algorithm, ECLAT uses depth-first search and stores data in a vertical layout. Each item is linked to a list of transaction IDs used to count the support metric for itemsets. This approach makes it faster and more efficient for datasets with many frequent itemsets.
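
The three Apriori metrics can be computed directly. Here is a hand-rolled sketch on a made-up market-basket dataset (no library required):

```python
# Made-up market-basket transactions
transactions = [
    {"eggs", "bacon", "bread"},
    {"eggs", "bacon"},
    {"eggs", "butter"},
    {"bread", "butter"},
    {"eggs", "bacon", "butter"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """How often the consequent appears in transactions containing the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Greater than 1 means the items co-occur more than if they were independent."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"eggs", "bacon"}))         # 3/5 = 0.6
print(confidence({"eggs"}, {"bacon"}))    # 0.6 / 0.8 = 0.75
print(lift({"eggs"}, {"bacon"}))          # 0.75 / 0.6 = 1.25 (positive association)
```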

Dimensionality Reduction

Dimensionality reduction is a process of lowering the number of dimensions (or features) in a dataset while preserving meaningful information. This can simplify complex datasets while also minimizing redundant features and noise. It's useful for preprocessing data fed into machine learning models and for data visualization purposes.

Some common dimensionality reduction methods (a brief sketch follows the list):

  • Principal Component Analysis (PCA): PCA works by feature extraction, combining and transforming the dataset's original features to create new principal components. There's an ordering for the principal components: the first captures the largest variance in the dataset, the second captures the next largest (orthogonal to the first), and so forth. It is simple and fast, but only effective for linear relationships.
  • t-Distributed Stochastic Neighbor Embedding (t-SNE): Unlike PCA, t-SNE is a non-linear dimensionality reduction technique mainly used for visualizing high-dimensional data in 2D or 3D. It converts the similarity distance between pairs of data points into probabilities and tries to minimize the divergence between the probability distributions in the high-dimensional and low-dimensional spaces. The point positions in the low-dimensional space are iteratively updated to best preserve the relationships in the original high-dimensional space. This technique is good for revealing clusters for datasets with non-linear relationships. However, it is computationally intensive and not suitable for general-purpose dimensionality reduction.
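
A brief sketch of both methods with scikit-learn, on randomly generated stand-in data (purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Randomly generated stand-in for a high-dimensional dataset: 100 samples, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# PCA: linear feature extraction into two ordered principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # variance captured by each component

# t-SNE: non-linear embedding, mainly for 2D/3D visualization (slower)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_pca.shape, X_tsne.shape)       # both reduced to (100, 2)
```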

Semi-Supervised Learning

Semi-supervised learning is a hybrid approach that uses both labeled and unlabeled data for training. It's suitable when there is an abundance of unlabeled data, but it is costly or difficult to manually label it all. It can be used for classification tasks. Some applications include speech analysis (labeling audio files is labor-intensive) and internet content classification (there is a massive number of webpages).

A diagram of semi-supervised learning. Source: https://www.enjoyalgorithms.com/blogs/supervised-unsupervised-and-semisupervised-learning (accessed 07/19/2025)

Semi-supervised learning relies on the following assumptions:

  • Continuity Assumption: Data points that are close together are likely to have the same label.
  • Cluster Assumption: The data points can be organized into discrete clusters, and data points in the same cluster are likely to have the same label.
  • Manifold Assumption: The high-dimensional input data can be represented in a low-dimensional space (called the data manifold). So the labeling of a data point is based on the learned data manifold.

There are two general approaches:

  • Inductive: Seeks to train a classifier that can model the entire input dataset (labeled and unlabeled) and assign correct labels to new data points. First, it uses a supervised classification algorithm to train a base model with the labeled data. Then it uses the partially-trained model to generate pseudo-label predictions for the unlabeled data points. Finally, the model is re-trained using the original labeled data and the pseudo-labeled data.
    • Self-Training: This method produces probabilistic pseudo-label predictions. For example, "75 percent dog, 25 percent cat" rather than just "dog."
    • Co-Training: This method trains multiple base models to assign pseudo-labels. To add diversification, use different supervised classification algorithms for each model. Or allow each model to focus on different subsets of the dataset.
  • Transductive: Aims to produce label predictions for the unlabeled data only. It doesn't develop a general rule for unseen data points.
    • Label Propagation: This is a graph-based algorithm that assigns labels to the unlabeled data points based on their relative similarity or connectivity to the labeled data points (leveraging the continuity and cluster assumptions); see the sketch after this list.
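
A minimal sketch of label propagation with scikit-learn, where -1 marks the unlabeled points in a tiny made-up dataset:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Tiny made-up dataset: only 2 of 6 points are labeled; -1 marks unlabeled points
X = np.array([[1.0], [1.2], [1.1], [8.0], [8.3], [7.9]])
y = np.array([0,     -1,    -1,    1,     -1,    -1])

model = LabelPropagation()
model.fit(X, y)              # propagate labels over the similarity graph

print(model.transduction_)   # inferred labels for all points, e.g. [0 0 0 1 1 1]
```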

Self-Supervised Learning

Self-supervised learning is a newer approach to machine learning. Similar to unsupervised learning, it is only provided with unlabeled data. Yet, it is used for traditional supervised tasks, such as classification and regression, that require comparing the models' predictions with ground truths. Instead of being given the ground truths as labeled data, the models generate their own pseudo-labels inferred from the unlabeled data. During training, the models perform pretext tasks on the unlabeled data for representation learning (finding intrinsic correlations and patterns). The models use gradient descent to minimize the error between the predictions and pseudo ground truths. The ground truths may be updated during the iterations. Afterwards, the trained models are typically fine-tuned via supervised learning for downstream tasks (actual applications) to improve accuracy.

A diagram of self-supervised learning. Source: https://www.geeksforgeeks.org/machine-learning/self-supervised-learning-ssl/ (accessed 07/19/2025)

This approach is applicable to fields where large amounts of labeled data are difficult to obtain due to high cost and time. For example, computer vision and natural language processing (NLP). However, self-supervised learning requires substantial compute power to train models on large datasets.

There are two main techniques:

  • Self-Predictive Learning: Trains models to "fill in the blanks" by predicting one part of the input from known information about the other parts. For instance, a computer vision model is provided with the top half of an image and asked to generate the bottom half. In NLP, a model might need to predict a masked word in an input sentence (a toy sketch of this pretext-task setup follows the list).
  • Contrastive Learning: The models are provided with data pairs and tasked with distinguishing between similar and dissimilar items. These data pairs are usually created via data augmentation, in which transformations are applied to the unlabeled data to create new instances. For example, image data might be augmented via rotation, cropping, and coloring. Through data augmentation, the models are exposed to more variability and perspectives for representation learning.
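
As a toy illustration of the self-predictive ("fill-in-the-blank") setup, the sketch below masks one word per sentence and keeps that word as the pseudo-label a model would be trained to predict:

```python
import random

# Made-up unlabeled text data
sentences = [
    "the cat sat on the mat",
    "machine learning models learn from data",
]

random.seed(0)
pretext_examples = []
for sentence in sentences:
    words = sentence.split()
    masked_index = random.randrange(len(words))
    target = words[masked_index]          # pseudo-label derived from the data itself
    words[masked_index] = "[MASK]"
    pretext_examples.append((" ".join(words), target))

for masked_sentence, target in pretext_examples:
    print(masked_sentence, "->", target)
```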

Reinforcement Learning

Reinforcement learning trains an agent to make decisions through interaction with an environment. The Markov decision process (MDP) lays out the relationship between the agent and its environment. The agent interacts with its environment by observing the current state and taking possible action(s). It then receives a reward (or penalty) signal and the updated state. Through trial and error, the agent learns which action(s) to take to achieve a specified goal.

The agent discovers strategies for choosing actions to obtain the optimal cumulative reward. With the value-based approach, the agent learns a value function that estimates how rewarding a specific state or particular action might be. In contrast, the policy-based approach doesn't rely on explicit value estimates, but rather learns the policy directly (a mapping of states to actions). The policy-based approach is more scalable, while the value-based approach is more sample-efficient (it learns effectively from a small number of examples).

Also, the agent must balance the exploration-exploitation trade-off, deciding to explore the environment more or just pick from known rewarded actions.

A diagram of the Markov decision process in reinforcement learning. Source: https://en.wikipedia.org/wiki/Reinforcement_learning (accessed 07/19/2025)

Model-Based

With the model-based reinforcement learning approach, the agent first constructs an internal representation (or model) of its environment. This is suitable for well-defined and stable environments. For example, consider a vacuum robot learning to navigate a new house. At first, the robot roams freely to explore and build an internal map of the house. Afterwards, it can use that map to plan optimal cleaning paths.

Some model-based algorithms:

  • Dyna: The agent learns from real-world data and simulated experiences via a learned model. This hybrid approach allows for sample efficiency by augmenting limited real-world data with ample simulated data. It's applicable for robotics, autonomous driving, financial trading, and any task that can be optimized by both real and simulated data.
  • PILCO (Probabilistic Inference for Learning Control): Utilizes probabilistic models (typically Gaussian processes) to model dynamics, accounting for uncertainty in the planning and optimization of policies. This method is suitable for continuous control tasks and when real-world trials are costly or risky. It's also extremely sample-efficient.
  • Dreamer: This algorithm builds a latent dynamics model from pixels, meaning a compressed predictive model from raw visual data. Policy learning and optimization occur within this simplified space. Dreamer is scalable for high-dimensional input and appropriate for vision-based control tasks.

Model-Free

With the model-free reinforcement learning approach, the agent doesn't construct an internal representation (or model) of its environment. It learns the value of actions or develops a set of policies through direct interactions. This is suitable for environments that are complex, large, or changing. For example, consider a self-driving car learning to navigate a new city. The environment is dynamic and complex due to traffic conditions, pedestrian behaviors, and road systems. The car is first trained in a virtual environment to develop its values or policies. Once released into the physical environment, it continues to update them with new data.

Some model-free algorithms (a minimal Q-learning sketch follows the list):

  • Q-Learning: A widely used value-based algorithm that maintains a Q-table where each entry is an estimate of the expected long-term reward for the specific state-action pair. The table is updated using the Temporal Difference (TD) rule. It's simple and effective for discrete action spaces.
  • Policy Gradient (PG): A class of policy-based algorithms, where the policy determines the probability of taking each possible action for a specific state. The policy parameters are optimized using a gradient. PG is applicable for continuous action spaces.
    • Monte Carlo Policy Gradient (REINFORCE): A classic, straightforward technique that uses Monte Carlo sampling, collecting full trajectories (sequences of states, actions, and rewards), to compute the gradient of the expected reward. The policy is updated to make good actions more likely in the future.
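
A minimal Q-learning sketch on a made-up five-state corridor environment (the agent starts at state 0 and is rewarded only for reaching state 4):

```python
import numpy as np

n_states, n_actions = 5, 2                  # states 0..4; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))         # Q-table of expected long-term rewards
alpha, gamma, epsilon = 0.1, 0.9, 0.3       # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(300):
    state = 0
    while state != 4:                        # episode ends at the goal state
        # Exploration-exploitation trade-off: occasionally try a random action
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Temporal Difference (TD) update rule
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))   # greedy policy: states 0-3 should favor "right" (1)
```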

Summary

In machine learning, the models are trained on data that is represented as numerical or categorical. Depending on the data available (labeled and unlabeled), task goals, and resources (compute, time, and money), the machine learning model will be trained using a specific learning mode. Again, the learning modes are supervised, unsupervised, semi-supervised, self-supervised, and reinforcement learning. Under each mode are common algorithms and techniques, each with its own pros and cons.

ABOUT THE AUTHOR

Diana Cheung (ex-LinkedIn software engineer, USC MBA, and Codesmith alum) is a technical writer on technology and business. She is an avid learner and has a soft spot for tea and meows.

Diana Cheung
Alumna, Software Engineer
