This one is based on a project I recently published, called ClusterIdentities. It lets you upload a number of photos of people. Say you go on a trip and want to send each person only the photos they appear in – what do you do? You upload those photos to ClusterIdentities, and it clusters them and makes a folder for every person. You can also drag and drop photos across folders in the UI itself, then hit download to grab all the folders and send them.

It looks like this –

You can see different people clustered in different folders Person_0, Person_1, Person_2.

This blog will focus on understanding what is happening behind the scenes.
Let's start.

System Pipeline

The pipeline is simple and consists of 6 major steps –

Image Upload

Face Detection

Face Alignment

Face Embedding Generation

Embedding Clustering (DBSCAN)

Grouped Photo Output

System Architecture

I am using FastAPI for the backend, React for the frontend, and several libraries for face processing and clustering, which we will look at once we get into the technical stuff.

React Frontend

FastAPI Backend

Face Processing Pipeline

Output Clusters

Now that we know what we are dealing with and how, let us move on to the fun stuff: what happens after you upload the images. We will go through each step of the pipeline until everything is clear, without getting bored (which means not going too deep into the mathematics :P).

Face Detection

Well, obviously the first step is to detect faces in the photos. For this I used the InsightFace library, which internally uses the SCRFD (Sample and Computation Redistribution for Efficient Face Detection) model to locate faces and return bounding box coordinates.

Example: a 1080 x 1920 resolution photo can be represented as I[1080][1920][3],
i.e. Height x Width x 3, since an image is a 3D array of numbers.
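As a quick sanity check, here is what that array looks like in NumPy (assuming the usual height × width × channels layout that OpenCV and similar libraries use):

```python
import numpy as np

# A 1080 x 1920 RGB photo as a 3-D array: height x width x 3 channels
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(image.shape)  # (1080, 1920, 3)
print(image[0, 0])  # one pixel = 3 numbers (one per color channel)
```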

For face detection, a deep neural network analyzes the image and predicts:

– Bounding boxes around potential faces
– Detection confidence score (probability the region contains a face)
– Facial landmarks (eyes, nose, mouth corners)

So what is happening inside this neural network –

1. Feature Extraction – The image is processed through a Convolutional Neural Network (CNN). In one of my early blogs I designed a neural network from scratch, without any libraries, that takes an image of a digit as input and returns which digit it is. That blog will show you how neural networks are built and their architecture – https://vaibhavshrivastava.com/digit-recognizer-building-neural-network-from-scratch-only-numpy-and-pandas/
The result is a feature map representing visual structures useful for detection.
So, here is an overview of the layers of this neural network –

Layer depth – what it detects:
Early layers – edges, gradients
Middle layers – shapes, textures
Deep layers – facial parts (eyes, nose, mouth)

2. Multi-Scale Face Detection – Now, an image can contain tens or even hundreds of faces (suppose you clicked a portrait but there are multiple people in the background). Faces can be small, big, or partial. We have to remember we are dealing with a machine here, and we have to tell it exactly what we want.

Modern detectors like SCRFD or RetinaFace use a Feature Pyramid Network (FPN).

What this does is build a pyramid of feature maps and handle faces at each scale –
Feature Pyramid
├── Large faces
├── Medium faces
└── Small faces

Each level of the pyramid predicts possible faces.

3. Bounding Box Prediction – Now the model draws boxes around the faces: the network predicts both the box location and the face probability.

B = (x1, y1, x2, y2) – Bounding box
(x1, y1) – top-left corner; (x2, y2) – bottom-right corner

The model predicts adjustments using bounding box regression:
B′=B+ΔB

If you do not understand what regression is, you can read one of my blogs on it – https://vaibhavshrivastava.com/how-linear-regression-model-actually-works/

This refines the box to tightly fit the face.
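A minimal sketch of that refinement, with made-up numbers for the predicted deltas (real detectors predict offsets relative to anchors in a normalized form, but the principle is the same):

```python
# Coarse box B = (x1, y1, x2, y2) and predicted adjustments dB (illustrative values)
box = (100.0, 120.0, 220.0, 260.0)   # initial proposal
deltas = (4.0, -6.0, -10.0, 8.0)     # network-predicted corrections

# B' = B + dB: shift each coordinate by its predicted offset
refined = tuple(b + d for b, d in zip(box, deltas))
print(refined)  # (104.0, 114.0, 210.0, 268.0)
```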

4. Face Classification – For each predicted region, the network outputs a confidence score.
This score is computed using a sigmoid activation: σ(x) = 1 / (1 + e^(−x))

You can read about this more from one of my blogs on classification and logistic regression here – https://vaibhavshrivastava.com/logistic-regression-from-basics-to-code/

The sigmoid activation function is a mathematical function that maps any real-valued number into a range between 0 and 1. It is then up to us to set the threshold for what we accept and what we don't.
Example: only if
P(face) > threshold
is the region accepted as a valid detection.
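A tiny illustration of the sigmoid-plus-threshold logic (the logit value and the 0.5 threshold are just example numbers):

```python
import math

def sigmoid(x: float) -> float:
    # maps any real-valued number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

threshold = 0.5
logit = 2.0                  # raw network output for one candidate region
p_face = sigmoid(logit)

print(round(p_face, 3))      # 0.881
print(p_face > threshold)    # True -> region accepted as a face
```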

5. Facial Landmark Prediction – Modern face detectors also estimate keypoints. This is useful for things like aligning or swapping faces, you get the idea.

Typical landmarks:

left_eye
right_eye
nose
mouth_left
mouth_right

These are predicted as coordinates: L_i = (x_i, y_i)

Landmarks allow face alignment in the next pipeline stage.

6. Non-Maximum Suppression (Removing Duplicate Detections) – It sounds complex but it is very simple. Just pick a threshold above which an overlapping box counts as a duplicate, and measure the overlap using Intersection over Union (IoU).

IoU(B1, B2) = area(B1 ∩ B2) / area(B1 ∪ B2)

B1 and B2 being the 2 bounding boxes.

If overlap is high:
IoU > threshold
the lower-confidence box is removed.
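Here is a minimal, self-contained sketch of IoU and greedy NMS (the box coordinates and the 0.5 threshold are illustrative):

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2); intersection area over union area
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.5):
    # keep the highest-scoring box, drop any remaining box that overlaps it too much
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep

boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)]
scores = [0.95, 0.80, 0.90]
print(nms(boxes, scores))  # [0, 2] -> the duplicate box 1 is removed
```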

Any project dealing with faces or identities needs this step done accurately. Incorrect face detections produce bad embeddings, and the system fails before it even gets started. Taking care of the above gets you the best embeddings possible to work with in the next steps of whatever you are building.

Face Embeddings (Identity Representation)

We touched on embeddings at the end of the previous section, so let's get into them.
Once faces are detected and aligned, the system must answer a harder question: How do we represent a face in a way that captures identity?
The solution is face embeddings — a numerical representation of a face such that:

Same person – embeddings are close
Different people – embeddings are far apart

What is a face embedding?

A face embedding is a fixed-length vector produced by a deep neural network.

Example:
Face → [0.12, -0.44, 0.88, …, 512 values]
This vector encodes identity-specific features, not raw pixels.

In this project, I used the ArcFace model. We should understand that when it comes to these libraries and models, it's all numbers, just numbers arranged in a certain way based on the information.

Normalization – Embeddings are normalized to lie on a unit hypersphere: f̂ = f / ‖f‖

Why?

Removes magnitude differences
Focuses only on direction (identity)

This means:

You divide the vector by its length (magnitude)
The new vector has length = 1

Intuition (Very Important) – Remember, intuition is where the magic happens (or happened).

Think of each embedding as an arrow in space.

Before normalization:

Vector A → long arrow
Vector B → short arrow
Vector C → medium arrow

After normalization:

All arrows → same length (1)
Only direction matters.

Why Magnitude Is a Problem

Without normalization, two embeddings can differ in:

  1. Direction (identity) ✅ important
  2. Magnitude (irrelevant noise) ❌ not important

Magnitude can vary due to factors unrelated to identity, such as lighting or image quality.

Example:

Same person:
f1 = [10, 2, 1]
f2 = [5, 1, 0.5]
Different magnitude, same direction

These should be treated as same identity.

After Normalization

f̂ = f / ‖f‖

Now:

f1 → [0.976, 0.195, 0.098]
f2 → [0.976, 0.195, 0.098]

They become identical.

Normalization removes magnitude variations and ensures that identity is represented purely by the direction of the embedding vector.
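The worked example above, in NumPy (`normalize` here is a plain L2 normalization written out by hand, not a specific library call):

```python
import numpy as np

def normalize(v):
    # divide the vector by its length so it lands on the unit hypersphere
    return v / np.linalg.norm(v)

f1 = np.array([10.0, 2.0, 1.0])
f2 = np.array([5.0, 1.0, 0.5])   # same direction, half the magnitude

n1, n2 = normalize(f1), normalize(f2)
print(np.round(n1, 3))           # [0.976 0.195 0.098]
print(np.allclose(n1, n2))       # True -> treated as the same identity
```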

If embeddings are not normalized:

Same person → might look far apart
Different people → might look close

After normalization:

Same person → tightly grouped
Different people → clearly separated

Note: A hypersphere is the n-dimensional generalization of a sphere, representing the set of points equidistant from a central point in n-dimensional space.

To understand it better, let us take Euclidean distance –
Case 1: Same Person

f1 = [10, 2]
f2 = [5, 1]

These are: different magnitude, same direction
Euclidean distance:

d(f1, f2) = √((10 − 5)² + (2 − 1)²) = √26 ≈ 5.1

This looks far apart (wrong) – it should be similar.

Case 2: Cosine Similarity

These vectors are proportional → angle = 0

cos(θ) = (f1 · f2) / (‖f1‖ ‖f2‖) = 52 / (√104 · √26) = 52 / 52 = 1

Correctly identifies same identity
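Both cases side by side in NumPy, using the same f1 and f2:

```python
import numpy as np

f1 = np.array([10.0, 2.0])
f2 = np.array([5.0, 1.0])    # same direction, different magnitude

# Case 1: straight-line distance between the raw vectors
euclidean = np.linalg.norm(f1 - f2)

# Case 2: cosine of the angle between them
cosine = f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2))

print(round(euclidean, 2))   # 5.1 -> looks "far apart"
print(round(cosine, 2))      # 1.0 -> same direction, same identity
```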

Why Cosine Specifically? – I asked this myself when building the project

Cosine similarity directly measures cos(θ), where θ is the angle between the two embedding vectors.

Interpretation

cos(θ) = 1   → identical direction (same person)
cos(θ) = 0 → orthogonal (unrelated)
cos(θ) < 0 → opposite (very different)

Try to picture the angles above (0°, 90°, and so on).

I also asked whether I could use sin or tan to calculate this, but no – sin can be used, but you have to flip the interpretation (vectors 90° apart give 1).
Tan is undefined at 90°.

Cosine → measures “how same”
Sine → measures “how different”
Tangent → unstable chaos

This is just for curiosity which I had while building it.

Now we know why embeddings are important and what happens and how we produce them from images.
face.embedding in ArcFace does this –

Aligned Face

Deep Neural Network (ArcFace)

512-D Identity Vector

Everything after this step depends on embeddings.

If embeddings are good:

Clustering becomes easy
Accuracy becomes high
System becomes scalable

If embeddings are bad:

Different people mix together
Same person splits into multiple clusters

If we put these embeddings in a database, use vector indexing for searches, and perform clustering at bigger scales, we could have a big system like Google Photos, I think and hope :P.

Once each face is converted into a numerical identity vector, the next step is to group similar faces together. Since we do not know how many unique individuals are present, we use an unsupervised clustering algorithm to automatically discover these groups.

Lets move on to the next step!

Clustering

At this point, we have each face as a 512-D embedding vector, but we don't know how many people are present or which faces belong together.

This looks like an unsupervised learning problem, doesn't it? Since the number of unique individuals is unknown beforehand, it becomes an unsupervised clustering problem.

Problem Definition –
Given:
– N face embeddings

Goal:
– Group embeddings belonging to same identity

Note: We are not using K-Means here, since it requires a predefined number of clusters (K) and real-world photo sets contain an unknown number of people.

We use DBSCAN for clustering, but why?

DBSCAN:
– does not require number of clusters
– groups based on density
– can handle noise

How does DBSCAN work?

  1. Pick a point
  2. Find neighbors within distance (eps)
  3. If enough neighbors → form cluster
  4. Expand cluster

The similarity metric, as seen in the previous section, is cosine similarity.

We can set the similarity threshold (eps), the minimum samples required to form a cluster, and several other options, which you can inspect in the code, and of course dig into the library itself to learn its functions.
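Here is a minimal sketch using scikit-learn's DBSCAN (the eps and min_samples values are illustrative, not the project's exact settings). Note that with metric="cosine", eps is a cosine distance, i.e. 1 − similarity:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D "embeddings" (real ones are 512-D), one row per detected face
emb = np.array([
    [1.00, 0.00], [0.99, 0.14], [0.98, 0.17],   # person A
    [0.00, 1.00], [0.14, 0.99], [0.17, 0.98],   # person B
    [0.71, 0.71],                                # lone face
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit hypersphere

# eps = max cosine distance to count as a neighbor;
# min_samples = points needed (including itself) to form a dense region
labels = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit_predict(emb)
print(labels)  # cluster id per face; -1 marks noise (the lone face)
```

Each non-negative label becomes one Person_N folder in the output.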

This will be the last step for the machine learning part.

Conclusion

This project is a practical example of how modern AI systems bridge the gap between raw data and meaningful insights.

Here are the relevant links –
Regression working (very important for beginners) – https://vaibhavshrivastava.com/how-linear-regression-model-actually-works/
Logistic regression (classification problems) – https://vaibhavshrivastava.com/logistic-regression-from-basics-to-code/
Calculus basics and understanding (highly recommended for understanding) – https://vaibhavshrivastava.com/understanding-calculus-and-derivations-part-1-differential-calculus-imagine-solve/

Project links –

Github repository – https://github.com/INNOMIGHT/photo-identity-sorter
Project URL – Try it yourself – https://innomight.github.io/photo-identity-sorter/

This blog took time, but I hope it makes it easy to understand how things work.

Keep learning, see you in the next one!