IFM develops a Marketplace for mid-sized enterprises based on digital technologies

How Machine Learning Truly Applies to Digital Identity

Two broad categories of machine learning models are clustering and classification. Each has its pros and cons, and the advertising ecosystem has predominantly embraced clustering to solve the problem of identity.

These applied technologies matter well beyond the advertising industry. They affect all areas of entrepreneurial activity and are instruments that support entrepreneurs in their challenging decisions in times of globalisation and digitisation.


Simply put, clustering is the grouping of similar data, where the data in one group is more closely related to each other than to the data in other groups. With this approach, analogous consumer activities are grouped based on the association between data fields. Data fields such as IP address, user agent, location, time and content consumed are often used to do this.
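The grouping described above can be sketched in a few lines. This is a minimal illustration and not any vendor's actual method: the identifiers, the field values, and the rule "link two identifiers when they agree on at least two fields" are all assumptions made for the example. The technique is a simple threshold-based linkage using a union-find structure.

```python
from itertools import combinations

# Hypothetical observations: one record per advertising identifier.
EVENTS = {
    "cookie_a": {"ip": "203.0.113.7", "user_agent": "Firefox/Linux", "location": "Berlin"},
    "cookie_b": {"ip": "203.0.113.7", "user_agent": "Safari/iOS", "location": "Berlin"},
    "cookie_c": {"ip": "198.51.100.4", "user_agent": "Chrome/Windows", "location": "Madrid"},
    "device_d": {"ip": "198.51.100.4", "user_agent": "Chrome/Windows", "location": "Madrid"},
    "cookie_e": {"ip": "192.0.2.55", "user_agent": "Firefox/Linux", "location": "Oslo"},
}


def shared_fields(a, b):
    """Count the data fields on which two observations agree."""
    return sum(1 for key in a if key in b and a[key] == b[key])


def cluster(events, min_shared=2):
    """Group identifiers whose records agree on at least min_shared fields.

    min_shared is an assumed threshold chosen for this example.
    """
    parent = {name: name for name in events}

    def find(name):
        # Follow parent links to the group representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair of identifiers that looks sufficiently similar.
    for left, right in combinations(events, 2):
        if shared_fields(events[left], events[right]) >= min_shared:
            parent[find(left)] = find(right)

    # Collect identifiers by their group representative.
    groups = {}
    for name in events:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda group: sorted(group))


clusters = cluster(EVENTS)
for group in clusters:
    print(sorted(group))
```

In this toy data, cookie_a and cookie_b end up together only because they share an IP address and a location, so the grouping is only as stable as those two fields.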

Even though the approach is scientific and in some cases highly complicated, it is extremely difficult to overcome the flaws of the individual data fields themselves. For instance, an IP address is inherently unreliable as a data field over the long term because it can change. Identifiers clustered in this model, in particular cookies, are also largely unreliable.

Though clustering solves part of the cross-device identity problem, it falls short for a large portion of the device and consumer population.


Classification, on the other hand, is the process of assigning a label to a new observation, based on a fact-based training data set of observations whose labels are already known. This method can be used to build a predictive model that associates multiple advertising identifiers with one consumer or household.

Using classification-based models for identity resolution involves managing a statistically relevant training set, which contains observations from groups of devices known to be linked to the same consumer and/or household.
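As a rough sketch of how such a training set feeds a classifier, the example below labels a new pair of identifiers by majority vote among its nearest labelled neighbours (k-nearest neighbours). The article does not name a specific algorithm, so this choice is an assumption, as are the three features and every number in the training data.

```python
import math
from collections import Counter

# Hypothetical training set. Each row describes a PAIR of identifiers:
#   (share of sessions on the same IP,
#    share of sessions in the same location,
#    overlap in content consumed)
# Label 1 means the pair is known to belong to the same household, 0 means it does not.
TRAINING = [
    ((0.9, 1.0, 0.7), 1),
    ((0.8, 1.0, 0.4), 1),
    ((0.7, 0.9, 0.6), 1),
    ((0.1, 0.2, 0.3), 0),
    ((0.0, 1.0, 0.1), 0),
    ((0.2, 0.0, 0.5), 0),
]


def classify(features, training=TRAINING, k=3):
    """Label a new pair by majority vote among its k closest training rows."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


same_household = classify((0.85, 1.0, 0.5))
different_household = classify((0.1, 0.1, 0.2))
print(same_household, different_household)
```

The quality of the prediction depends entirely on the labelled rows: with too few or outdated known pairs, the nearest neighbours stop being representative.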

Having a thorough training set is easier said than done. Imagine collecting training data from your favourite travel website. A consumer might log into the site from one or two separate devices to research vacation ideas and deals. It’s highly likely that the same consumer will also visit the site anonymously to get quick information such as airport details or in-flight movie choices. If that consumer only travels twice a year, data is not collected frequently enough for it to be included in a thorough training set. Classification models tend to work better for identity resolution if the thoroughness and freshness of the training data can be maintained at all times.
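The travel-site scenario can be made concrete with a small filter that keeps a consumer in the training set only if enough recent, logged-in observations exist. This is an illustrative sketch: the login records, the 90-day freshness window and the two-device minimum are assumptions, not figures from any real system.

```python
from datetime import date, timedelta

# Hypothetical log of authenticated visits: (consumer, device, date of visit).
LOGINS = [
    ("frequent_flyer", "phone", date(2024, 6, 20)),
    ("frequent_flyer", "laptop", date(2024, 6, 2)),
    ("twice_a_year", "phone", date(2024, 6, 25)),
    ("twice_a_year", "laptop", date(2024, 1, 10)),
]


def usable_consumers(logins, today, max_age_days=90, min_devices=2):
    """Return consumers with at least min_devices devices seen within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    recent = {}
    for consumer, device, seen in logins:
        # Anonymous visits never reach this log, and stale logins are skipped here.
        if seen >= cutoff:
            recent.setdefault(consumer, set()).add(device)
    return {c: devices for c, devices in recent.items() if len(devices) >= min_devices}


training_candidates = usable_consumers(LOGINS, today=date(2024, 6, 30))
print(training_candidates)
```

Here the occasional traveller drops out because the laptop login is older than the window, which leaves only one recent device and therefore no device pair to learn from.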