Data Readiness: Lessons from the Field for Machine Learning Data Prep

The field of machine learning (ML) is not new, yet marketers are still discovering new ways to apply ML methods to their large, complex and expanding data sets. Demand for data science talent continues to grow, but the problems of collecting and normalizing clean, meaningful data for machine learning are snowballing faster than most firms can respond. For brands to take advantage of this avalanche of artificial intelligence functionality, they must first put in place a data foundation that is ready for the task. That requires marketers, data scientists, and engineers or developers to work together.

Our products’ unique positioning within our clients’ organizations has given Tealium insight into each step of the machine learning process. We’ve developed the Universal Data Hub to produce ML-ready data in real time, which is one of our strategic mandates. Here are some of the lessons we have learned across the lifecycle of an ML project, based on our work with many clients.

The five steps below represent a typical machine learning project lifecycle:

  1. Data collection
  2. Data normalization
  3. Data modeling
  4. Model training
  5. Deploying to production

[Figure: machine learning data readiness process]

Data Collection

Preparing your customer data for meaningful ML projects can be a daunting task given the sheer number of disparate data sources and data silos that might exist in your organization. To build an accurate model, select data that is likely to be predictive of your target: the outcome you hope the model will predict from the other input data. For us, this goes beyond reacting to abandoned carts or recommending a favorite category. Rather than taking last actions and extrapolating or averaging them, the goal is to actually predict what a customer will do next.

For a consumer brand, desirable input data may include web activity, mobile app data, historical purchase data, and customer support interaction data. Traditionally, these data sources have been difficult to access or to configure for collection. Tealium’s UDH expedites deploying data collection on new platforms or devices, giving brands a central point of collection across many data sources.

Data Normalization

Our next step in the ML process is where analysts and data scientists typically spend most of their time on analysis projects: cleaning and normalizing dirty data. Often this requires data scientists to make decisions about data they don’t fully understand, such as how to handle missing or incomplete data and outliers.
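As a rough illustration of the kind of decision described above, the sketch below fills missing values with the median and clips outliers to interquartile-range fences. Both choices are assumptions made for the example, as are the function name `clean_numeric` and the sample order values; the right treatment depends on what the data actually means.

```python
from statistics import median, quantiles


def clean_numeric(values, k=1.5):
    """Fill missing values with the median, then clip outliers to IQR fences.

    `values` is a list of numbers where None marks a missing observation.
    Median imputation and IQR clipping are illustrative choices only.
    """
    present = [v for v in values if v is not None]
    if len(present) < 2:
        raise ValueError("need at least two observed values")

    fill = median(present)
    imputed = [fill if v is None else v for v in values]

    # Quartiles are computed from observed values only.
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread

    # Pull any value outside the fences back to the nearest fence.
    return [min(max(v, low), high) for v in imputed]


# Hypothetical order values: one missing entry and one extreme outlier.
raw = [10, 12, None, 11, 13, 12, 11, 10, 13, 500]
cleaned = clean_numeric(raw)
```

Here the missing entry takes the median of the observed values and the 500 is pulled down to the upper fence, while the ordinary values pass through unchanged.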

What’s worse, this data may not be easily correlated to the proper unit of analysis: your customer. To predict whether a single customer will churn (not a segment or whole audience), for example, you can’t rely on siloed data from disparate sources. Your data scientist must prepare and aggregate all of the data from those sources into a format that ML models can interpret. That’s a lot of work before any ML can occur.


At this point we should pause and mention two core features of Tealium’s Universal Data Hub that illustrate how these challenges can be solved: Data Sources and Event Specifications. Together, these features set the foundation for data collection and normalization. In a matter of minutes, developers and analysts can test and verify that their data is clean and meets expectations before writing a single query.

Tealium has a marketplace of common Event Specifications to configure with your Data Sources, or you can even add your own. These capabilities offer analysts and data scientists a quick and easy way to inspect data and gain familiarity with the business challenge they wish to address with machine learning, while saving time and automating the data collection and normalization process.

Data Modeling

The next phase of our hypothetical ML project is to model the data we wish to use for prediction. Part of modeling data for prediction about customers is to combine disparate data sets to paint a proper picture of a single customer. This includes blending and aggregating silos of data like web, mobile app and offline data.

Take the following three examples of data from a single customer across three different channels: the desktop website, a mobile app and in-store transactions.

To summarize the above data, we might make the following derived data points that represent customer-level behavior:

  • First Activity: 9/14/2016
  • Last Activity: 9/27/2016
  • Lifetime Web Visit Count: 2
  • Lifetime Mobile Visit Count: 1
  • Lifetime Transactions Count: 2
  • Lifetime Value: $172.24
  • Favorite Category: Shoes

These are just a handful of meaningful derived data points that a brand might define for a customer. But this is no easy feat. Consider when the brand sees this customer again in the future and wants to predict, for example, the customer’s probability of converting at the start of their next web session… in real time.

Aggregating input data and accounting for cross-device visitor behavior requires both strong domain knowledge of the subject matter and the technical competency to manage the data transformation accurately.
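The aggregation step can be sketched in a few lines of Python. The event records below are hypothetical: the per-channel example data is not reproduced in this text, so the records were invented purely so that they roll up to the summary values listed above. The field names and the `summarize` function are likewise assumptions, not Tealium’s data model.

```python
from collections import Counter
from datetime import date

# Hypothetical events for one customer (invented for illustration only).
events = [
    {"channel": "web", "date": date(2016, 9, 14), "category": "Shoes", "amount": None},
    {"channel": "web", "date": date(2016, 9, 18), "category": "Shoes", "amount": None},
    {"channel": "mobile", "date": date(2016, 9, 21), "category": "Jackets", "amount": None},
    {"channel": "store", "date": date(2016, 9, 23), "category": "Shoes", "amount": 100.00},
    {"channel": "store", "date": date(2016, 9, 27), "category": "Shoes", "amount": 72.24},
]


def summarize(events):
    """Roll event-level records up into customer-level derived data points."""
    dates = [e["date"] for e in events]
    purchases = [e for e in events if e["amount"] is not None]
    categories = Counter(e["category"] for e in events)
    return {
        "first_activity": min(dates),
        "last_activity": max(dates),
        "web_visits": sum(e["channel"] == "web" for e in events),
        "mobile_visits": sum(e["channel"] == "mobile" for e in events),
        "transactions": len(purchases),
        # Round to cents to avoid floating-point noise in the total.
        "lifetime_value": round(sum(e["amount"] for e in purchases), 2),
        "favorite_category": categories.most_common(1)[0][0],
    }


profile = summarize(events)
```

The sketch assumes all events already belong to one known customer. Deciding which events belong together across devices is the harder problem.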

To illustrate how we solve the challenge, Tealium AudienceStream Customer Data Platform provides visit- and visitor-level enrichment, allowing marketers (in tandem with developers and data scientists or analysts) to define the business rules for these aggregate data points. One or more of those data points may be leveraged for Visitor Stitching, our patented method for merging visitor profiles across devices in real time. The result of these capabilities (in conjunction with Data Sources and Event Specifications) is a clean and correlated single view of the customer, providing the robust data foundation required for machine learning.

Model Training and Feature Engineering

After a brand has deployed collection and enrichment of meaningful input data, it’s time to put the predictive power of that data to the test. To do so, data scientists take a representative sample of the population (for example, all customers, anonymous visitors, or known prospects) and set aside a portion for training models. The remainder is used to validate the models after training is complete.
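A minimal sketch of that split, using only the Python standard library, might look like this. The 80/20 ratio, the fixed seed, and the function name are assumptions chosen for the example.

```python
import random


def train_validation_split(records, train_fraction=0.8, seed=42):
    """Shuffle records reproducibly and split them into two disjoint sets."""
    shuffled = list(records)
    # A seeded generator makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical population of 100 visitor IDs.
visitor_ids = list(range(100))
train_ids, validation_ids = train_validation_split(visitor_ids)
```

Splitting by visitor rather than by event keeps all of one customer’s activity on the same side of the split, so the validation set stays unseen.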

A key component of this phase is to iterate rapidly, continuously testing new data points that can be derived from the data source. This process is called feature engineering.

To continue with the earlier example, we may test the following engineered features:

  • Customer Age in Days (difference between first and last activity date): 13
  • Average Order Value (lifetime value divided by total transactions): $86.12
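Both engineered features follow directly from the aggregate data points listed earlier. The sketch below reproduces the arithmetic; the function and key names are assumptions for illustration.

```python
from datetime import date


def engineer_features(first_activity, last_activity, lifetime_value, transactions):
    """Derive two customer-level features from aggregate data points."""
    # Customer age: days between the first and last recorded activity.
    age_days = (last_activity - first_activity).days
    # Average order value; guard against customers with no transactions.
    aov = round(lifetime_value / transactions, 2) if transactions else 0.0
    return {"customer_age_days": age_days, "average_order_value": aov}


features = engineer_features(
    first_activity=date(2016, 9, 14),
    last_activity=date(2016, 9, 27),
    lifetime_value=172.24,
    transactions=2,
)
# features == {"customer_age_days": 13, "average_order_value": 86.12}
```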

These attributes and others can be easily calculated from the aggregate visitor data, giving data scientists a quick way to iterate on training models and compare accuracy. Tealium’s DataAccess products (for example, AudienceStore) provide a seamless way to export visit and visitor data, in real time, for training your ML models. If an engineered feature is thought to be beneficial, users can add new AudienceStream attributes in a matter of minutes with the flexible enrichment options and attribute data types. This is how technology can enable data scientists to produce more and better insights instead of repeatedly completing mundane data management tasks by hand.

Deploying Models to Production

All of our work to this point culminates in the final step of deploying a model to production, where we test our ability to predict outcomes in the real world. By this point, models should meet some threshold of accuracy that warrants deploying them to production. For this reason, it’s important to interpret model performance with stakeholders and agree on what level of inaccuracy is an acceptable risk. Some customer behaviors may not be sufficiently predictable, and thus a model may never achieve enough accuracy to justify deploying to production.
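One simple way to make that agreement explicit is to encode the threshold as a check against held-out validation results. The sketch below uses plain accuracy for illustration; the labels, the threshold values, and the function names are hypothetical, and other metrics may suit a given problem better.

```python
def accuracy(actual, predicted):
    """Fraction of validation examples where the prediction matches the outcome."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def ready_for_production(actual, predicted, agreed_threshold):
    """True when validation accuracy meets the threshold agreed with stakeholders."""
    return accuracy(actual, predicted) >= agreed_threshold


# Hypothetical validation labels (1 = converted) and model predictions.
actual = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

score = accuracy(actual, predicted)  # 8 of 10 correct
deploy = ready_for_production(actual, predicted, agreed_threshold=0.75)
```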

With the right technology, teams can quickly establish ML viability with their data, with complete transparency when interpreting their models (instead of the black box prevalent in many AI solutions). This flexibility allows brands to apply their learnings right back into their business.

What does this look like with Tealium? We offer a few methods for ingestion so you can import your predictions back into the Universal Data Hub: upload predictions via offline import or in real time, i.e. in response to a live visitor session, using our inbound API.

Once models are live, marketers and stakeholders can finally capitalize on their predictions. This might include serving a promotion to a suspected high-value prospect or suppressing marketing for predicted low-value visitors. Work with stakeholders and martech owners to think through the marketing applications of your predictions.
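As a sketch of how a prediction might drive those actions, the function below maps a predicted-value score to one of three treatments. The score range, the cutoffs, and the action names are all hypothetical. In practice they would come out of the conversation with stakeholders and martech owners.

```python
def choose_action(predicted_value_score, high=0.8, low=0.2):
    """Map a model score in [0, 1] to a marketing treatment.

    The cutoffs and action names are illustrative assumptions.
    """
    if not 0.0 <= predicted_value_score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if predicted_value_score >= high:
        return "serve_promotion"      # suspected high-value prospect
    if predicted_value_score <= low:
        return "suppress_marketing"   # predicted low-value visitor
    return "default_experience"       # no strong signal either way


actions = [choose_action(s) for s in (0.93, 0.50, 0.07)]
```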

In the end, machine learning isn’t going to replace your digital marketing strategy; it will augment and enable it. We believe successful brands will put their customer at the center of what they do. Machine learning is one tool among many to optimize decision making as part of that larger initiative.

Final Thoughts

  • Valuable ML applications don’t require billions of data points. Even small data problems can be supported by ML.
  • Not every business problem can or should be solved with ML. Work with stakeholders to educate internal teams on the costs and benefits of ML proactively.
  • Set up your workflow to fail fast: collect data thought to be predictive, perform exploratory data analysis and check feasibility for ML models before embarking on bigger projects.
  • Be cautious: if your initial models show near-100% accuracy, be suspicious of overfitting or target leakage.
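One quick screen for target leakage is to check whether any single input feature, on its own, almost perfectly determines the target. The sketch below is a rough heuristic under stated assumptions: it only makes sense for low-cardinality categorical features (a unique ID would be flagged trivially), and the feature names, sample rows, and 0.99 cutoff are hypothetical.

```python
from collections import Counter


def leakage_suspects(rows, target, cutoff=0.99):
    """Return features whose values alone predict the target almost perfectly."""
    suspects = []
    for feature in (k for k in rows[0] if k != target):
        # Count target outcomes for each distinct value of this feature.
        outcomes_by_value = {}
        for row in rows:
            outcomes_by_value.setdefault(row[feature], Counter())[row[target]] += 1
        # Accuracy of always guessing the majority outcome per feature value.
        correct = sum(c.most_common(1)[0][1] for c in outcomes_by_value.values())
        if correct / len(rows) >= cutoff:
            suspects.append(feature)
    return suspects


# Hypothetical rows: "cancel_form_opened" is only recorded once churn is under way.
rows = [
    {"visits": "many", "cancel_form_opened": True, "churned": True},
    {"visits": "few", "cancel_form_opened": True, "churned": True},
    {"visits": "many", "cancel_form_opened": False, "churned": False},
    {"visits": "few", "cancel_form_opened": False, "churned": False},
    {"visits": "many", "cancel_form_opened": False, "churned": False},
    {"visits": "few", "cancel_form_opened": True, "churned": True},
]
suspects = leakage_suspects(rows, target="churned")
```

A flagged feature is not proof of leakage, but it is a prompt to ask whether that value would really be known at prediction time.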
