Using Azure ML to Build Clickthrough Prediction Models


This blog post is by Girish Nathan, a Senior Data Scientist at Microsoft.

Ad click prediction is a multi-billion dollar industry, and one that is still growing rapidly. In this post, we build ML models on the largest publicly available ad click prediction dataset, from Criteo. The Criteo dataset consists of some 4.4 billion advertising feedback events. In Criteo’s words, “…this dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction.”

Azure services provide the tools needed to build a predictive model using this data. We use an Azure HDInsight cluster to load the Criteo data into Hive tables, and use Azure ML to build ML models on the dataset and understand it better. In particular, we show how to use the learning with counts technique to produce compact summaries of high dimensional categorical variables.

Problem Statement

Given Criteo’s click prediction dataset, we ask the following question:

For a given example, will the user click or not?

We model this as a binary classification problem, where a click gets the label “1” and lack of a click gets the label “0”.

We use the Azure ML DRACuLa (learning with counts) modules for building count features on the categorical data, and a two-class boosted decision tree learner for the binary classification problem. For manipulating the data prior to building counts, we use Azure HDInsight clusters.

Getting Access to the Data

The Criteo dataset is available here. After you accept the terms of use, you are directed to a page with details on the dataset and information on how to access it publicly. Note that this dataset is hosted in a public Azure Blob Storage account.

The Criteo Dataset

The Criteo dataset consists of 24 days of data, one file per day, stored in gzip format. In each row, the first column, “Col1”, indicates whether a click occurred (“1”) or not (“0”). The remaining columns hold 39 features: the first 13 are numeric, while the last 26 are categorical.

An excerpt of the data:

Schema : Column names Col1 – Col40 [ Col1 is the label column ]

For reasons of space, we show a truncated example below:

Col1, Col2, Col3,…,Col15,Col16,Col17,…,Col40

Per the data description, the numeric features typically represent various counts, while the categorical features are 4-byte hashed values of categorical strings. This hashing is primarily done as a security measure to protect privacy. We remark that this implies that while we can get insights into the number of unique values on each column (and combinations thereof), interpreting what these values might mean is not feasible due to the hashing.
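For illustration, a 4-byte hash of a categorical string can be produced as follows. Criteo's actual hashing scheme is not published, so this sketch (a truncated MD5) is only an assumption about what such a hash might look like:

```python
import hashlib

def hash_categorical(value: str) -> str:
    # Truncate an MD5 digest to 4 bytes (8 hex characters). This is purely
    # illustrative; the real hash function used for the dataset is not disclosed.
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:8]

print(hash_categorical("example_category"))
```

As the hash is one-way, two columns can be compared for equality of values, but the original category strings cannot be recovered.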

A few points related to the data:

  1. Out of 24 days, the first 21 (day_0 through day_20) are stored in the directory raw/count; we use these 21 days of data for generating count features on the high dimensional categorical variables (explained in some detail in what follows).

  2. We use the data from day_21 as our training dataset; this is located in the directory raw/train.

  3. Finally, we use day_22 and day_23 as two separate test datasets. These are located in the directory raw/test. 

Data Preprocessing and Exploration Using Azure HDInsight Clusters

To explore the dataset and perform basic preprocessing before building ML models with Azure ML, we use an Azure HDInsight Hadoop cluster. For more information on how to create Hive tables on this dataset using an HDInsight cluster, please follow the walkthrough in this link; the rest of this section refers to sections in this walkthrough. The outcome of this walkthrough is to obtain the downsampled train and test datasets that are used in our model building below.

The section “Create Hive database and tables” describes creating Hive tables over the count, train, and test datasets. In what follows, we assume we have these tables built. Since other sections of the above walkthrough also describe feature exploration and how to downsample the data for use in Azure ML (in the section “Down sample the datasets for Azure ML”), we will not go into these topics in this blog post. Below is a short summary of the salient features of this dataset.

Number of examples: there are approximately 4.4 billion total examples in the dataset.

Number of training examples: we put approximately 192 million examples in the training dataset.

Number of test examples from the two datasets: For test data from day_22, we have about 189 million examples, while for test data from day_23, we have approximately 188 million examples.

Label distribution in the data: In this dataset, we have 3.3% positive examples (clicks with label “1” in Col1) and 96.7% negative examples (no-clicks with label “0” in Col1).

Understanding the number of unique values that the categorical variables take is of interest when building ML models, since high-dimensional categorical features can be challenging for some algorithms to handle. This dataset has 26 different categorical variables, so let’s take a look at a few of them to see how many unique values they take. To see how to do this, please refer to section “Find number of unique values for some categorical columns in the train dataset” in the above walkthrough. A summary for some columns follows.

Total number of unique values of a few categorical features :

                        Col15 : 19011825

                        Col16 : 30935

                        Col17 : 15200

We note that some categorical features have a large number of unique values. In the next subsection, we suggest an effective technique for dealing with such high dimensional categorical variables.

Dealing with high dimensional categorical features

Traditionally, categorical features are dealt with via one-hot encoding. While this approach works well when the categorical features have few values, it results in feature space explosion for high-cardinality categorical features and is thus unsuitable for them.

An efficient way of dealing with high-cardinality categorical features is a method called DRACuLa based on label-conditional counts for the categorical features. These conditional counts can then be directly used as feature vectors together with log-odds values derived from them.
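To make the idea concrete, here is a minimal Python sketch (not the Azure ML implementation) of building a label-conditional count table for one categorical column and turning a value into count features plus a smoothed log-odds:

```python
import math
from collections import defaultdict

def build_count_table(values, labels):
    # Aggregate per-value counts: table[v] = [negative count, positive count].
    table = defaultdict(lambda: [0, 0])
    for v, y in zip(values, labels):
        table[v][y] += 1
    return table

def count_features(value, table, alpha=1.0):
    # Replace a categorical value with [positives, negatives, smoothed log-odds];
    # alpha is a pseudocount that smooths estimates for rare or unseen values.
    neg, pos = table.get(value, [0, 0])
    log_odds = math.log((pos + alpha) / (neg + alpha))
    return [pos, neg, log_odds]

# Toy example: one categorical column with clicks (1) and non-clicks (0).
tbl = build_count_table(["a", "a", "b", "a"], [1, 0, 0, 1])
print(count_features("a", tbl))  # "a" has 2 positives, 1 negative
```

However high the cardinality of the column, each value maps to just three numeric features, which is what makes the representation compact.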

Generating count features on the high dimensional categorical features using Azure ML

In Azure ML experiments that we illustrate below, we use the Build Counting Transform and Apply Transformation modules to build count features from categorical variables, and featurize the train and test datasets with them.

The count featurization is built on the count dataset (day_0 to day_20); those counts are then used as features on our train dataset (day_21) and the test datasets (day_22 and day_23). For the Build Counting Transform module, we use the MapReduce option and set the number of classes to 2. We then provide credentials for the HDInsight cluster to be used for this computation (more information on how to do this is available in the documentation referenced above).

Model Building in Azure ML

After subsampling the data to create a training set for Azure ML training (please refer to the walkthrough), we save the resulting datasets in the Azure ML workspace as Datasets. More information on how to do this is available via the Reader module documentation.

Modeling the problem using DRACuLa features and experimental results

Due to the skewed class distribution in our dataset, it is useful to downsample the negative examples in the train dataset so as to have a 1:1 ratio of positive to negative examples. We use this downsampled dataset as our training data for the experiment. Once the class-balanced train dataset is created, we are ready to apply the previously constructed count-based featurization to it.

As mentioned in an earlier blog post, the count dataset is used for building count tables, which in turn are used to featurize the train and test datasets with the resulting count features. Because all data from days 0 to 20 is used for building counts, the resulting featurization draws on the complete class-conditional statistics from all available historical data.

We show a sample of the experiment to illustrate the set-up:

The Build Counting Transform module constructs the count table for the categorical features on a 970GB dataset allocated for counts using MapReduce on HDInsight (Hadoop). This module yields the data set “Criteo_count_table_transform” shown below. We then apply the resulting featurization using the Apply Transformation module to the train and test datasets. Applying the transformation essentially transforms the categorical features into class-conditional counts (and optionally log-odds if so desired). We show a part of the experiment that illustrates how these modules connect to each other.

Once we have our training and test datasets ready, we are ready to learn a model; the learner chosen for this task is the Two-Class Boosted Decision Tree. Our trained model may then be applied to the test data to score it, as shown below.

The scoring portion of the experiment looks like so:

The overall experiment looks as follows:

Before we explore the results, we explain what the modules in the experiment mean.

The first set of R scripts applied to both the training and the test datasets just serves to give the data columns their correct names, like so:

dataset1 <- maml.mapInputPort(1) # class: data.frame

# assign the correct column names, Col1 through Col40
colnames(dataset1) <- paste0("Col", 1:40)

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("dataset1")

The second set of R scripts applied to both the training and the test datasets serves to balance the data so that the positive and negative classes are present in a ratio 1:1. For completeness, we include it below:

dataset1 <- maml.mapInputPort(1) # class: data.frame

# balance the classes so that pos-neg in a specified ratio

pos_neg_ratio <- 1

d_class0 <- subset(dataset1, Col1 == "0")
d_class1 <- subset(dataset1, Col1 == "1")

numRows_class0 <- nrow(d_class0)
numRows_class1 <- nrow(d_class1)

# downsample the negative class 0
numRows_class0_downsampled <- numRows_class1 * pos_neg_ratio

d_class0_downsampled <- d_class0[sample(numRows_class0, numRows_class0_downsampled, replace = FALSE), ]

# new output data frame containing 1:1 class ratios

data.set <- data.frame()

data.set <- rbind(data.set, d_class0_downsampled)

data.set <- rbind(data.set, d_class1)

# shuffle the rows

numRows_data.set <- nrow(data.set)
data.set <- data.set[sample(numRows_data.set, numRows_data.set, replace = FALSE), ]

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("data.set")

A final R script is used after the “Score Model” module; it computes the log loss, and is shown below for convenience:

# Compute the log loss

# add guardrails when labels are 0,1

epsilon <- 1e-10
epsilon1 <- 1.0 - epsilon

ll <- function(actual, predicted)
{
    # clip predictions away from 0 and 1 (pmax/pmin are vectorized)
    predicted <- pmax(epsilon, predicted)
    predicted <- pmin(epsilon1, predicted)

    score <- -(actual*log(predicted) + (1-actual)*log(1-predicted))
    score[actual==predicted] <- 0
    score[is.nan(score)] <- Inf
    score
}

#’ Compute the mean log loss
logLoss <- function(actual, predicted) mean(ll(actual, predicted))

# Map 1-based optional input ports to variables
dataset1 <- maml.mapInputPort(1) # class: data.frame

actual <- dataset1[,1]
predicted <- dataset1[,41]

mll <- logLoss(actual, predicted)

# Compute relative across entropy
prior <- sum(actual)/length(actual)
priorLogLoss <- -(prior*log(prior) + (1-prior)*log(1-prior))
rap <- 100*(priorLogLoss-mll)/priorLogLoss

# cat("Relative Across Entropy = ", rap)
out <- data.frame(mll, rap, prior, priorLogLoss)
colnames(out) <- c("Mean log-loss", "Relative log-loss", "Prior", "Prior log-loss")

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("out")

To cleanse the data of any missing values, we use the Clean Missing Data module. This module offers many options for cleaning data, as can be seen from the extensive documentation. In addition to the learner and the “Apply Transformation” modules that we already discussed, the experiment has a Train Model module for training and a subsequent Score Model module for scoring on the test dataset. Finally, we use an Evaluate Model module for evaluating model performance.

Although we have two days of test data, for simplicity, we show the testing experiment on the day_22 test dataset only. Since the problem is one of binary classification, an appropriate metric is the AUC. We use a confusion matrix, AUC, and ROC curves to summarize the prediction accuracy for this approach.

We mentioned that an R script computes the log loss. To see how our final log loss compares to the prior log loss (the loss of a constant predictor that always outputs the prior click rate), we compute both. The result of this computation is shown below:
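As a quick sanity check (not a result from the experiment itself), the prior log loss implied by the 3.3% positive rate reported earlier can be computed directly:

```python
import math

# Prior click rate from the label distribution reported above.
prior = 0.033

# Log loss of a constant predictor that always outputs the prior.
prior_log_loss = -(prior * math.log(prior) + (1 - prior) * math.log(1 - prior))
print(round(prior_log_loss, 4))  # ≈ 0.145
```

A useful model must achieve a mean log loss below this baseline; the relative metric in the R script expresses the improvement as a percentage of the prior loss.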

Modeling with no DRACuLa features and some experimental results

Having looked at the effect of using class-conditional count features (also referred to as DRACuLa features), we now show the corresponding experiment in which no count features are used. Instead, the categorical features are modeled using one-hot encoding. We also note that the R scripts perform exactly the same function as in the previous experiment.

A snapshot of the full experiment is shown below:

We note that this experiment does not generate the count-based features, as mentioned. Since the rest of the modules are the same as our first experiment, we do not explain them in more detail here. 

Again, as this is a binary classification problem, we use the AUC as a reliable metric. We find:

The AUC when not using counts is actually lower than when count-based features are used in this experiment.

We mentioned that the final R script takes the output of the “Score Model” module and computes the log loss. We now show the results of this computation below:


The use of conditional count (DRACuLa) features results in a compact representation of the high-cardinality categorical features present in the Criteo dataset. Moreover, we find that training is about twice as fast with count features as without them. In addition, model performance is better with the count-based DRACuLa features than without, because the count-based features provide a compact representation of the otherwise sparse, high-dimensional categorical features.

Big Learning Made Easy – with Counts!



This post is by Misha Bilenko, Principal Researcher in Microsoft Azure Machine Learning.

This week, Azure ML is launching an exciting new capability for training on terabytes of data. It is based on a surprisingly simple yet amazingly robust learning algorithm that is widely used by practitioners, yet receives virtually no dedicated attention in ML literature or courses. At the same time, the algorithm has been mentioned in passing in numerous papers across many fields, dating back to literature on branch prediction in compilers from the early 1990s. It is also the workhorse for several multi-billion-dollar applications of ML, such as online advertising and fraud detection. Known under different names – ‘historical statistics’, ‘risk tables’, ‘CTR features’ – it retains the same core technique under the covers of different applications. In this blog post, we introduce the general form of this learning with counts method (which we call ‘Dracula’ – to avoid saying “Distributed Robust Algorithm for Count-based Learning” each time, and to honor the children’s classic). We illustrate its use in Azure ML, where it allows learning to scale to terabytes of data with just a few clicks, and summarize the aspects that make it a great choice for practitioners.

In many prediction problems, the most informative data attributes are identities of objects, as they can be directly associated with the historical statistics collected for each object. For example, in click probability prediction for online advertising, key objects are the anonymous user, the ad, and the context (e.g., a query or a webpage). Their identities may be captured by such attributes as the user’s anonymous ID, a machine’s IP, an identifier for the ad or advertiser, and the text or hash for the query or webpage URL.

Identity attributes, as well as their combinations, hold enormous predictive power: e.g., a user’s historical propensity to react to ads, the likelihood that a given ad is clicked for a particular query, or the tendency to click on a certain advertiser’s ads from a particular location. While attribute values can be encoded as one-of-many (also known as one-hot) binary features, this results in a very high-dimensional representation, which, when training with terabytes of data, practically limits one to linear models. Although using attribute combinations can mitigate the limited expressiveness of a linear representation, the resulting high-dimensional parameters remain difficult to interpret or monitor.

An attractive alternative that allows easy inspection of the model is to directly associate each attribute or combination with the likelihoods (or propensities) mentioned above. Computing these conditional probabilities requires aggregating past data in a very simple data structure: a set of count tables, where each table associates an attribute or combination with its historical counts for each label value. The following figure illustrates two count tables in the online advertising domain:


UserId                   Number of clicks    Number of non-clicks
Bob                      17                  235
…                        …                   …

QueryHash, AdDomain      Number of clicks    Number of non-clicks
50fa3cc0, …              497                 10984
598fd4fe, …              …                   …

These tables make it easy to compute the click probability for each seen object or combination. In the example above, user Bob clicks on ads 17/(17+235) = 6.75% of the time, while users entering the query with hash 598fd4fe click on the ad a whopping 30% of the time (which seems surprising until one realizes that 598fd4fe is the hash for the query foo – in which case the ad is likely ‘navigational’, providing a shortcut to the most-desired search result for the query).

These tables may remind the reader of Naïve Bayes, the classic learning algorithm that multiplies conditional probabilities for different attributes assuming their independence. While the simplicity of multiplying historical estimates is attractive, it can produce less-than-desirable accuracy when the independence assumptions are violated – as they inevitably are when the same object is involved in computing multiple estimates for different combinations. While Bayesian Networks extend Naïve Bayes by explicitly modeling relationships between attributes, they require encoding the dependency structure (or learning it), and often demand computationally expensive prediction (inference).

A slight departure from the rigor of Bayesian methods reveals a more general algorithm that maintains the simplicity of aggregating count tables, yet provides the freedom to utilize state-of-the-art algorithms such as boosted trees or neural networks to maximize the overall predictive accuracy. Instead of multiplying probabilities obtained from each table, one can choose to treat them as features – along with the raw per-label counts (or any transform thereof).  

Going back to the example above, the feature vector for estimating Bob’s probability to click on the ad for query 50fa3cc0 is {17; 235; 0.0675; 497; 10984; 0.045} – simply a concatenation of the counts and resulting probabilities. This representation can be used to train a powerful discriminative learner (such as boosted trees), which has the capacity to take into account all conditional probabilities, as well as the strength of evidence behind them (corresponding to actual counts). As done in Bayesian methods, probabilities are typically adjusted via standard smoothing techniques, such as adding pseudocounts based on priors.
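Assembling that feature vector is simply a concatenation of per-table lookups, as this sketch shows. Note that the raw ratio for the second table, 497/(497+10984), is about 0.043; the 0.045 quoted above presumably already reflects smoothing:

```python
def table_features(pos, neg):
    # Counts from one table lookup plus the raw click probability.
    return [pos, neg, pos / (pos + neg)]

# Bob's user-table counts, then the (QueryHash, AdDomain) combination counts.
features = table_features(17, 235) + table_features(497, 10984)
print(features)
```

The downstream learner sees both the probabilities and the raw counts, so it can weigh an estimate by the amount of evidence behind it.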

One may wonder how this algorithm can work in domains where millions or billions of objects exist: wouldn’t we need extremely large tables, particularly for combinations that may be very rare? The method scales via two approaches that preserve the most useful statistical information by combining rare-item counts either deterministically or statistically: compressing the tables via back-off, or via count-min sketches. Back-off is the simpler of the two: it involves including only rows that have some minimal number of counts, while keeping a single “back-off bin” where the counts for the rest are combined. An additional binary feature can then be added to the representation to differentiate between attribute values for which counts were stored and those that were looked up in the back-off bin. The alternative to back-off is the count-min sketch – an elegant technique invented just over a decade ago. It stores the counts for each row in multiple hash tables indexed by independent hashing functions. The impact of collisions is reduced by taking the smallest of the counts retrieved via the different hashes.
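Both compression schemes are easy to sketch in Python. The thresholds, sketch depth, and width below are illustrative choices, not values used by Azure ML:

```python
import hashlib

def apply_backoff(table, min_count=10):
    # Keep rows whose total count meets the threshold; fold the rest
    # into a single shared back-off bin of [negatives, positives].
    kept, backoff = {}, [0, 0]
    for key, (neg, pos) in table.items():
        if neg + pos >= min_count:
            kept[key] = [neg, pos]
        else:
            backoff[0] += neg
            backoff[1] += pos
    return kept, backoff

class CountMinSketch:
    # depth independent hash tables of the given width; a query returns the
    # minimum over the depth cells, which upper-bounds the true count.
    def __init__(self, depth=4, width=2**16):
        self.depth, self.width = depth, width
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, key, i):
        digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for i in range(self.depth):
            self.tables[i][self._index(key, i)] += count

    def query(self, key):
        return min(self.tables[i][self._index(key, i)] for i in range(self.depth))

cms = CountMinSketch()
cms.add("user:Bob", 17)
print(cms.query("user:Bob"))  # at least 17; equal barring collisions
```

Back-off is exact for the retained rows at the cost of lumping rare items together; the count-min sketch bounds memory regardless of cardinality at the cost of a small, one-sided overcount.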

Next, we turn to Azure ML to summarize the algorithm and illustrate its basic use, as shown in the following screenshot from Azure ML Studio:

Data is input via the Reader module and passed to the Build Count Table module, which aggregates the statistics for specified attributes. The tables (along with metadata for smoothing) are then sent to the Count Featurizer module, which injects the count-based statistics and passes the resulting data representation to train boosted trees downstream. We note that the algorithm is not limited to binary classification: one can use it for regression or multi-class classification as well – the only difference being that instead of two count cells, more are needed, corresponding to discretization of the numeric label or multiple categories, respectively.

We conclude this post by summarizing the key benefits of learning with counts that made it so popular in the industry.  

  • First and foremost, the algorithm is simple to understand, inspect and monitor: whenever one desires to perform root-cause analysis for a particular prediction, one can directly examine the evidence supplied by the count tables for each attribute or combination. 

  • Second, the algorithm is highly modular in multiple aspects. It provides modeling modularity: one can experiment with different downstream learners utilizing the same tables, or construct multiple tables using different back-off parameters. It also provides modularity with respect to data: counts aggregated from multiple slices of data or different time periods can be added (or subtracted to remove problematic data).

  • It also provides elasticity to high data cardinality by automatically compressing tables using either back-off or count-min sketches.  

  • Finally, the algorithm is very well-suited for scaling out, as the count table aggregation is a classic map-reduce computation that can be easily distributed.  

In follow-up posts, we will discuss how the algorithm can be scaled out via Azure ML’s integration with HDInsight (Hadoop) to easily learn from terabytes of data, as well as techniques for avoiding overfitting and label leakage.