Ph.D. Candidate in Informatics @ Penn State
11 August 2022
Photo by Amir-abbas Abdolali, Unsplash
This standalone project was undertaken during my summer 2022 internship at The Washington Post. The primary objective of the study was to identify potential audiences with a higher propensity to engage with specific advertisements using lookalike modeling techniques.
Digital news platforms function as intermediaries, bridging the gap between advertisements and their target audiences to promote products or services. This interaction involves three pivotal stakeholders: the advertisers who supply the ads, the news platform that delivers them, and the audiences who consume the news and may engage with the ads.
While the news platform captures audiences’ clickstream data, detailing which advertisements have been engaged with, predicting future engagement remains challenging. This issue extends beyond merely determining whether an audience will click on an advertisement; it evolves into a nuanced “click or unlabeled” problem. This is because the absence of a click (negative instances) doesn’t conclusively signify disinterest. In reality, such instances could represent either genuine disinterest or potential interest.
Within the scope of this research, I devised a lookalike modeling approach with three primary objectives:
A foundational premise of lookalike modeling is the belief that potential audiences, likely receptive to advertisements, exhibit traits mirroring those of existing users who have already engaged with the advertisement.
Lookalike modeling is commonly used to identify new potential users and expand the audience base. The basic idea is pretty simple: given a seed set \(S\) from a universal set \(U\), find groups of audiences from \(U-S\) who look and act like the audiences in \(S\). Lookalike modeling can be addressed through three different methodological approaches: rule-based, similarity-based, and model-based.
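For intuition, the similarity-based flavor can be sketched in a few lines: rank the candidates in \(U-S\) by cosine similarity to the centroid of the seed set \(S\). The function and parameter names below are illustrative, not the project's actual code:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def similarity_lookalikes(X_seed, X_candidates, top_k=1000):
    """Rank audiences in U - S by cosine similarity to the seed set's centroid."""
    centroid = X_seed.mean(axis=0, keepdims=True)   # average profile of the seed set S
    scores = cosine_similarity(X_candidates, centroid).ravel()
    return np.argsort(scores)[::-1][:top_k]         # indices of the closest lookalikes
```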
In this project, I adopted a model-based approach, specifically Positive-Unlabeled (PU) learning, which proceeds as follows:
Given a training set containing only positive (\(P\)) and unlabeled (\(U\)) classes:

1. Sample a set of reliable negative (RN) instances from \(U\).
2. Train a binary classifier on \(P\) (positives) versus RN (negatives).
3. Apply the classifier to the remaining unlabeled audiences and rank them by their predicted probability of engaging.
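A minimal sketch of this two-step procedure, assuming numpy feature matrices and a pluggable reliable-negative sampler; the function names and the logistic-regression default are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_step_pu(X_pos, X_unl, sample_rn, clf=None):
    """Two-step PU learning: (1) draw reliable negatives from U,
    (2) train P vs. RN, (3) score the rest of U."""
    rn_idx = sample_rn(X_pos, X_unl)                        # step 1: indices of RN in U
    X = np.vstack([X_pos, X_unl[rn_idx]])
    y = np.r_[np.ones(len(X_pos)), np.zeros(len(rn_idx))]
    clf = clf or LogisticRegression(max_iter=1000)
    clf.fit(X, y)                                           # step 2: binary classifier
    rest = np.setdiff1d(np.arange(len(X_unl)), rn_idx)
    return rest, clf.predict_proba(X_unl[rest])[:, 1]       # step 3: rank remaining U
```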
While the process may seem straightforward on the surface, two central challenges must be addressed before implementing PU learning. The first concerns sampling: what probability threshold should define a reliable negative (RN) instance, and which sampling method is superior in both efficiency and effectiveness? The second concerns prediction: which predictive model yields the most favorable outcomes? Given the variety of available sampling and prediction techniques and their possible combinations, a large number of PU learning designs can be conceived.
To keep the experiment simple, I used a single advertisement that was clicked 294 times out of 30,463 impressions, yielding 294 positive cases (\(S\)) and 30,169 unlabeled cases (\(U-S\)). After collecting ad-related data such as the number of articles read per topic, clicking propensity, advertisement size and position, the number of line items, the number of impressions served to a specific user, user device information, and user demographics, I categorized the features into three levels: article-level, ad-level, and user-level. The time period covered by the dataset is shown below.
The first seven days of the whole advertising period were used for training and testing the lookalike models, and the remaining period was used for monitoring and evaluation.
Input: positive sample set \(P\), unlabeled sample set \(U\)
Output: negative sample set \(N\) of size \(k\)
The key idea of spy sampling is that the spies, a small fraction of \(P\) planted into \(U\), behave identically to the unknown positive users hidden in \(U\); any unlabeled user whose predicted score falls below the spies' scores can therefore be treated as a reliable negative.
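A minimal sketch of the spy technique, assuming numpy feature matrices and a Naive Bayes base classifier; the spy ratio, quantile threshold, and function names are illustrative choices, not the exact settings used in the project:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def spy_sampling(X_pos, X_unl, spy_ratio=0.15, quantile=0.05, k=None, seed=0):
    """Pick reliable negatives from U with the spy technique."""
    rng = np.random.default_rng(seed)
    n_spies = max(1, int(spy_ratio * len(X_pos)))
    spy_mask = np.zeros(len(X_pos), dtype=bool)
    spy_mask[rng.choice(len(X_pos), size=n_spies, replace=False)] = True
    # Train on P-minus-spies (label 1) vs. U-plus-spies (label 0).
    X = np.vstack([X_pos[~spy_mask], X_unl, X_pos[spy_mask]])
    y = np.r_[np.ones((~spy_mask).sum()), np.zeros(len(X_unl) + n_spies)]
    clf = GaussianNB().fit(X, y)
    threshold = np.quantile(clf.predict_proba(X_pos[spy_mask])[:, 1], quantile)
    p_unl = clf.predict_proba(X_unl)[:, 1]
    rn = np.where(p_unl < threshold)[0]    # unlabeled users scoring below the spies
    return rn if k is None else rn[np.argsort(p_unl[rn])[:k]]
```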
Input: positive sample set \(P\), unlabeled sample set \(U\)
Output: negative sample set \(N\) of size \(k\)
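This second sampler is the bootstrap approach referenced in the results below: repeatedly treat a random subsample of \(U\) as negatives, train a classifier against \(P\), and average the out-of-bag scores, so that the unlabeled users who consistently score lowest become the reliable negatives. A minimal sketch, with an illustrative base classifier and parameters:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_sampling(X_pos, X_unl, n_rounds=50, k=1000, seed=0):
    """Score U with many P-vs-random-subset classifiers and keep the k
    unlabeled users with the lowest average positive score as RN."""
    rng = np.random.default_rng(seed)
    score_sum = np.zeros(len(X_unl))
    n_scored = np.zeros(len(X_unl))
    for _ in range(n_rounds):
        boot = rng.choice(len(X_unl), size=len(X_pos), replace=True)  # pseudo-negatives
        X = np.vstack([X_pos, X_unl[boot]])
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(boot))]
        clf = DecisionTreeClassifier(max_depth=5).fit(X, y)
        oob = np.setdiff1d(np.arange(len(X_unl)), boot)               # out-of-bag rows
        score_sum[oob] += clf.predict_proba(X_unl[oob])[:, 1]
        n_scored[oob] += 1
    avg = np.divide(score_sum, n_scored,
                    out=np.ones_like(score_sum), where=n_scored > 0)
    return np.argsort(avg)[:k]              # the k most negative-looking users
```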
In addition, since the dataset has a very small number of positive cases (less than 1\%), SMOTE and SMOTE combined with random under-sampling (SMOTE+RUS) were applied to the training set.
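A minimal sketch of both resampling setups using the imbalanced-learn library; the sampling ratios and the synthetic stand-in dataset are illustrative, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Stand-in for the real P-vs-RN training set (~1% positives).
X_train, y_train = make_classification(n_samples=30000, weights=[0.99], random_state=42)

# SMOTE alone: synthesize minority (clicker) examples up to 10% of the majority class.
X_sm, y_sm = SMOTE(sampling_strategy=0.1, random_state=42).fit_resample(X_train, y_train)

# SMOTE + RUS: oversample the minority first, then randomly undersample the
# majority so training ends at a 1:2 positive-to-negative ratio.
smote_rus = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.1, random_state=42)),
    ("rus", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_sr, y_sr = smote_rus.fit_resample(X_train, y_train)
```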
The models based on bootstrap sampling showed better performance on the test set than the models with spy sampling.
During the monitoring and evaluation period, users identified by the Spy+AdaBoost (AB) and Spy+Logistic Regression (LR) models exhibited a higher propensity to click on the advertisements compared to users selected at random.
Also, as shown in the figure below, the Spy+AB and Spy+LR models identified users who clicked on the advertisement more rapidly than the other models did.
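For reference, the monitoring comparison boils down to measuring lift over a random baseline; a minimal sketch, assuming a binary clicked array indexed by user (the names here are illustrative):

```python
import numpy as np

def click_lift(clicked, model_idx, seed=0):
    """Click rate of model-selected users vs. an equally sized random sample."""
    rng = np.random.default_rng(seed)
    random_idx = rng.choice(len(clicked), size=len(model_idx), replace=False)
    model_ctr = clicked[model_idx].mean()      # CTR among lookalike users
    random_ctr = clicked[random_idx].mean()    # CTR among randomly picked users
    return model_ctr / random_ctr if random_ctr > 0 else float("inf")
```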