The ‘Facebook Recruiting IV: Human or bot?’ competition just ended on Kaggle. For those unfamiliar with the competition, participants downloaded a table of about 7 million bids, which corresponded to another table of around 6,000 bidders. For 4,000 of those bidders, you had to estimate the probability that they were a human or bot, based on the remaining 2,000 bidders, whose bot status was given.
I was first on the public leaderboard for some time, but ended up coming in 17th on the private leaderboard. Still, given I’m a beginner I was pretty happy with the outcome, and I learned a lot. In this post I’ll give an overview of my approach to feature engineering and modelling, and share some of the lessons learned. Everything was done in R.
Each subheading below describes a general group of features that I extracted from the data and used in modelling. Features were estimated for each bidder_id in the training and test sets, and combined into a matrix called ‘bidderCharacteristics’.
Reading in the data
Total number of unique bids, countries, devices, IPs, URLs, merch categories for each bidder
The first set of features were just simple sums of the number of unique(x) where x is one of the variables in the bids table. The below code shows how I calculated the number of unique countries each bidder bid from.
Proportion of bids in each country, device, IP, URL, merchandise category
These features proved to be useful, particularly country. The below code shows how this was calculated for the country feature – a similar process was used for the other variables.
Given the sheer number of IPs and URLs, I limited the lists of these to only those IPs and URLs that were used for at least 1000 bids. This still ended up giving me about 600 IP variables and 300 URL variables. Correlation plots showed highly-correlated clusters of some URLs and IPs. To reduce dimensionality I tried using principal components analysis (PCA) on these variables. It seemed to help with URLs; in my final model I included the top 50 principal components from the URLs. Not much help was provided with IPs – I ended up using RandomForest importance scores on the full set to decide on which ones to include, which wasn’t many in the end. Here’s the code used to perform pca on the urls:
Mean popularity of country, device, IP, URL, merchandise categories used
I defined the ‘popularity’ of a particular country, device, or etc., as the number of unique bidder_id’s that bid from that variable. For each bidder, I then took the mean of these popularity scores given the countries, devices, etc., that they bid from. Here’s the code snippet used to calculate mean IP popularity:
Mean number of bids from countries, devices, IPs, etc., bidded from
Very similar to the previous feature: this looked at how many bids were made from each country, device, etc., and then gave each bidder the mean of these features across the countries, devices, etc., that the bidded from.
As it appears many others in this competition also realised, it became clear fairly early on to me in the competition that the bids were from three distinct three-day time periods, and that the time between the first and last bid was probably very close to exactly 31 days. Based on this information I could convert the obfuscated ‘time’ field into more meaningful units.
A number of other features stemmed from having this info on hand…
Proportion of bids in each hour of the day
A plot of bot density by hour of the day showed bots were more common during the ‘off-peak’ bidding periods. This suggested that taking the total proportion of a user’s bids in each hour of the day was likely to be a useful feature. The below code shows how I did this:
‘Bids per time’, mean time between bids, and ‘time active’
Bids per time and mean time between bids are self-explanatory. Time active I defined as the time between a bidder’s first and last bid.
I originally extracted these three features using the entire bids table at once. I later realised however, that the features could be skewed by the fact that there were three distinct time chunks. For instance, the mean time between a user’s bids was calculated by looking at the average length of time between a user’s bids. If a user had two bids in separate time chunks, this metric would be artificially inflated by the missing data between the time chunks.
Thus I figured out the ‘cut off times’ for each bid chunk’s start and end, divided into three chunks, extracted my features from each, and then took the overall means of the three features:
Proportion of auctions where a bidder was the last bidder
I took this feature as potentially meaning the bidder won the auction.
Mean duration of auctions a bidder participated in
This didn’t turn out to be particularly useful:
Variance in proportion of bids in each hour
The idea here was that a human might be more varied in terms of the hours of the day the bidded in, or maybe the opposite. For each of the 9 days I calculated the
Mean, variance, skewness and kurtosis of bids per auction, bids per device… auctions per device, auctions per county… and so on
Using the example of auctions per country, this feature was extracted by creating a table of bidder_id’s by countries and then placing the number of unique auctions in each country/bidder_id combination in the table cells. Row-wise mean, variance, skewness and kurtosis were then obtained. This was repeated for many possible combination of IPs, URLs, bids, auctions, devices, countries and hours:
Choice of model
To set a benchmark, I first tried modelling the bot probability using logistic regression. As expected this wasn’t particularly effective. RandomForest was the next model I tried. I also experimented with adaboost and extraTrees from the caret package, as well as xgboost. None of these were able to outperform RandomForest, however.
Addressing class imbalance
In the training set of some ~2000 bidders, there were only about 100 bots. I found I was able to improve both local cross validation (CV) and public leaderboard scores by over-sampling the bots prior to training the model. I achieved this through an R function:
Local training and testing - cross validation
While I did experiment with the cross validation features packaged with caret, I preferred the flexibilty of my own CV routine. I used 5- or 10-fold CV, depending on how much time I wanted to wait for resutls (usually used 10-fold).
I found my public leaderboard scores were usually higher than my CV scores, which I thought was a bit strange. I was probably overfitting the public leaderboard to some extent, or just getting lucky, because my final score on the private leaderboard ended up been much closer to my general CV performance.
The below loop gives the general gist of how I trained, tested and tuned my RF model using the training set:
Fitting the final model
After testing models out via cross validation, the below general code snippet was used to make submissions:
Variable selection for the final model
I didn’t end up using all of the variables generated during the feature engineering stage in my final model (there were some 1400 in total), though some of my best-scoring models did include as many as 1200 features. The ‘core’ model had around 315 predictor variables. These particular 315 came out of various tests using RF importance, balanced with my findings on what seemed to just work. When I added the mean/variance/skewness/kurtosis set of features, performance seemed to degrade, so a number of these features ended up being excluded. I tried to address the high dimensionality problem in various ways - reducing sets of highly-correlated variables, and removing variables with low RF importance scores - however none of these seemed to really improve performance. The takeaway from that for me was that RandomForest seems to be very effective at extracting all of the relevant information from the variables you give it, without being confounded by superfluous or barely-relevant variables. I’m not sure if this is always the case, but it seemed to be so here - removing variables that seemed like they should be useless in a statistical sense usually reduced model accuracy.
If you’re curious, here is the vector of ‘best’ variables that I used in the final model (50 URL principal components variables are all that’s excluded from this list):
Lessons learned and things to try next time
Here are some of the key things I learned from this competition, or things I might do differently next time:
The private leaderboard can be misleading; next time I will conduct more rigorous testing using local cross validation assessments rather than ‘trusting’ the public leaderboard.
Feature engineering is far more important than model selection or parameter tuning (beyond a certain point). Next time I’ll focus more on feature extraction and having a clear structure around my feature extraction/variable selection process.
Upon looking at some of the better-scoring participants solutions, I think it’s easy to see why I came 17th, and not higher. Their features were just a bit more logical/clever in terms of being able to pick out the bots. The structure of their feature extraction was also clearer.
Oversampling to address class imbalances can improve accuracy (at least in an ROC AUC sense).
Next time I’ll save all my plots as I go so I can include some more pretty pictures in a write-up like this!