A spectrogram of of the audio clips in the FAT2019 competition
The Freesound Audio Tagging 2019 (FAT2019) Kaggle competition just wrapped up. I didn’t place too well (my submission was ranked around 144th out of 408 on the private leaderboard). But winning wasn’t exactly my focus. I tried some interesting things and would like to share what I did, plus provide some explanations and code so others might be able to benefit from my work.
This post starts with a brief overview of the competition itself. Then I work chronologically through the main ideas I tried, introducing some of the theory behind each. I also provide some code snippets illustrating each method.
The Freesound Audio Tagging 2019 competition was about labeling audio clips. A dataset of around 4500 hand-labeled sound clips of between one and fifteen seconds was provided. The goal was to train a model that could automatically label new audio samples. There were 80 possible labels, ranging from ‘acoustic guitar’ to ‘race car’ to ‘screaming’. Audio samples could be tagged with one or more labels. Here are a few examples:
Labels = [Accelerating_and_revving_and_vroom, Motorcycle]
Labels = [Fill_(with_liquid)]
Labels = [Cheering, Crowd]
Like many other entrants, my starting point was a public kernel submitted by kaggler mhiro2. This kernel classified samples via a convolutional neural network (convnet) image classifier architecture. ‘Images’ of each audio clip were created by taking the log-mel spectrogram of the audio signal. 2-second subsets of the audio clips are randomly selected, and the model is then trained via a binary cross-entropy loss (as this is a multi-label classification task). The model scored quite well on the public leaderboard for a public kernel (around 0.610 if I remember correctly).
I was able to get a big boost in score (~0.610 –> ~0.639) through simply adding DenseNet-like skip connections to this kernel. I implemented skip connections by concatenating each network layer’s input with its output, prior to downsampling via average pooling.
Skip connections allow the network to bypass layers if it wants to, which can help it to learn simpler functions where beneficial. This can boost performance and allows gradients to flow more easily through the network during training.
The change is illustrated in my kernel fork and this code snippet:
Another key feature of this kernel was cosine annealing learning rate scheduling. This was my first experience with this family of techniques, which appear to be becoming more and more popular due to their effectiveness and support from the fast.ai community.
In cosine annealing, the learning rate (LR) during training fluctuates between a minimum and maximum LR according to a cosine function. The LR is updated at the end of each epoch according to this function.
The learning rate (y-axis) used in training over epochs (x-axis) when cosine annealing is enabled
The ideas behind cosine annealing LR were introduced in this paper. Often, cosine annealing leads to two main benefits:
The main theory behind why cosine annealing (or SGD with restarts) leads to better results is well-explained in this blog post. In short, there are two purported modes of action:
Cosine LR annealing to be seems to be a really effective technique. I’m also curious to dive into other practices advocated by the fast.ai crowd, namely one cycle policies and LR-finding.
Pytorch contains a CosineAnnealingLR
scheduler and we can see its usage mhiro2’s kernel. Basically:
The metric for this competition was lwlwrap (an implementation of this metric can be found here). Without going into too many details, it can be stated that lwlwrap works as a ranking metric. That is, it does not care what numerical score you assign to the target tag(s), only that that targets’ scores are higher than the scores for any other tags.
I theorised that using a hinge loss instead of binary cross-entropy might be more ideal for this task, since it too only cares that the scores for the target classes are higher than all others (binary cross-entropy, on the other hand, is somewhat more constrained in terms of the domain of the output scores). I used Pytorch’s MultiLabelMarginLoss
to implement a hinge loss for this purpose. This loss is defined as:
This basically encourages the model’s predicted scores for the target labels to be at least 1.0 larger than every single non-target label.
Unfortunately, despite seeming like a good idea on paper, switching to this loss function did not appear to provide any performance improvement in the competition.
From this point on, a lot of the things I tried centred around semi-supervised learning (SSL). Labeling data is a costly process, but unlabeled data is abundant. In SSL, we seek to benefit from unlabeled data by incorporating it into our model’s training loss, alongside the labeled data. SSL was the focus of my masters’ thesis.
In the FAT2019 competition, we were provided with an additional training dataset of around 20,000 audio samples. The labels on this dataset were ‘noisy’, however, as they were labeled by users. This thus seemed to me like a good place to apply SSL, by just treating these additional samples as unlabeled.
I tried quite a few SSL methods on the competition data; I cover each of these below.
Virtual adversarial training (VAT) is an SSL techinque that was shown to work very well in the image domain.
In VAT, we add small amounts of adversarial noise to images, then penalise our model for making different predictions on these images compared to the original images (source)
VAT is inspired by the idea of adversarial examples. It has been shown that, if we peer inside an image classifier, we can exploit it and make it misclassify an image by just making tiny changes to that image.
In VAT, we try to generate such adversarial examples on-the-fly during training, and then update our network by saying that its prediction should not change in response to such small changes.
To do this, we need to first find the adversarial direction: the direction we should move our image \( X \) towards such that the model’s prediction changes as much as possible.
To find the adversarial direction, we:
Initliase a random-normal tensor \( \mathbf{r} \) with the same shape as \( X \).
Calculate the gradient of \( \mathbf{r} \) with respect to \( KL(f(X), f(X + \mathbf{r})) \), where \( KL(f(\cdot), g(\cdot)) \) is the Kullback-Liebler divergence between two probability distribution functions \( g(\cdot)) \) and \( g(\cdot)) \).
The normalised direction of this gradient is our adversarial direction, which we call \( \mathbf{d} \).
Once we have \( \mathbf{d} \), we move \( X \) in that direction by some small scaling factor \( \epsilon \). We then add a term to our loss that penalises the difference in the model’s predictions, i.e.:
\[loss_{\text{unsupervised}}(X) = KL ( f(X), f(X + \epsilon * \mathbf{r}) ) \\ loss = loss_{\text{supervised}}(X, y) + loss_{\text{unsupervised}}(X)\]Since this \( loss_{\text{unsupervised}} \) term does not depend on any label \( y \), we can also use it with our unlabeled data
There is a great Pytorch implementation of VAT on github. With this implementation, adding VAT to a model is simple:
To use this repo for FAT2019 however, I needed to make a couple of changes to the implementation. The main problem is that it expects a classification model, so it uses softmax before the KL divergence over the classification distribution.
In our case, we use binary cross-entropy to predict a separate distribution for each label, rather than a distribution over labels. To overcome this I replaced the softmax with a sigmoid (where needed), and replaced the KL-divergence loss between the new and old predictions with the binary cross-entropy loss. For details, see the diffs between the Pytorch VAT repo and my fork.
Mean teacher held the previous state of the art for SSL on CIFAR10 and other datasets, before being beaten by Mixmatch (which I descibe below). It is relatively simple to implement. Unfortunately though it seemed to produce little or no benefit for me in the competition.
An overview of the mean teacher approach to SSL. A student model learns on a combination of a labeled dataset, and the predictions made by an exponential moving average of its history (the teacher model) (source)
In semi-supervised mean teacher:
Another technique I (and many other Kagglers) played around with was mixup. In basic mixup, we combine two images \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) with a factor \( \alpha \) to become a single image, \( \alpha \mathbf{X}_1 + (1 - \alpha) \mathbf{X}_2 \). We then train on these combined images with combined labels \( \alpha \mathbf{y}_1 + (1 - \alpha) \mathbf{y}_2 \). Though it seems strange to ‘combine’ images like this, this seems to have a regularisation effect on models, and leads to better generalisation and results in general.
Applying mixup to audio perhaps makes more sense, as it is quite natural to add pieces of audio together, at least in the frequency domain. In the spectral domain, I’m not sure if this is still so natural. Still, it was a popular technique in this technique that seemed to provide some performance boost.
Mixmatch is an SSL technique from Google Research. It achieved relatively large gains in SSL performance on CIFAR10 and other benchmarks, beating already-impressive state-of-the-art performance of other techniques.
(Mixmatch produces labels for unlabeled data points by averaging their predictions over many augmentations, and then *sharpening this average prediction source)*
In Mixmatch:
One difficulty with transferring this method to FAT2019, was that the idea of sharpening predictions is not as well-defined in the binary cross-entropy case. Since, as mentioned above, this is in fact a ranking problem, our model could still perform very well, even if only outputting very low confidence predictions for all classes.
To sharpen in the binary cross-entropy setting, we essentially (either explicitly or implicitly) need to define some threshold at which we call a prediction ‘confident’, and increase its label in the sharpening, or ‘unconfident’, and decrease its label. A natural choice for this would be 0.5.
Ultimately though, I could not get Mixmatch to perform well, and I think this may be due to the fact that many predictions are quite low confidence in the final-trained models, even though they represent the most confident class. Perhaps selecting the most confident classes and sharpening them by setting their labels to 1 would be a better approach.
]]>At PandaScore, we built a model to track the positions of each champion in a League of Legends (LoL) game, based solely on images of the minimap. In this more technical blog post, we describe how we achieved this.
PandaScore is the provider of static and real-time data for eSports. We cover a range of video games and tournaments, converting live in-game action into usable data for our customers. These customers range from media outlets, to betting market providers, to eSports teams themselves.
A core part of the work we do involves deep learning and computer vision. This is needed as we take video streams of live eSports matches, and convert them into data describing what is happening in the game.
The League of Legends (LoL) minimap is a great example of this work. For this particular task, our specific goal was to build an algorithm that can ‘watch’ the minimap, and output the (x, y) coordinates of each player on the minimap.
We saw creating this model as a high priority for our customers. Knowing the coordinates of each player in each moment of every game opens up a multitude of possibilities. The information could, for example, allow teams to better understand the effectiveness of their play strategies. It could be also be used to predict when certain events are going to happen in a game. Or it could be used to make more engaging widgets for spectators, with real-time stats.
Our customers expect the data we provide to be extremely accurate. Building a model that would be sufficiently reliable was far from an easy task however. We describe why in the next section.
In the deep learning literature, the type of problem that involves looking at images and locating or tracking objects in that image is generally referred to as object detection, or tracking.
On the surface, our particular minimap problem appears as though it could be easily solved with detection models such as YOLO or SSD. We would just need to label a large dataset of minimap crops with the positions of each champion, and then pass this dataset to one of these algorithms.
Indeed, this was the approach we tried first. Drawing on previous work on the LoL minimap problem done by Farzain Majeed in his DeepLeague project, we trained an SSD-style model on Farza’s DeepLeague100K dataset, and found it to work quite well on a held-out test set from his dataset.
There was one major problem with this approach however: the model did not generalise to champions not present in the dataset that it was trained on. We needed a model that would work for any champion a player happens to choose — a model that pushes errors if player chooses a rarely-picked or new champion would not be acceptable for customers of PandaScore.
We spent some weeks exploring a number of routes to resolving this issue. The main options were:
Manually annotate a lot more training data: we ruled this out as it would be too time-consuming to perform and maintain.
Train a model to detect the positions of any champion on the minimap, then feed the detected regions from this model to a classifier model covering all champions: this approach showed some promise early on, but was ultimately deemed unworkable.
Train a model on the raw champion ‘portraits’ **— the raw portrait images of each champion that the icons on the minimap are based on — then somehow **transfer this model to work in detecting the champions on real minimap frames.
We ultimately went with approach 3, which we describe in more detail in the next section.
The final approach we arrived at relied on a classifier that was trained on the raw champion portraits. If the classifier was only trained on these portraits, then we could be more certain that it would not give any preferential treatment to the champions that only occur in our minimap frames/hero coordinates training dataset.
The general idea here is to train a classifier on heavily-augmented versions of the raw champion portraits. We could then slide this trained classifier over minimap frames, resulting in a grid of predictions. At each square in this grid, we could extract the detection probabilities for each of the 10 champions we know are being played in the current game. These detection grids could then be fed to a second, champion-agnostic model that would learn to clean these up and output the correct (x, y) coordinates for each detected champion.
For the classifier however, we found that standard (albeit heavy) augmentation was insufficient to train a model on raw champion portraits that could reliably generalise to the champions as they appear on the minimap. We needed augmentations that could transform the raw portraits, such that they looked the same as they do on the minimap.
On the minimap, LoL champions appear with a blue or red circle around them. There can be explosions, pings, and other artifacts that also obfuscate the portraits. We experimented with crudely adding such artifacts manually. We found however, that the most effective approach was to learn a model that could generate such artifacts. We achieved this with a Generative Adversarial Network (GAN). In short, GANs are a neural network-based approach that allows us to learn a model that can generate data from a desired distribution (in our case, we essentially want to generate explosions, pings, blue or red circles, and other artifacts to add to the raw champion portraits). A general introduction to GANs can be found here.
Our particular use of GANs differs somewhat from the usual setup. We couldn’t just generate champion images in the minimap environment directly, as if we did this, our model would only learn to generate the around 50 out of 140 champions that are present in our minimap frames dataset.
Rather, in our case we needed to generate masks to add to raw champion portraits. The discriminator of the GAN would thus see the raw champion portrait plus the mask, and the generator would have to learn to change these masks such that the combination looks real. This is illustrated in the diagram below.
As the generator’s adversary, the discriminator tries to distinguish between ‘real’ images (crops of hero images taken directly from minimap frames), and ‘fake’ images (generated masks added to random hero portraits). After much tweaking effort and training time, we were able to train a mask-generating generator, which we put to use in the next section.
We now had a trained generator that was capable of producing masks that, when added to any raw champion portrait, would take us to a distribution of images that look (somewhat) like how that champion might appear on the minimap. We could thus train a classifier on this distribution, in the hopes that it would also work for detecting champions on real minimap frames.
The below diagram illustrates the training setup for this classifier:
This step is quite simple really. We just train an ordinary convolutional neural network (convnet) classifier C on our raw champion portraits, augmented by the GAN-generated masks. We use a shallow, wide classifier network with lots of dropout to prevent overfitting to the GAN-style data.
Our classifier is a fully-convolutional neural network that takes colour 24x24 ‘champion-on-the-minimap’ images as input and outputs a 1x1x(NumChampions + 1) tensor. We pass this tensor through a softmax nonlinearity to estimate class probabilities (the additional output channel is for a background class; we trained our classifier to also detect random patches of minimap with no champion and output a high ‘background’ probability).
If we instead pass an entire minimap crop of size 296x296 to this classifer, we get a 12x12x(NumChampions + 1) output. Each square of this 12x12 grid represents a region of the minimap, and in each of these squares we have the detection probabilities for each champion. We can increase the resolution of this ‘detection map’ to 70x70 by reducing the stride of the final two layers of our classifier (a convolution layer followed by an average pooling layer) to 1, from 2 (this trick has been applied elsewhere, e.g. in this work).
We slice out these ‘detection maps’ — as shown above— for each of the ten champions present in the current game. We also slice out the detection map for the background class. This 70x70x11 tensor then serves as the input to the final stage in our minimap model — a convolutional LSTM sequence model.
Very often, when champions are close to one another, one champion’s icon on the minimap will cover that of another. This poses issues for our classifier from the previous step, which cannot detect the champion that is being covered. As our customers rely upon the accuracy of our data feeds, we needed to address this issue. To do so, we enlisted a sequence model.
The idea here is that a sequence model can have some ‘memory’ of where the champions were last seen, and if they disappear suddenly, and another champion is nearby, then our model can ‘assume’ that the missing champion is probably just behind the nearby champion.
The above diagram presents the architecture of our sequence model. We take the 11 detection maps (D_it) extracted as described in the previous section (ten champions + one background), and pass each independently through the same convnet, which reduces their resolution and extracts relevant information. A low resolution copy of the minimap crop itself (M_t) is also passed through a separate convnet, the idea being that some low-resolution features about what is going on in the game might also be useful (e.g. if there is a lot of action, then non-detected champions are likely just hidden among that action).
The minimap and detection map features extracted from these convnets are then stacked into a single tensor of shape 35x35xF, where F is the total number of features (the minimap and detection map inputs were of size 70x70, and our convnets halved this resolution). We call this tensor r_t in the above diagram, as we have one of these tensors at each time step. These r_t are then fed sequentially into a convolutional LSTM (see this paper for conv-LSTM implementation details). We found switching from a regular LSTM to a convolutional LSTM to be hugely beneficial. Presumably, this was because the regular LSTM needed to learn the same ‘algorithm’ for each location on the minimap, whereas the conv-LSTM allowed this to be shared across locations.
At each time step, each of the convolutional LSTM’s 10 output channels (o_it, one i for each champion) is passed through the same dense (fully-connected) layer. This then outputs x and y coordinates for each champion. The mean squared error (MSE) between the output and target coordinates is then backpropagated to the weights of this network. The model converges after 6 or so hours of training on a single GPU (we trained on our own dataset of around 80 games, that was obtained in a similar way to that described by Farza).
We are still more rigourously evaluating our network before moving it into production. However results on our in-house test set suggest that more than 95% of all detections are within a 20 pixel radius of the target. Out of interest, we also tested the necessity of the GAN augmentation, but found performance to be substantially degraded when using standard augmentation alone, as opposed to augmenting with the GAN-generated masks. So it seems all our GAN training was not for nothing :)
This article is quite light on implementation details, and we’re sure some of our more technical readers will want to know more. If you have questions, please don’t hesitate to ask them here in the comments, or in the r/machinelearning thread.
]]>You can find the full code accompanying this post here.
t-SNE is an algorithm that lets us to do dimensionality reduction. This means we can take some data that lives in a high-dimensional space (such as images, which usually consist of thousands of pixels), and visualise it in a lower-dimensional space. This is desirable, as humans are much better at understanding data when it is presented in a two- or three-dimensional space.
Take MNIST for example, a classic dataset of images of handwritten digits from 0 to 9. MNIST images are 28x28 pixels, meaning they live in 784-dimensional space. With t-SNE, we can reduce this to just two dimensions, and get a picture like this:
MNIST images visualised in two dimesnions using t-SNE. Colours indicate the digit of each image. (via)
From here on, this article is focused on the implementation of t-SNE. If you want to understand more about dimensionality reduction in general, I recommend this great blog post from Chris Olah. If you’re interested in learning how to use t-SNE effectively, then definitely check this out.
t-distributed Stochastic Neighbor Embedding, or t-SNE, was developed by Geoffrey Hinton and Laurens van der Maaten. Their paper introducing t-SNE is very clear and easy to follow, and I more or less follow it in this post.
As suggested by the acronym, most of t-SNE is SNE, or the Stochastic Neighbor Embedding algorithm. We cover this first.
We have a dataset \(\mathbf{X}\), consisting of \(N\) data points. Each data point \(x_i\) has \(D\) dimensions. We wish to reduce this to \(d\) dimensions. Throughout this post we assume without loss of generality that \(d=2\).
SNE works by converting the euclidean distance between data points to conditional probabilities that represent similarities:
\[p_{j|i} = \frac{\exp \left ( - || x_i - x_j || ^2 \big / 2 \sigma_i^2 \right ) }{\sum_{k \neq i} \exp \left ( - || x_i - x_k || ^2 \big / 2 \sigma_i^2 \right )} \hspace{2em} (1)\]Essentially this is saying that the probability of point \(x_j\) being a neighbour of point \(x_i\) is proportional to the distance between these two points (we’ll see where the \(\sigma_i\)’s come from a bit later).
One thing to note here is that we set \( p_{i|i} = 0 \) for all \(i\), as we are not interested in how much of a neighbour each point is with itself.
Let’s introduce matrix \(\mathbf{Y}\).
\(\mathbf{Y}\) is an \(N\)x\(2\) matrix that is our 2D representation of \(\mathbf{X}\).
Based on \(\mathbf{Y}\) we can construct distribution \(q\) as per our construction of \(p\) (but without the \(\sigma\)’s):
\[q_{j|i} = \frac{\exp \left ( - || y_i - y_j || ^2 \right ) }{\sum_{k \neq i} \exp \left ( - || y_i - y_k || ^2 \right ) }\]Our overall goal is to pick the points in \(\mathbf{Y}\) such that this resulting conditional probability distribution \(q\) is similar to \(p\). This is achieved by minimising a cost: the KL-divergence between these two distributions. This is defined as follows:
\[C = \sum_i KL(P_i || Q_i) = \sum_i \sum_j p_{j|i} \log \frac {p_{j|i}} {q_{j|i}}\]We want to minimise this cost. Since we’re going to use gradient descent, we’re only really interested in its gradient with respect to our 2D representation \(\mathbf{Y}\). But more on that later.
Let’s code something. Both the formulas for \(p_{j|i}\) and \(q_{j|i}\) require the negative squared euclidean distance (this part: \(- || x_i - x_j || ^2 \)) between all pairs of points in a matrix.
In numpy we can implement this as:
This function uses a bit of linear algebra magic for efficiency, but it returns an \(N\)x\(N\) matrix whose \((i,j)\)’th entry is the negative squared euclidean disance between inputs points \(x_i\) and \(x_j\).
As someone who uses neural networks a lot, when I see \( \exp(\cdot) \big / \sum \exp(\cdot) \) like in \((1)\), I think softmax. Here is the softmax function we will use:
Note that we have taken care of the need for \( p_{i|i} = 0 \) by replacing the diagonal entries of the exponentiated negative distances matrix with zeros (using np.fill_diagonal
).
Putting these two functions together we can make a function that gives us a matrix \(P\), whose \((i,j)\)’th entry is \( p_{j|i} \) as defined in \((1)\):
In the previous code snippet, the sigmas
argument should be an \(N\)-length vector containing each of the \(\sigma_i\)’s. How do we get these \(\sigma_i\)’s? This is where perplexity comes into SNE. The perplexity of any of the rows of the conditional probabilities matrix \(P\) is defined as:
Here \(H(P_i)\) is the Shannon entropy of \(P_i\) in bits:
\[H(P_i) = - \sum_j p_{j|i} \log_2 p_{j|i}\]In SNE (and t-SNE) perplexity is a parameter that we set (usually between 5 and 50). We then set the \(\sigma_i\)’s such that for each row of \(P\), the perplexity of that row is equal to our desired perplexity – the parameter we set.
Let’s intuit about this for a moment. If a probability distribution has high entropy, it means that it is relatively flat – that is, the probabilities of most of the elements in the distribution are around the same.
Perplexity increases with entropy. Thus, if we desire higher perplexity, we want all of the \(p_{j|i}\) (for a given \(i\)) to be more similar to each other. In other words, we want the probability distribution \(P_i\) to be flatter. We can achieve this by increasing \(\sigma_i\) – this acts just like the temperature parameter sometimes used in the softmax function. The larger the \(\sigma_i\) we divide by, the closer the probability distribution gets to having all probabilities equal to just \(1/N\).
So, if we want higher perplexity it means we are going to set our \(\sigma_i\)’s to be larger, which will cause the conditional probability distributions to become flatter. This essentially increases the number of neighbours each point has (if we define \(x_i\) and \(x_j\) as neighbours if \(p_{j|i}\) is below a certain probability threshold). This is why you may hear people roughly equating the perplexity parameter to the number of neighbours we believe each point has.
To ensure the perplexity of each row of \(P\), \(Perp(P_i)\), is equal to our desired perplexity, we simply perform a binary search over each \(\sigma_i\) until \(Perp(P_i)=\) our desired perplexity.
This is possible because perplexity \(Perp(P_i)\) is a monotonically increasing function of \(\sigma_i\).
Here’s a basic binary search function in python:
To find our \(\sigma_i\), we need to pass an eval_fn
to this binary_search
function that takes a given \(\sigma_i\) as its argument and returns the perplexity of \(P_i\) with that \(\sigma_i\).
The find_optimal_sigmas
function below does exactly this to find all \(\sigma_i\)’s. It takes a matrix of negative euclidean distances and a target perplexity. For each row of the distances matrix, it performs a binary search over possible values of \(\sigma_i\) until finding that which results in the target perplexity. It then returns a numpy vector containing the optimal \(\sigma_i\)’s that were found.
We now have everything we need to estimate SNE – we have \(q\) and \(p\). We could find a decent 2D representation \(\mathbf{Y}\) by descending the gradient of the cost \(C\) with respect to \(\mathbf{Y}\) until convergence.
Since the gradient of SNE is a little bit trickier to implement however, let’s instead use Symmetric SNE, which is also introduced in the t-SNE paper as an alternative that is “just as good.”
In Symmetric SNE, we minimise a KL divergence over the joint probability distributions with entries \(p_{ij}\) and \(q_{ij}\), as opposed to conditional probabilities \(p_{i|j}\) and \(q_{i|j}\). Defining a joint distribution, each \(q_{ij}\) is given by:
\[q_{ij} = \frac{\exp \left ( - || y_i - y_j || ^2 \right ) }{\sum_{k \neq l} \exp \left ( - || y_k - y_l || ^2 \right ) } \hspace{2em} (2)\]This is just like the softmax we had before, except now the normalising term in the denominator is summed over the entire matrix, rather than just the current row.
To avoid problems related to outlier \(x\) points, rather than using an analogous distribution for \(p_{ij}\), we simply set \(p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}\).
We can easily obtain these newly-defined joint \(p\) and \(q\) distributions in python:
Let’s also define a p_joint
function that takes our data matrix \(\textbf{X}\) and returns the matrix of joint probabilities \(P\), estimating the required \(\sigma_i\)’s and conditional probabilities matrix along the way:
So we have our joint distributions \(p\) and \(q\). If we calculate these, then we can use the following gradient to update the \(i\)’th row of our low-dimensional representation \(\mathbf{Y}\):
\[\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j)\]In python, we can use the following function to estimate this gradient, given the joint probability matrices P and Q, and the current lower-dimensional representations Y.
To vectorise things, there is a bit of np.expand_dims
trickery here. You’ll just have to trust me that grad
is an \(N\)x\(2\) matrix whose \(i\)’th row is \(\frac{\partial C}{\partial y_i}\) (or you can check it yourself).
Once we have the gradients, as we are doing gradient descent, we update \(y_i\) through the following update equation:
\[y_i^{t} = y_i^{t-1} - \eta \frac{\partial C}{\partial y_i}\]So now we have everything we need to estimate Symmetric SNE.
This training loop function will perform gradient descent:
To keep things simple, we will fit Symmetric SNE to the first 200 0’s, 1’s and 8’s from MNIST. Here is a main()
function to do so:
You can find the load_mnist
function in the repo, which will prepare the dataset as specified.
Here’s what the results look like after running Symmetric SNE for 500 iterations:
Resulting two-dimensional representation of the first 200 0’s, 1’s and 8’s in the MNIST dataset, obtained via Symmetric SNE.
So we can see in this case Symmetric SNE is still quite capable of separating out the three different types of data that we have in our dataset.
Foei! That was a lot of effort. Fortunately to go from Symmetric SNE to t-SNE is simple. The only real difference is how we define the joint probability distribution matrix \(Q\), which has entries \(q_{ij}\). In t-SNE, this changes from \((2)\) to the following:
\[q_{ij} = \frac{ \left ( 1 + || y_i - y_j || ^2 \right ) ^ {-1} }{\sum_{k \neq l} \left ( 1 + || y_k - y_l || ^2 \right ) ^ {-1} } \hspace{2em} (3)\]This is derived by assuming the \(q_{ij}\) follow a Student t-distribution with one degree of freedom. Van der Maaten and Hinton note that this has the nice property that the numerator approaches an inverse square law for large distances in the low-dimensional space. Essentially, this means the algorithm is almost invariant to the general scale of the low-dimensional mapping. Thus the optimisation works in the same way for points that are very far apart as it does for points that are closer together.
This addresses the so-called ‘crowding problem:’ when we try to represent a high-dimensional dataset in two or three dimensions, it becomes difficult to separate nearby data points from moderately-far-apart data points – everything becomes crowded together, and this prevents the natural clusters in the dataset from becoming separated.
We can implement this new \(q_{ij}\) in python as follows:
Note that we used 1. - distances
instead of 1. + distances
as our distance function returns negative distances.
The only thing left to do now is to re-estimate the gradient of the cost with respect to \(\mathbf{Y}\). This gradient dervied in the t-SNE paper as:
\[\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) \left ( 1 + || y_i - y_j || ^2 \right ) ^ {-1}\]Basically, we have just multiplied the Symmetric SNE gradient by the inv_distances
matrix we obtained halfway through the q_tsne
function shown just above (this is why we also returned this matrix).
We can easily implement this by just extending our earlier Symmetric SNE gradient function:
We saw in the call to estimate_sne
in our main()
function above that these two functions (q_tsne
and tsne_grad
) will be automatically passed to the training loop if TSNE = True
. Hence we just need to set this flag if we want TSNE instead of Symmetric SNE. Easy!
Setting this flag and running main()
gives the following 2D representation:
t-SNE representation of the first 200 0’s, 1’s and 8’s in the MNIST dataset after 500 iterations.
This looks a little better than the Symmetric SNE result above. When we scale up to more challenging cases, the advantages of t-SNE are clearer. Here are the results from Symmetric SNE versus t-SNE when we use the first 500 0’s, 1’s, 4’s, 7’s and 8’s from the MNIST dataset:
Symmetric SNE representation of the first 500 0’s, 1’s, 4’s, 7’s and 8’s in the MNIST dataset after 500 iterations.
t-SNE representation of the first 500 0’s, 1’s, 4’s, 7’s and 8’s in the MNIST dataset after 500 iterations.
It looks like the Symmetric SNE has had a harder time disentagling the classes than t-SNE, in this case.
Overall, the results look a tad lacklustre as, for simplicity, I’ve omitted a number of optimisation details from the original t-SNE paper (plus I used only 500 data points and barely tuned the hyperparameters).
Still, this exercise really helped me to properly understand how t-SNE works. I hope it had a similar effect for you.
Thanks for reading!
]]>Often though, I’ve found it to be a bit of a pain to integrate saving the embeddings correctly into my model training code. Plus there are plenty of non-Tensorflow-based vectors that I’d like to be able to easily visualise through this tool.
So I decided to throw together a function save_embeddings()
that takes the hassle out of this, allowing you to go straight from numpy arrays to Tensorboard-visualised embeddings. You can find the code here. Enjoy!
(Thanks to this Pinch of Intelligence post for some useful code snippets that I re-used for this).
]]>The authors set up their experiment as follows. We have three neural networks, named Alice, Bob, and Eve. Alice wishes to communicate an N bit message P to Bob. Alice and Bob also share a key (which you can think of as a password) of N bits.
Alice takes the message and the key, and encrypts the message, producing a communication C of N bits. Bob receives this communication, and then attempts to decrypt it, producing P_{Bob}.
Unfortunately for Bob and Alice, Eve intercepts Alice’s communication C. She then decrypts this message herself, producing her attempted recovery of P, which is called P_{Eve}.
Figure 1: The adversarial network setup diagram given in [1].
As mentioned, Alice, Bob and Eve are all neural networks. All three of these networks are quite similar.
Alice (Figure 2) takes as input the message and key vectors, concatenated into one long vector of length 2N. This then passes through a single fully-connected hidden layer of size 2N. It then passes through what I will refer to as the standard convolutional setup, which takes this 2N-length vector, passes it through a number of 1D convolution filters, and eventually outputs an N-length vector. This is the communication C that gets sent to Bob.
Figure 2: Alice’s neural network setup.
Bob’s network (Figure 3) is identical to Alice’s, except that his input is the concatenation of the communication and the key.
Figure 3: Bob’s neural network setup.
Eve’s network is also quite similar to Bob and Alice’s. However her input is just the communication C. She also has an additional fully-connected hidden layer of size 2N prior to the standard convolutional setup: the authors wanted to make Eve a bit more complex in order to give her a better chance of figuring out how to decrypt C.
Figure 4: Eve’s neural network setup.
The following table gives the output dimensionality of each layer in each network:
Layer | Output shape |
---|---|
Alice input | 2N |
Alice hidden | 2N |
Alice convolutional | (2, 2N+1) –> (4, N) –> (4, N) –> (1, N) |
Bob input | 2N |
Bob hidden | 2N |
Bob convolutional | As per Alice |
Eve hidden 1 | 2N |
Eve hidden 2 | 2N |
Eve convolutional | As per Alice |
The next ingredient are the loss functions:
Eve’s loss function is simply the absolute difference between the message vector P and her attempt at decrypting it P_{Eve}. This loss is backpropagated through the weights in the Eve network to update her weights.
Loss_{Eve} = Eve L1 error = abs(P - P_{Eve})
Bob and Alice’s loss function is also the absolute decryption error, plus an additional term that signifies how well Eve is currently decrypting the message. This additional term is:
(N / 2 - Eve L1 error)^{2} / (N / 2)^{2}
This is minimised when Eve gets half the bits correct, on average, in her decryption, which indicates that she is doing no better than random guessing. The gradient of this loss function is used to update the weights in both the Bob and Alice networks. Putting these together, we have:
Loss_{Bob} = Bob L1 error + (N / 2 - Eve L1 error)^{2} / (N / 2)
Training then proceeds in an adversarial way: we train Bob and Alice for a while until they are communicating effectively, then we train Eve for some time, until she learns to decrypt the message. Then we train Bob and Alice again, who figure out how to fool Eve’s current decryption method. Then we train Eve again, who cracks the improved encryption, and so on. The authors find that after a while, Bob and Alice’s encryption becomes too strong for Eve to be able to learn to crack it.
The implementation is fairly straightforward. I have used some custom classes, HiddenLayer()
and ConvLayer()
for adding standard NN layers. You can find these in the layers.py
file.
We first need to define our batch size and N up front. I have added the possibility to have different lengths for the key, message and communication, however I have not tested this; changing it might cause issues.
For Alice and Bob we just create a fairly straightforward sequential NN:
Eve is similarly implemented. We just need to use alice_comm
in her inputs.
Here we just implement the loss equations described in the previous section. Note that the additional term in Bob’s loss function is a bit simpler than the equation described above. Things have been set up such that a mean error of 1 means that half the bits were correctly decrypted (as bits are input as either -1 or 1, so a single error = 2). Hence the N/2 terms can be dropped from the implementation.
The only tricky-ish thing here is making sure that the training function for Alice and Bob updates all their parameters, while Eve’s only updates her parameters. I use lasagne.adam
for an implementation of the Adam SGD optimiser. I put the functions in dictionaries for ease of use in adversarial training.
Since it is used in all three networks, I made a custom class for the standard convolutional setup. It stores all the parameters and tensors relavent to all of the convolutional layers in the model. I have tried to match the description of the convolution setup described in the paper:
To perform the adversarial training, I made a train()
function that would train either Alice and Bob or Eve for some time. We then just iterate between calling this function on Alice and Bob, and then for Eve. The gen_data()
function generates batch_size
random message and key pairs. We train according to the loss, but for plotting we just store the decryption error for the party that is currently being trained.
I trained both Alice and Bob, and then Eve, for up to 2000 iterations at a time (early stopping occurred if the decryption error was below 0.01 for a while). I did 60 overall repetitions of this adversarial training setup. I then plotted the minimum decryption error achieved by Bob and by Eve in each of these 60 runs (Figure 5).
Figure 5: Bob and Eve’s decryption errors over 60 adversarial training iterations.
So, it seems to work. After a few adversarial rounds, Bob and Alice figure out a way to effectively scramble the communication such that Eve cannot learn how to decrypt it.
I also tested the setup without the four convolutional layers, instead replacing this with an additional 2N in, 1N out hidden layer (Figure 6).
Figure 6: Bob and Eve’s decryption errors over 60 adversarial training iterations, with the convolutional phase of the network excluded.
This seems to suggest that the convolution layers helps, but perhaps it is still possible to achieve the goals of this experiment without it - Eve still isn’t able to perfectly recover the message in this setup either.
I should note that this paper didn’t receive much love when it was posted on the Reddit MachineLearning forum. And I have to say I kind of agree with the points made in that discussion: really the fact that this works doesn’t mean it has created good encryption. Rather it more just speaks to the weakness of the Eve network in its ability to decrypt the message. This is sort of reflected by the fact that this setup still seems to work without the convolution layers (Figure 6). Still, it is an interesting idea, and I don’t think I’m in a position to judge its academic merit.
Thanks for reading - thoughts, comments or questions are welcome!
1: Abadi, M & Andersen, D. Learning to Protect Communications with Adversarial Neural Cryptography. October 24 2016. Google Brain.
]]>This post is also accompanied by a new, more complete and commented version of the code. I have also decided to upload the training and validation data I used.^{1}
If you just want to see or run the code yourself, then feel free to skip ahead.
The key differences between this and the last version are:
I now use the raw audio (downsampled to 11kHz) as the input to the neural network, rather than a spectogram
The raw audio vector is reshaped into a matrix of audio chunks, which are all processed in the same way. By default, 4 seconds of audio is represented by a 44100-length vector, which is reshaped into a matrix of size 441x100 (441 ‘timesteps’ with 100 ‘features’)
This input matrix is then fed into a neural network architecture that is much simpler: just a Convolution1D layer followed by max pooling
Training samples are now extracted randomly during training from the full-length audio files. This is achieved through the DataGen class. This means a much larger number of training samples can be synthesised without using additional memory
The Keras neural network specification now looks like this:
Let’s break this network topology down. We have:
max_pool
(=4 by default) filters, and a 1D convolution filter size of 3The only tricky thing here is the Convolution1D layer. It took me a while to figure out exactly what this is doing, and I couldn’t find it explained that clearly anywhere else on the web, so I’ll try to explain it here:
The Convolution1D layer takes an input matrix of shape Time Steps x Features
It then reduces this to something of shape Time Steps x 1. So essentially it reduces all of the features to just a single number
However, it does this for each filter, so if we have 16 filters, we end up with an output matrix of size Time Steps x 16
But, assuming we have just one filter, it convolves that 1D filter over the features. So if you imagine we have time steps on the horizontal axis, and features on the vertical axis, at every time step, it passes the same 1D convolution filter over the vertical axis
Then, a dense layer is used to go from a matrix of size Time Steps x Features to a matrix of size Time Steps x 1. So basically a weighted combination of the post-convolution features is taken at every time step. The same weights are used for all time steps
These last two steps are repeated for each filter, so actually we end up with an output matrix of size Features x Num. Filters
Here is the performance on three random four second sections of songs in the training set. You can see the audio in blue, the actual beats in black, and the predicted beats in red.
Figures: performance on some random four second clips of audio from the training set.
The predictions (red) might seem a bit noisy, but they’re still doing a pretty good job of picking up the onset of drum hits in the music.
The last figure shows how this is working more as an onset detector, rather than predicting where the actual beats of the bar are. In other words, it is working more as a ‘drum hits detector’ rather than a ‘beats in a bar detector.’
This is still useful however, and in some circumstances desirable. We can’t really expect our program to detect where the beats are by looking at just four seconds of audio. However by applying this predictor to larger sections of audio, the autocorrelation function of the predicted onsets can be used to infer the BPM (I experimented with this, and found it to be quite an accurate way of detecting the BPM: it was usually within 1 or 2 BPM of the actual song BPM in about 90% of tracks).
Another nice outcome is that performance on the validation data is pretty much the same as that on training data, meaning that the patterns the model has learned generalise well (at least to similar tracks, since I’ve only trained with electronic music):
Figures: performance on some random four second clips of audio from the validation set.
The last figure here also shows that the predictions are somewhat invariant to the amplitude (loudness) of the music being analysed.
To run the code yourself:
fit_nn.py
to train the neural network and visualise the resultswavs
subdirectory, and then running wavs_to_features.py
. These .wavs need to start exactly on the first beat, and have their true integer BPM as the first thing in the filename, followed by a space (see comments in code for further explanation).I played around with other, more complex network topologies, but found that this simple Convolution1D structure worked much better than anything else. Its simplicity also makes it very fast to train (at least on the quite old GPU in my laptop).
It is very curious that this network structure works so well. The output of each filter only has 441 timesteps, but it is clear that the predictions from the model are much more granular than this. It seems that certain filters are specialising in particular sections of each of the 441 time ‘chunks.’
In future it would be very interesting to drill down into the weights and see how this model is actually working. If anyone else wants to look into this then please do, and please share your findings!
Thanks for reading - would love to hear any thoughts!
1: I figure sharing 11kHz audio embedded in Python objects isn’t too bad a violation of copyright - if any of the material owners find this and disagree please feel free to get in contact with me and I will take it down.
]]>Initially I had to throw around a few ideas regarding the best way to represent the input audio, the BPM, and what would be an ideal neural network architecture.
One of the first decisions to make here is what general form the network’s input should take. I don’t know a whole lot about the physics side of audio, or frequency data more generally, but I am familiar with Fourier analysis, and spectograms.
I figured a frequency spectogram would serve as an appropriate input to whatever network I was planning on training. These basically contain time on the x-axis, and frequency bins on the y-axis. The values (pixel colour) then indicate the intensity of the audio signal at each frequency and time step.
An example frequency spectogram from a few seconds of electronic music. Note the kick drum on each beat in the lowest frequency bin.
I had a few different ideas here. First I thought I might try predicting the BPM directly. Then I decided I could save the network some trouble by having it try to predict the location of the beats in time. The BPM could then be inferred from this. I achieved this by constructing what I call a ‘pulse vector’ as follows:
Say we had a two second audio clip. We might represent this by a vector of zeroes of length 200 - a resolution of 100 frames per second.
Then say the tempo was 120 BPM, and the first beat was at the start of the clip. We could then create our target vector by setting (zero-indexed) elements [0, 50, 100, 150] of this vector to 1 (as 120 BPM implies 2 beats per second).
We can relatively easily infer BPM from this vector (though its resolution will determine how accurately). As a bonus, the network will also (hopefully) tell us where the beats are, in addition to just how often they occur. This might be useful, for instance if we wanted to synchronise two tracks together.
This image overlays the target output pulse vector (black) over the input frequency spectogram of a clip of audio.
My initial architecture involved just dense layers. I was working in Lasagne. I soon discovered the magic of Keras however, when looking for a way to apply the same dense layer to every time step. After switching to Keras, I also added a convolutional layer. So the current architecture is essentially a convolutional neural network. My intuition behind the inclusion and order of specific network layers is covered further below.
The main training data was obtained from my Traktor collection. Traktor is a DJing program, which is quite capable of detecting the BPM of the tracks you give it, particularly for electronic music. I have not had Traktor installed for a while, but a lot of the mp3 files in my music collection still have the Traktor-detected BPM stored with the file.
I copied around 30 of these mp3’s to a folder, however later realised that they still needed a bit more auditing - files needed to start exactly on the first beat, and needed to not get out of time throughout the song under the assumed BPM. Therefore I opened each in Reaper (a digital audio workstation), chopped each song to start on exactly the first beat, ensured they didn’t go out of time, and then exported them to wav.
Going from mp3/wav files to training data is all performed by the mp3s_to_fft_features.py
script.
~I then converted^{1} these to wav and read them into Python (using wavio). I also read the BPM from each mp3 into Python (using id3reader).~
-> I now already already have the songs in wav format, and the BPMs were read from the filenames, which I manually entered.
The wav is then converted to a spectogram. This was achieved by:
fft_sample_length
(default 768) every fft_step_size
(default 512) samplesThe target pulse vector matching the wav’s BPM is then created using the function get_target_vector
.
Then random subsets of length desired_X_time_dim
are taken in pairs from both the spectogram and target pulse vector. By this, we generate lots of training inputs and outputs that are a more manageable length from just the one set of training inputs. Each sample represents about 6 seconds of audio, with different offsets for where the beats are placed (so our model has to predict where the beats are, as well as how often they occur).
For each ~6 second sample, we now have a 512x32 matrix as training input - 512 time frames and 32 frequency bins (the number of frequency bins can be reduced by increasing the downsample
argument) - and a 512x1 pulse vector as training output.
In the latest version of the model, I have 18 songs to sample from. I create a training set by sampling from the first 13 songs, and validation and test sets by sampling from the last 5 songs. The training set contained 28800 samples.
As described above, I decided to go with a convolutional neural network architecture. It looked something like this:
An overview of the neural network architecture.
In words, the diagram/architecture can be described as follows:
The input spectogram is passed through two sequential convolutional layers
The output is then reshaped into a ‘time by other’ representation
Keras’ TimeDistributed Dense layers are then used (in these layers, each time step is passed through the same dense layer; this substantially reduces the number of parameters needed to be estimated)
Finally, the output is reduced to one dimension, and passed through some additional desnse layers before producing the output
The below code snippets give specific details as to the network architecture and its implementation in Keras.
First, we have two convolution layers:
I limited the amount of max-pooling. Max-pooling over the first dimension would reduce the time granularity, which I feel is important in our case, and in the second dimension we don’t have much granularity as it is (just the 32 frequency bins). Hence I only performed max pooling over the frequency dimension, and only once. I am still experimenting with the convolutional layers’ setup, but the current configuartion seems to produce decent results.
I then reshape the output of the convolution filters so that we again have a ‘time by other stuff’ representation. This allows us to add some TimeDistributed
layers. We have a matrix input of something like 512x1024 here, with the 1024 representing the outputs of all the convolutions. The TimeDistributed
layers allow us to go down to something like 512x256, but with only one (1024x256) weight matrix. This dense layer is then used at all time steps. In other words, these layers densely connect the outputs at each time step to the inputs in the corresponding time steps of the following layer. The overall benefit of this is that far fewer parameters need to be learned.
The intuition behind this is that if we have a 1024-length vector representing each time step, then we can probably learn a useful representation at a lower dimension of that time step, which will get us to a matrix size that will actually fit in memory when we try to add some dense layers afterwards.
Finally, we flatten everything and add a few dense layers. These simultaneously take into account both the time and frequency dimensions. This should be important, as the model can try to incorporate things like the fact that beats should be evenly spaced over time.
Usually the model got to a point where validation error stopped reducing after 9 or so epochs.
With the current configuration, the model appears to be able to detect beats in the music to some extent. Note that I’ve actually switched to inputs and outputs of length 160 (in the time dimension), though I was able to achieve similar results on the original 512-length data.
This first plot shows typical performance on audio clips within the training set:
Predicted (blue) vs actual (green) pulses - typical performance over the training set.
Performance is not as good when trying to predict pulse vectors derived from songs that were not in the training data. That said, on some songs the network still gets it (nearly) right. It also often gets the frequency of the beats correct, even though those beats are not in the correct position:
Predicted (blue) vs actual (green) pulses - typical performance over the validation set.
If we plot these predictions/actuals over the input training data, we can compare our own intuition to that of the neural network:
Predicted (black) vs actual (white) pulses plotted over spectogram - typical performance over the training set.
Take this one validation set example. I would find it hard to tell where the beats are by looking at this image, but the neural net manages to figure it out at least semi-accurately.
Predicted (black) vs actual (white) pulses plotted over spectogram - typical performance over the validation set.
This is still a work in progress, but I think the results show far have shown that this approach has potential. From here I’ll be looking to:
Use far more training data - I think many more songs are needed for the neural network to learn the general patterns that indicate beats in music
Read up on convolutional architectures to better understand what might work best for this particular situation
An approach I’ve been thinking might work better: adjust the network architecture to do ‘beat detection’ on shorter chunks of audio, then combine the output of this over a longer duration. This longer output can then serve as the input to a neural network that ‘cleans up’ the beat predictions by using the context of the longer duration
I still need to clean up the code a bit, but you can get a feel for it here.
I first thought of approaching this problem using a long-short term memory (LSTM) network. The audio signal would be fed in frame-by-frame as a frequency spectogram, and then at each step the network would output whether or not that time step represents the start of a beat. This is still an appealing prospect, however I decided to try a network architecture that I was more familiar with
I tried a few different methods for producing audio training data for the network. For the proof-of-concept phase, I created a bunch of wav’s with just sine tones at varying pitches, decaying quickly and played only on the beat, at various BPM’s. It was quite easy to get the network to learn to recognise the BPM from these. A step up from this was taking various tempo-synced break beats, and saving them down at different tempos. These actually proved difficult to learn from - just as hard as real audio files
It might be also interesting to try working with the raw wav data as the input
1: In the code, the function convert_an_mp3_to_wav(mp3_path, wav_path)
tells Linux to use mpg123 to convert the input mp3_path
to the output wav_path
. If you are on Linux, you may need to install mpg123. If you are using a different operating system, you may need to replace this with your own function that converts the input mp3 to the output wav.
I was first on the public leaderboard for some time, but ended up coming in 17th on the private leaderboard. Still, given I’m a beginner I was pretty happy with the outcome, and I learned a lot. In this post I’ll give an overview of my approach to feature engineering and modelling, and share some of the lessons learned. Everything was done in R.
Each subheading below describes a general group of features that I extracted from the data and used in modelling. Features were estimated for each bidder_id in the training and test sets, and combined into a matrix called ‘bidderCharacteristics’.
The first set of features were just simple sums of the number of unique(x) where x is one of the variables in the bids table. The below code shows how I calculated the number of unique countries each bidder bid from.
These features proved to be useful, particularly country. The below code shows how this was calculated for the country feature – a similar process was used for the other variables.
Given the sheer number of IPs and URLs, I limited the lists of these to only those IPs and URLs that were used for at least 1000 bids. This still ended up giving me about 600 IP variables and 300 URL variables. Correlation plots showed highly-correlated clusters of some URLs and IPs. To reduce dimensionality I tried using principal components analysis (PCA) on these variables. It seemed to help with URLs; in my final model I included the top 50 principal components from the URLs. Not much help was provided with IPs – I ended up using RandomForest importance scores on the full set to decide on which ones to include, which wasn’t many in the end. Here’s the code used to perform pca on the urls:
I defined the ‘popularity’ of a particular country, device, or etc., as the number of unique bidder_id’s that bid from that variable. For each bidder, I then took the mean of these popularity scores given the countries, devices, etc., that they bid from. Here’s the code snippet used to calculate mean IP popularity:
Very similar to the previous feature: this looked at how many bids were made from each country, device, etc., and then gave each bidder the mean of these features across the countries, devices, etc., that the bidded from.
As it appears many others in this competition also realised, it became clear fairly early on to me in the competition that the bids were from three distinct three-day time periods, and that the time between the first and last bid was probably very close to exactly 31 days. Based on this information I could convert the obfuscated ‘time’ field into more meaningful units.
A number of other features stemmed from having this info on hand…
A plot of bot density by hour of the day showed bots were more common during the ‘off-peak’ bidding periods. This suggested that taking the total proportion of a user’s bids in each hour of the day was likely to be a useful feature. The below code shows how I did this:
Bids per time and mean time between bids are self-explanatory. Time active I defined as the time between a bidder’s first and last bid.
I originally extracted these three features using the entire bids table at once. I later realised however, that the features could be skewed by the fact that there were three distinct time chunks. For instance, the mean time between a user’s bids was calculated by looking at the average length of time between a user’s bids. If a user had two bids in separate time chunks, this metric would be artificially inflated by the missing data between the time chunks.
Thus I figured out the ‘cut off times’ for each bid chunk’s start and end, divided into three chunks, extracted my features from each, and then took the overall means of the three features:
I took this feature as potentially meaning the bidder won the auction.
This didn’t turn out to be particularly useful:
The idea here was that a human might be more varied in terms of the hours of the day the bidded in, or maybe the opposite. For each of the 9 days I calculated the
Using the example of auctions per country, this feature was extracted by creating a table of bidder_id’s by countries and then placing the number of unique auctions in each country/bidder_id combination in the table cells. Row-wise mean, variance, skewness and kurtosis were then obtained. This was repeated for many possible combination of IPs, URLs, bids, auctions, devices, countries and hours:
To set a benchmark, I first tried modelling the bot probability using logistic regression. As expected this wasn’t particularly effective. RandomForest was the next model I tried. I also experimented with adaboost and extraTrees from the caret package, as well as xgboost. None of these were able to outperform RandomForest, however.
In the training set of some ~2000 bidders, there were only about 100 bots. I found I was able to improve both local cross validation (CV) and public leaderboard scores by over-sampling the bots prior to training the model. I achieved this through an R function:
While I did experiment with the cross validation features packaged with caret, I preferred the flexibilty of my own CV routine. I used 5- or 10-fold CV, depending on how much time I wanted to wait for resutls (usually used 10-fold).
I found my public leaderboard scores were usually higher than my CV scores, which I thought was a bit strange. I was probably overfitting the public leaderboard to some extent, or just getting lucky, because my final score on the private leaderboard ended up been much closer to my general CV performance.
The below loop gives the general gist of how I trained, tested and tuned my RF model using the training set:
After testing models out via cross validation, the below general code snippet was used to make submissions:
I didn’t end up using all of the variables generated during the feature engineering stage in my final model (there were some 1400 in total), though some of my best-scoring models did include as many as 1200 features. The ‘core’ model had around 315 predictor variables. These particular 315 came out of various tests using RF importance, balanced with my findings on what seemed to just work. When I added the mean/variance/skewness/kurtosis set of features, performance seemed to degrade, so a number of these features ended up being excluded. I tried to address the high dimensionality problem in various ways - reducing sets of highly-correlated variables, and removing variables with low RF importance scores - however none of these seemed to really improve performance. The takeaway from that for me was that RandomForest seems to be very effective at extracting all of the relevant information from the variables you give it, without being confounded by superfluous or barely-relevant variables. I’m not sure if this is always the case, but it seemed to be so here - removing variables that seemed like they should be useless in a statistical sense usually reduced model accuracy.
If you’re curious, here is the vector of ‘best’ variables that I used in the final model (50 URL principal components variables are all that’s excluded from this list):
Here are some of the key things I learned from this competition, or things I might do differently next time: