<h1>Semi-Supervised Learning (and more): Kaggle Freesound Audio Tagging</h1>
<p><em>Liam Schoneveld, 2019-06-29</em></p>
<p><em>An overview of semi-supervised learning and other techniques I applied to a recent Kaggle competition.</em></p>
<p><img src="/images/fat/spectro.png" alt="a spectrogram of an audio clip" /></p>
<p><em>A spectrogram of one of the audio clips in the FAT2019 competition</em></p>
<p>The Freesound Audio Tagging 2019 (FAT2019) Kaggle competition just wrapped up. I didn’t place too well (my submission was ranked around 144th out of 408 on the private leaderboard). But winning wasn’t exactly my focus. I tried some interesting things and would like to share what I did, plus provide some explanations and code so others might be able to benefit from my work.</p>
<p>This post starts with a brief overview of the competition itself. Then I work chronologically through the main ideas I tried, introducing some of the theory behind each. I also provide some code snippets illustrating each method.</p>
<h2 id="the-competition">The competition</h2>
<p>The Freesound Audio Tagging 2019 competition was about labeling audio clips. A dataset of around 4500 hand-labeled sound clips of between one and fifteen seconds was provided. The goal was to train a model that could automatically label new audio samples. There were 80 possible labels, ranging from ‘acoustic guitar’ to ‘race car’ to ‘screaming’. Audio samples could be tagged with one or more labels. Here are a few examples:</p>
<p><audio ref="themeSong" src="https://raw.githubusercontent.com/nlml/nlml.github.io/master/assets/0.mp3" controls=""></audio></p>
<p><em>Labels = [Accelerating_and_revving_and_vroom, Motorcycle]</em></p>
<p><audio ref="themeSong" src="https://raw.githubusercontent.com/nlml/nlml.github.io/master/assets/2.mp3" controls=""></audio></p>
<p><em>Labels = [Fill_(with_liquid)]</em></p>
<p><audio ref="themeSong" src="https://raw.githubusercontent.com/nlml/nlml.github.io/master/assets/3.mp3" controls=""></audio></p>
<p><em>Labels = [Cheering, Crowd]</em></p>
<h2 id="my-starting-point---mhiro2s-public-kernel">My starting point - mhiro2’s public kernel</h2>
<p>Like many other entrants, my starting point was a <a href="https://www.kaggle.com/mhiro2/simple-2d-cnn-classifier-with-pytorch">public kernel</a> submitted by kaggler <em>mhiro2</em>. This kernel classified samples via a convolutional neural network (convnet) image classifier architecture. ‘Images’ of each audio clip were created by taking the <a href="https://en.wikipedia.org/wiki/Mel-frequency_cepstrum">log-mel spectrogram</a> of the audio signal. 2-second subsets of the audio clips were randomly selected, and the model was then trained via a binary cross-entropy loss (as this is a multi-label classification task). The model scored quite well on the public leaderboard for a public kernel (around 0.610 if I remember correctly).</p>
<h2 id="skip-connections">Skip connections</h2>
<p>I was able to get a big boost in score (~0.610 –> ~0.639) through simply adding <a href="https://arxiv.org/abs/1608.06993">DenseNet</a>-like skip connections to this kernel. I implemented skip connections by concatenating each network layer’s input with its output, prior to downsampling via average pooling.</p>
<h3 id="what-is-it">What is it?</h3>
<p>Skip connections allow the network to bypass layers if it wants to, which can help it to learn simpler functions where beneficial. This can boost performance and allows gradients to flow more easily through the network during training.</p>
<h3 id="implementation-for-fat2019">Implementation for FAT2019</h3>
<p>The change is illustrated in my <a href="https://www.kaggle.com/liamsch/simple-2d-cnn-classifier-with-pytorch">kernel fork</a> and this code snippet:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv1</span><span class="p">(</span><span class="nb">input</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># If this layer is using skip-connection,
</span> <span class="c1"># we concatenate the input with its output:
</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">skip</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="nb">input</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">avg_pool2d</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span></code></pre></figure>
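<p>For a self-contained version of this idea, the block can be written as follows. This is a hypothetical minimal module (names like <code class="language-plaintext highlighter-rouge">SkipBlock</code> are illustrative, not taken from the kernel):</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipBlock(nn.Module):
    """Hypothetical minimal conv block with a DenseNet-like skip connection."""
    def __init__(self, in_channels, out_channels, skip=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.skip = skip

    def forward(self, input):
        x = self.conv1(input)
        x = self.conv2(x)
        if self.skip:
            # Concatenate the block's input with its output along the channel
            # axis, so the next layer sees out_channels + in_channels channels
            x = torch.cat([x, input], 1)
        return F.avg_pool2d(x, 2)
```

<p>With <code class="language-plaintext highlighter-rouge">skip=True</code>, a (1, 3, 16, 16) input to <code class="language-plaintext highlighter-rouge">SkipBlock(3, 8)</code> yields a (1, 11, 8, 8) output: 8 output channels plus the 3 concatenated input channels, spatially halved by the pooling.</p>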
<h2 id="cosine-annealing-learning-rate-scheduling">Cosine annealing learning rate scheduling</h2>
<p>Another key feature of this kernel was <strong>cosine annealing learning rate scheduling</strong>. This was my first experience with this family of techniques, which seem to be growing in popularity thanks to their effectiveness and advocacy from the fast.ai community.</p>
<h3 id="what-is-it-1">What is it?</h3>
<p>In cosine annealing, the learning rate (LR) during training fluctuates between a minimum and maximum LR according to a cosine function. The LR is updated at the end of each epoch according to this function.</p>
<p><img src="/images/fat/cosine.png" alt="cosine annealing learning rate schedule over epochs" /></p>
<p><em>The learning rate (y-axis) used in training over epochs (x-axis) when cosine annealing is enabled</em></p>
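<p>The schedule itself is just a few lines. This is a sketch of the cosine formula from the SGDR paper, with a restart every <code class="language-plaintext highlighter-rouge">t_max</code> epochs; the variable names are illustrative:</p>

```python
import math

def cosine_annealing_lr(epoch, t_max, min_lr=1e-5, max_lr=3e-3):
    # eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T_max)),
    # where t restarts from 0 every t_max epochs
    t = epoch % t_max
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t / t_max))
```

<p>The LR starts at <code class="language-plaintext highlighter-rouge">max_lr</code>, decays to <code class="language-plaintext highlighter-rouge">min_lr</code> over <code class="language-plaintext highlighter-rouge">t_max</code> epochs, then jumps back up, producing the sawtooth-cosine curve in the figure above.</p>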
<p>The ideas behind cosine annealing LR were introduced in <a href="https://arxiv.org/abs/1608.03983">this paper</a>. Often, cosine annealing leads to two main benefits:</p>
<ul>
<li>Training is faster</li>
<li>A better final network is found - despite being faster to train, often the final model obtained produces better test set results than under traditional stochastic gradient descent (SGD)</li>
</ul>
<p>The main theory behind why cosine annealing (or SGD with restarts) leads to better results is well-explained in <a href="https://towardsdatascience.com/https-medium-com-reina-wang-tw-stochastic-gradient-descent-with-restarts-5f511975163">this blog post</a>. In short, there are two purported modes of action:</p>
<ol>
<li>The periods with a large learning rate allow the model to ‘jump’ out of bad local optima to better ones.</li>
<li>If a stable optimum is found that we <em>do not</em> jump out of when we return to a high learning rate, this optimum is likely more general and robust to shifts in the data distribution, and thus leads to better test performance.</li>
</ol>
<p>Cosine LR annealing seems to be a really effective technique. I’m also curious to dive into other practices advocated by the fast.ai crowd, namely <em><a href="https://towardsdatascience.com/finding-good-learning-rate-and-the-one-cycle-policy-7159fe1db5d6">one cycle policies</a></em> and <em><a href="https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0">LR-finding</a></em>.</p>
<h3 id="implementation-for-fat2019-1">Implementation for FAT2019</h3>
<p>Pytorch contains a <code class="language-plaintext highlighter-rouge">CosineAnnealingLR</code> scheduler, and we can see its usage in mhiro2’s kernel. Basically:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">torch.optim.lr_scheduler</span> <span class="kn">import</span> <span class="n">CosineAnnealingLR</span>
<span class="n">max_lr</span> <span class="o">=</span> <span class="mf">3e-3</span> <span class="c1"># Maximum LR
</span><span class="n">min_lr</span> <span class="o">=</span> <span class="mf">1e-5</span> <span class="c1"># Minimum LR
</span><span class="n">t_max</span> <span class="o">=</span> <span class="mi">10</span> <span class="c1"># How many epochs to go from max_lr to min_lr
</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">Adam</span><span class="p">(</span>
<span class="n">params</span><span class="o">=</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">max_lr</span><span class="p">,</span> <span class="n">amsgrad</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">scheduler</span> <span class="o">=</span> <span class="n">CosineAnnealingLR</span><span class="p">(</span>
<span class="n">optimizer</span><span class="p">,</span> <span class="n">T_max</span><span class="o">=</span><span class="n">t_max</span><span class="p">,</span> <span class="n">eta_min</span><span class="o">=</span><span class="n">min_lr</span><span class="p">)</span>
<span class="c1"># Training loop
</span><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
<span class="n">train_one_epoch</span><span class="p">()</span>
<span class="n">scheduler</span><span class="p">.</span><span class="n">step</span><span class="p">()</span></code></pre></figure>
<h2 id="hinge-loss">Hinge loss</h2>
<p>The metric for this competition was <em>lwlwrap</em> (an implementation of this metric can be found <a href="https://www.kaggle.com/christoffer/lwlwrap">here</a>). Without going into too many details, it can be stated that lwlwrap works as a <em>ranking</em> metric. That is, it does not care what numerical score you assign to the target tag(s), only that the targets’ scores are higher than the scores for any other tags.</p>
<p>I theorised that a hinge loss might be better suited to this task than binary cross-entropy, since it too only cares that the scores for the target classes are higher than all others (binary cross-entropy, on the other hand, is somewhat more constrained in terms of the domain of the output scores). I used Pytorch’s <a href="https://pytorch.org/docs/stable/nn.html#multilabelmarginloss"><code class="language-plaintext highlighter-rouge">MultiLabelMarginLoss</code></a> to implement a hinge loss for this purpose. This loss is defined as:</p>
\[\text{loss}(x, y) = \sum_{ij}\frac{\max(0, 1 - (x[y[j]] - x[i]))}{\text{x.size}(0)}\]
<p>This basically encourages the model’s predicted scores for the target labels to be at least 1.0 larger than every single non-target label.</p>
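<p>As a concrete illustration of the loss semantics (a minimal example using Pytorch’s built-in loss, not the competition code):</p>

```python
import torch
import torch.nn as nn

loss_fn = nn.MultiLabelMarginLoss()
# Scores for 4 classes for a single sample
x = torch.tensor([[0.1, 0.9, 0.2, 0.8]])
# Target classes are 1 and 3; the remaining slots are padded with -1
y = torch.tensor([[1, 3, -1, -1]])
# Each (target, non-target) pair contributes max(0, 1 - (score_target - score_other)),
# and the sum is divided by the number of classes:
# (0.2 + 0.3 + 0.3 + 0.4) / 4 = 0.3
loss = loss_fn(x, y)
```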
<p>Unfortunately, despite seeming like a good idea on paper, switching to this loss function did not appear to provide any performance improvement in the competition.</p>
<h2 id="semi-supervised-learning">Semi-supervised learning</h2>
<p>From this point on, a lot of the things I tried centred around <em>semi-supervised learning</em> (SSL). Labeling data is a costly process, but unlabeled data is abundant. In SSL, we seek to benefit from unlabeled data by incorporating it into our model’s training loss, alongside the labeled data. SSL was the focus of my <a href="http://www.scriptiesonline.uba.uva.nl/635970">master’s thesis</a>.</p>
<p>In the FAT2019 competition, we were provided with an additional training dataset of around 20,000 audio samples. The labels on this dataset were ‘noisy’, however, as they were labeled by users. This thus seemed to me like a good place to apply SSL, by just treating these additional samples as unlabeled.</p>
<p>I tried quite a few SSL methods on the competition data; I cover each of these below.</p>
<h2 id="virtual-adversarial-training">Virtual adversarial training</h2>
<p>Virtual adversarial training (VAT) is an SSL technique that was <a href="https://arxiv.org/abs/1704.03976">shown</a> to work very well in the image domain.</p>
<p><img src="/images/fat/vat.png" alt="diagram of virtual adversarial training" /></p>
<p><em>In VAT, we add small amounts of adversarial noise to images, then penalise our model for making different predictions on these images compared to the original images (<a href="https://arxiv.org/abs/1704.03976">source</a>)</em></p>
<h3 id="what-is-it-2">What is it?</h3>
<p>VAT is inspired by the idea of adversarial examples. It has been shown that, if we peer inside an image classifier, we can exploit it and make it misclassify an image by just making tiny changes to that image.</p>
<p>In VAT, we try to generate such adversarial examples on-the-fly during training, and then update our network by saying that its prediction should not change in response to such small changes.</p>
<p>To do this, we need to first find the <em>adversarial direction</em>: the direction we should move our image \( X \) towards such that the model’s prediction changes as much as possible.</p>
<p>To find the adversarial direction, we:</p>
<ol>
<li>
<p>Initialise a random-normal tensor \( \mathbf{r} \) with the same shape as \( X \).</p>
</li>
<li>
<p>Calculate the gradient of \( KL(f(X), f(X + \mathbf{r})) \) with respect to \( \mathbf{r} \), where \( KL(f(\cdot), g(\cdot)) \) is the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback–Leibler divergence</a> between the two probability distributions \( f(\cdot) \) and \( g(\cdot) \).</p>
</li>
<li>
<p>The normalised direction of this gradient is our adversarial direction, which we call \( \mathbf{d} \).</p>
</li>
</ol>
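<p>The three steps above can be sketched roughly as follows. This is a simplified, single-power-iteration sketch for a softmax classifier, with a hypothetical <code class="language-plaintext highlighter-rouge">adversarial_direction</code> helper; it is not the repo’s implementation:</p>

```python
import torch
import torch.nn.functional as F

def adversarial_direction(model, x, xi=10.0):
    """Estimate the adversarial direction d for a batch x (one power iteration)."""
    # Reference prediction f(X); no gradients needed for it
    with torch.no_grad():
        pred = F.softmax(model(x), dim=1)
    # 1. Random-normal tensor r with the same shape as X, unit-normalised per sample
    r = torch.randn_like(x)
    r = r / (r.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-8)
    r.requires_grad_()
    # 2. Gradient of KL(f(X), f(X + xi * r)) with respect to r
    # (in practice you would also zero any model gradients this creates)
    log_pred_hat = F.log_softmax(model(x + xi * r), dim=1)
    kl = F.kl_div(log_pred_hat, pred, reduction='batchmean')
    kl.backward()
    # 3. The normalised gradient is the adversarial direction d
    d = r.grad
    d = d / (d.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-8)
    return d.detach()
```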
<p>Once we have \( \mathbf{d} \), we move \( X \) in that direction by some small scaling factor \( \epsilon \). We then add a term to our loss that penalises the difference in the model’s predictions, i.e.:</p>
\[loss_{\text{unsupervised}}(X) = KL ( f(X), f(X + \epsilon \cdot \mathbf{d}) ) \\
loss = loss_{\text{supervised}}(X, y) + loss_{\text{unsupervised}}(X)\]
<p>Since this \( loss_{\text{unsupervised}} \) term does not depend on any label \( y \), we can also use it with our unlabeled data.</p>
<h3 id="implementation-for-fat2019-2">Implementation for FAT2019</h3>
<p>There is a great Pytorch implementation of VAT on <a href="https://github.com/lyakaap/VAT-pytorch">github</a>. With this implementation, adding VAT to a model is simple:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">vat_loss</span> <span class="o">=</span> <span class="n">VATLoss</span><span class="p">(</span><span class="n">xi</span><span class="o">=</span><span class="mf">10.0</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">ip</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># ... in training loop ...
</span><span class="n">lds</span> <span class="o">=</span> <span class="n">vat_loss</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">cross_entropy</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span> <span class="o">+</span> <span class="n">args</span><span class="p">.</span><span class="n">alpha</span> <span class="o">*</span> <span class="n">lds</span></code></pre></figure>
<p>To use this repo for FAT2019 however, I needed to make a couple of changes to the implementation. The main problem is that it expects a classification model, so it uses softmax before the KL divergence over the classification distribution.</p>
<p>In our case, we use binary cross-entropy to predict a separate distribution <em>for each</em> label, rather than a distribution <em>over</em> labels. To overcome this I replaced the softmax with a sigmoid (where needed), and replaced the KL-divergence loss between the new and old predictions with the binary cross-entropy loss. For details, see the diffs between the <a href="https://github.com/lyakaap/VAT-pytorch/blob/master/vat.py#L60">Pytorch VAT repo</a> and <a href="https://github.com/nlml/freesoundkaggle/blob/master/vat_loss.py#L67">my fork</a>.</p>
<h2 id="mean-teacher">Mean teacher</h2>
<p><a href="https://arxiv.org/abs/1703.01780">Mean teacher</a> held the previous state of the art for SSL on CIFAR10 and other datasets, before being beaten by Mixmatch (which I describe below). It is relatively simple to implement. Unfortunately, though, it seemed to produce little or no benefit for me in the competition.</p>
<p><img src="/images/fat/mean_teacher.png" alt="diagram of the mean teacher approach" /></p>
<p><em>An overview of the mean teacher approach to SSL. A student model learns on a combination of a labeled dataset, and the predictions made by an exponential moving average of its history (the teacher model) (<a href="https://github.com/CuriousAI/mean-teacher">source</a>)</em></p>
<h3 id="what-is-it-3">What is it?</h3>
<p>In semi-supervised mean teacher:</p>
<ul>
<li>We keep two copies of our model - a <em>student</em> model, and a <em>teacher</em> model</li>
<li>Every <em>K</em> iterations (usually every epoch), we update our teacher model’s weights as an exponential moving average (EMA) of the student model’s weights</li>
<li>The student model is trained as usual on the labeled data, but in addition:</li>
<li>We predict labels for our unlabeled data (plus random augmentation) using the teacher model. We then penalise our student model for making different predictions on these same inputs (but with different random augmentation) than those made by the teacher model.</li>
</ul>
<h3 id="implementation-for-fat2019-3">Implementation for FAT2019</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># We need to make a copy of our model to be the teacher
</span><span class="n">ema_model</span> <span class="o">=</span> <span class="n">Classifier</span><span class="p">(</span><span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">).</span><span class="n">cuda</span><span class="p">()</span>
<span class="c1"># This function updates the teacher model with the student
</span><span class="k">def</span> <span class="nf">update_ema_variables</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">ema_model</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">global_step</span><span class="p">):</span>
<span class="c1"># Use the true average until the exponential average is more correct
</span> <span class="n">alpha</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="n">global_step</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span> <span class="n">alpha</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ema_param</span><span class="p">,</span> <span class="n">param</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">ema_model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">()):</span>
<span class="n">ema_param</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">mul_</span><span class="p">(</span><span class="n">alpha</span><span class="p">).</span><span class="n">add_</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">param</span><span class="p">.</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># ... in training loop
</span><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">)</span>
<span class="c1"># Update the teacher model
</span> <span class="n">update_ema_variables</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">ema_model</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">global_step</span><span class="p">)</span>
<span class="c1"># Predict unsupervised batch (with augmentation) with the teacher
</span> <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">ema_model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">teacher_pred</span> <span class="o">=</span> <span class="n">ema_model</span><span class="p">(</span><span class="n">unsup_data_aug1</span><span class="p">.</span><span class="n">cuda</span><span class="p">()</span>
<span class="c1"># We use sigmoid rather than softmax, as this is a
</span> <span class="c1"># multi-label tagging task, rather than classification
</span> <span class="n">unsup_targ</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">teacher_pred</span><span class="p">).</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># Predict unsupervised batch (with different augmentation)
</span> <span class="c1"># with the student and add error to the loss
</span> <span class="n">unsup_output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">unsup_data_aug2</span><span class="p">.</span><span class="n">cuda</span><span class="p">())</span>
<span class="n">loss_unsup</span> <span class="o">=</span> <span class="n">unsup_criterion</span><span class="p">(</span><span class="n">unsup_output</span><span class="p">,</span> <span class="n">unsup_targ</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">+=</span> <span class="n">loss_unsup</span> <span class="o">*</span> <span class="n">unsup_loss_weight</span></code></pre></figure>
<h2 id="mixup">Mixup</h2>
<p>Another technique I (and many other Kagglers) played around with was <a href="https://arxiv.org/abs/1710.09412">mixup</a>. In basic mixup, we combine two images \( \mathbf{X}_1 \) and \( \mathbf{X}_2 \) with a factor \( \alpha \) to become a single image, \( \alpha \mathbf{X}_1 + (1 - \alpha) \mathbf{X}_2 \). We then train on these combined images with combined labels \( \alpha \mathbf{y}_1 + (1 - \alpha) \mathbf{y}_2 \). Though it seems strange to ‘combine’ images like this, this seems to have a regularisation effect on models, and leads to better generalisation and results in general.</p>
<p>Applying mixup to audio perhaps makes more sense, as it is quite natural to add pieces of audio together, at least in the time domain. In the spectral domain, I’m not sure if this is still so natural. Still, it was a popular technique in this competition that seemed to provide some performance boost.</p>
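<p>Basic mixup on a batch can be sketched as follows. This assumes one-hot or multi-hot label vectors, and, as in the mixup paper, draws the mixing factor from a Beta distribution (the function name and default parameter are illustrative):</p>

```python
import torch

def mixup_batch(x, y, beta_param=0.4):
    # Draw the mixing factor alpha from a Beta distribution
    alpha = torch.distributions.Beta(beta_param, beta_param).sample().item()
    # Pair each sample with a randomly permuted partner from the same batch
    idx = torch.randperm(x.size(0))
    x_mixed = alpha * x + (1 - alpha) * x[idx]
    y_mixed = alpha * y + (1 - alpha) * y[idx]
    return x_mixed, y_mixed
```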
<h2 id="mixmatch">Mixmatch</h2>
<p><a href="https://arxiv.org/abs/1905.02249">Mixmatch</a> is an SSL technique from Google Research. It achieved relatively large gains in SSL performance on CIFAR10 and other benchmarks, beating already-impressive state-of-the-art performance of other techniques.</p>
<p><img src="/images/fat/mixmatch.png" alt="The Mixmatch labeling procedure" /></p>
<p><em>Mixmatch produces labels for unlabeled data points by averaging their predictions over many augmentations, and then sharpening this average prediction (<a href="https://arxiv.org/abs/1905.02249">source</a>)</em></p>
<h3 id="what-is-it-4">What is it?</h3>
<p>In Mixmatch:</p>
<ul>
<li>We make K augmentations of a given unlabeled image, then predict it with our model to get K predictions</li>
<li>We then average the K predictions to get a single prediction for that image</li>
<li>We then <em>sharpen</em> this average prediction, such that confident classes become more confident, and unconfident classes become even less confident</li>
<li>We then have labels for a batch of unlabeled data (plus our true labels for the batch of labeled data). We apply mixup over this whole set of labeled data, and train on it.</li>
</ul>
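<p>For a categorical distribution, the sharpening step in the Mixmatch paper raises each averaged class probability to the power \( 1/T \) and renormalises. A minimal sketch of that operator (the binary cross-entropy case used in this competition needs a different treatment, described in the next section):</p>

```python
import torch

def sharpen_categorical(p, T=0.5):
    # Raise each class probability to 1/T and renormalise;
    # T < 1 pushes mass towards the most confident classes
    p_sharp = p ** (1.0 / T)
    return p_sharp / p_sharp.sum(dim=-1, keepdim=True)
```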
<h3 id="implementation-for-fat2019-4">Implementation for FAT2019</h3>
<p>One difficulty with transferring this method to FAT2019 was that the idea of <em>sharpening</em> predictions is not as well-defined in the binary cross-entropy case. Since, as mentioned above, this is in fact a ranking problem, our model could still perform very well even while outputting very low confidence predictions for all classes.</p>
<p>To sharpen in the binary cross-entropy setting, we essentially (either explicitly or implicitly) need to define some threshold at which we call a prediction ‘confident’, and increase its label in the sharpening, or ‘unconfident’, and decrease its label. A natural choice for this would be 0.5.</p>
<p>Ultimately though, I could not get Mixmatch to perform well, and I think this may be due to the fact that many predictions are quite low confidence in the final-trained models, even though they represent the most confident class. Perhaps selecting the most confident classes and sharpening them by setting their labels to 1 would be a better approach.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">sharpen</span><span class="p">(</span><span class="n">logit</span><span class="p">,</span> <span class="n">T</span><span class="p">):</span>
<span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">T</span> <span class="o">*</span> <span class="n">logit</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">sharpened_guess</span><span class="p">(</span><span class="n">ub</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">T</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">was_training</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">training</span>
<span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">pr</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="n">ub</span><span class="p">))</span> <span class="c1"># shape = [B*K, 80]
</span> <span class="n">guess</span> <span class="o">=</span> <span class="n">pr</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">pr</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">//</span> <span class="n">K</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">).</span><span class="n">mean</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="n">data</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">sharpen</span><span class="p">(</span><span class="n">guess</span><span class="p">,</span> <span class="n">T</span><span class="p">).</span><span class="n">repeat</span><span class="p">([</span><span class="n">K</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="k">if</span> <span class="n">was_training</span><span class="p">:</span>
<span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
<span class="k">return</span> <span class="n">out</span></code></pre></figure>Liam SchoneveldAn overview of semi-supervised learning and other techniques I applied to a recent Kaggle competition.Getting Champion Coordinates from the LoL Minimap using Deep Learning2018-06-25T00:00:00+00:002018-06-25T00:00:00+00:00https://nlml.github.io/neural-networks/getting-champion-coordinates-from-the-lol-minimap<p><em>Using a GAN and a ConvLSTM to go from minimap from to champion coordinates: This post was originally published on <a href="https://medium.com/pandascore-stories/league-of-legends-getting-champion-coordinates-from-the-minimap-using-deep-learning-48a49d35bb74">Medium</a>.</em></p>
<p><em>At PandaScore, we built a model to track the positions of each champion in a League of Legends (LoL) game, based solely on images of the minimap. In this more technical blog post, we describe how we achieved this.</em></p>
<h2 id="background">Background</h2>
<p><a href="https://pandascore.co">PandaScore</a> is <em>the</em> provider of static and real-time data for eSports. We cover a range of video games and tournaments, converting live in-game action into usable data for our customers. These customers range from media outlets, to betting market providers, to eSports teams themselves.</p>
<p>A core part of the work we do involves deep learning and computer vision. This is needed as we take video streams of live eSports matches, and convert them into data describing what is happening in the game.</p>
<p><img src="/images/lolmm/anim.gif" alt="Our champion-tracking model in action on a never-before-seen test video" /></p>
<p>The League of Legends (LoL) minimap is a great example of this work. For this particular task, our specific goal was to build an algorithm that can ‘watch’ the minimap, and output the (x, y) coordinates of each player on the minimap.</p>
<p>We saw creating this model as a high priority for our customers. Knowing the coordinates of each player in each moment of every game opens up a multitude of possibilities. The information could, for example, allow teams to better understand the effectiveness of their play strategies. It could be also be used to predict when certain events are going to happen in a game. Or it could be used to make more engaging widgets for spectators, with real-time stats.</p>
<p>Our customers expect the data we provide to be extremely accurate. Building a model that would be sufficiently reliable was far from an easy task however. We describe why in the next section.</p>
<h2 id="the-problem">The Problem</h2>
<p>In the deep learning literature, the type of problem that involves looking at images and locating or tracking objects in that image is generally referred to as <em>object detection</em>, or <em>tracking</em>.</p>
<p>On the surface, our particular minimap problem appears as though it could be easily solved with detection models such as <a href="https://arxiv.org/abs/1804.02767">YOLO</a> or <a href="https://arxiv.org/abs/1512.02325">SSD</a>. We would just need to label a large dataset of minimap crops with the positions of each champion, and then pass this dataset to one of these algorithms.</p>
<p>Indeed, this was the approach we tried first. Drawing on previous work on the LoL minimap problem done by Farzain Majeed in his <a href="https://medium.com/@farzatv/deepleague-leveraging-computer-vision-and-deep-learning-on-the-league-of-legends-mini-map-giving-d275fd17c4e0">DeepLeague project</a>, we trained an SSD-style model on Farza’s DeepLeague100K dataset, and found it to work quite well on a held-out test set from his dataset.</p>
<p>There was one major problem with this approach however: <strong>the model did not generalise to champions not present in the dataset that it was trained on</strong>. We needed a model that would work for any champion a player happens to choose; a model that produces errors whenever a player chooses a rarely-picked or new champion would not be acceptable to PandaScore's customers.</p>
<p>We spent some weeks exploring a number of routes to resolving this issue. The main options were:</p>
<ol>
<li>
<p><strong>Manually</strong> <strong>annotate a lot more training data</strong>: we ruled this out as it would be too time-consuming to perform and maintain.</p>
</li>
<li>
<p><strong>Train a model to detect the positions of <em>any</em> champion on the minimap, then feed the detected regions from this model to a classifier model covering all champions</strong>: this approach showed some promise early on, but was ultimately deemed unworkable.</p>
</li>
<li>
<p><strong>Train a model on the raw champion ‘portraits’</strong> — the raw portrait images of each champion that the icons on the minimap are based on — <strong>then somehow transfer this model to work in detecting the champions on real minimap frames</strong>.</p>
</li>
</ol>
<p>We ultimately went with approach 3, which we describe in more detail in the next section.</p>
<h2 id="the-approach">The Approach</h2>
<p>The final approach we arrived at relied on a classifier trained on the raw champion portraits. If the classifier only ever saw these portraits during training, we could be more confident that it would not give preferential treatment to the champions that happen to occur in our minimap-frames/hero-coordinates training dataset.</p>
<p>The general idea here is to <strong>train a classifier</strong> on heavily-augmented versions of the raw champion portraits. We could then <strong>slide this trained classifier over minimap frames</strong>, resulting in a grid of predictions. At each square in this grid, we could extract the detection probabilities for each of the 10 champions we know are being played in the current game. These detection grids could then be fed to a second, champion-agnostic model that would learn to clean these up and output the correct (x, y) coordinates for each detected champion.</p>
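<p>As a rough illustration of the sliding step, here is a minimal numpy sketch (with a made-up stand-in for the trained classifier, and made-up patch and stride sizes) of how applying a patch classifier at every grid position yields a grid of per-class detection probabilities:</p>

```python
import numpy as np

def sliding_prediction_grid(minimap, classify_patch, patch=24, stride=24):
    """Apply a patch classifier at every (stride-spaced) position of a
    minimap image, returning a grid of class-probability vectors."""
    h, w = minimap.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    n_classes = classify_patch(minimap[:patch, :patch]).shape[0]
    grid = np.zeros((rows, cols, n_classes))
    for i in range(rows):
        for j in range(cols):
            crop = minimap[i * stride:i * stride + patch,
                           j * stride:j * stride + patch]
            grid[i, j] = classify_patch(crop)
    return grid

# Stand-in classifier: uniform probabilities over 10 champions + background.
uniform_clf = lambda crop: np.full(11, 1. / 11)
minimap = np.random.default_rng(0).random((296, 296, 3))
grid = sliding_prediction_grid(minimap, uniform_clf)
print(grid.shape)  # (12, 12, 11)
```

<p>In practice this sliding is not done with an explicit loop: running a fully-convolutional classifier over the whole crop produces the same grid in a single pass.</p>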
<p>For the classifier however, we found that standard (albeit heavy) augmentation was insufficient to train a model on raw champion portraits that could reliably generalise to the champions as they appear on the minimap. <strong>We needed augmentations that could transform the raw portraits, such that they looked the same as they do on the minimap.</strong></p>
<p><img src="/images/lolmm/portrait_vs_minimap.png" alt="Ideally, we needed a model that could take a raw champion portrait (left), and make it look as though it were on the minimap (right)" /></p>
<p>On the minimap, LoL champions appear with a blue or red circle around them. There can be explosions, pings, and other artifacts that also obfuscate the portraits. We experimented with crudely adding such artifacts manually. We found however, that the most effective approach was to <strong>learn a model that could generate such artifacts</strong>. We achieved this with a Generative Adversarial Network (GAN). In short, GANs are a neural network-based approach that allows us to learn a model that can <em>generate</em> data from a desired distribution (in our case, we essentially want to generate explosions, pings, blue or red circles, and other artifacts to add to the raw champion portraits). A general introduction to GANs can be found <a href="http://blog.kaggle.com/2018/01/18/an-intuitive-introduction-to-generative-adversarial-networks/">here</a>.</p>
<h2 id="training-the-gan">Training the GAN</h2>
<p>Our particular use of GANs differs somewhat from the usual setup. We couldn’t just generate champion images in the minimap environment directly: if we did, our model would only learn to generate the roughly 50 (out of 140) champions that are present in our minimap frames dataset.</p>
<p>Rather, in our case we needed to <strong>generate <em>masks</em> to add to raw champion portraits</strong>. The discriminator of the GAN would thus see the raw champion portrait <em>plus</em> the mask, and the generator would have to learn to change these masks such that the <em>combination</em> looks real. This is illustrated in the diagram below.</p>
<p><img src="/images/lolmm/gan2.png" alt="Diagram showing our GAN setup" /></p>
<p>As the generator’s adversary, the discriminator tries to distinguish between ‘real’ images (crops of hero images taken directly from minimap frames) and ‘fake’ images (generated masks added to random hero portraits). After much tweaking and training time, we were able to train a mask-generating generator, which we put to use in the next section.</p>
<h2 id="training-the-classifier">Training the Classifier</h2>
<p>We now had a trained generator that was capable of producing masks that, when added to any raw champion portrait, would take us to a distribution of images that look (somewhat) like how that champion might appear on the minimap. We could thus train a classifier on this distribution, in the hopes that it would also work for detecting champions on real minimap frames.</p>
<p>The below diagram illustrates the training setup for this classifier:</p>
<p><img src="/images/lolmm/clsf.png" alt="Diagram showing our classifier setup" /></p>
<p>This step is quite simple really. We just train an ordinary convolutional neural network (convnet) classifier <strong>C</strong> on our raw champion portraits, augmented by the GAN-generated masks. We use a shallow, wide classifier network with lots of dropout to prevent overfitting to the GAN-style data.</p>
<h2 id="calculating-the-detection-maps">Calculating the detection maps</h2>
<p>Our classifier is a fully-convolutional neural network that takes 24x24 colour ‘champion-on-the-minimap’ images as input and outputs a <strong>1x1</strong>x(NumChampions + 1) tensor. We pass this tensor through a softmax nonlinearity to estimate class probabilities (the additional output channel is for a background class; we trained our classifier to also detect random patches of minimap with no champion and output a high ‘background’ probability).</p>
<p>If we instead pass an entire minimap crop of size 296x296 to this classifier, we get a <strong>12x12</strong>x(NumChampions + 1) output. Each square of this <strong>12x12</strong> grid represents a region of the minimap, and in each of these squares we have the detection probabilities for each champion. We can increase the resolution of this ‘detection map’ to <strong>70x70</strong> by reducing the stride of the final two layers of our classifier (a convolution layer followed by an average pooling layer) to 1, from 2 (this trick has been applied elsewhere, <a href="https://arxiv.org/abs/1312.6229">e.g. in this work</a>).</p>
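<p>The effect of this stride reduction can be sanity-checked with the standard valid-convolution output-size formula. The toy layer stacks below are illustrative only (made-up kernel sizes, not the actual classifier architecture):</p>

```python
def conv_stack_output_size(input_size, layers):
    """Chain out = (in - kernel) // stride + 1 over a list of
    (kernel, stride) layers, i.e. valid convolutions with no padding."""
    size = input_size
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size

# Two stride-2 layers vs the same two layers with stride reduced to 1:
print(conv_stack_output_size(48, [(3, 2), (3, 2)]))  # 11
print(conv_stack_output_size(48, [(3, 1), (3, 1)]))  # 44
```

<p>Setting the final strides to 1 yields a much denser output grid over the same input, without retraining: the weights are unchanged, only the positions at which they are evaluated.</p>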
<p><img src="/images/lolmm/detectionmap.png" alt="Diagram showing the procedure for producing the detection maps, in this case for Janna (who here is the champion with white hair at the bottom left of the minimap, where our strongest detection also is)" /></p>
<p>We slice out these ‘detection maps’ — as shown above — for each of the ten champions present in the current game. We also slice out the detection map for the background class. This 70x70x11 tensor then serves as the input to the final stage in our minimap model — a convolutional LSTM sequence model.</p>
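<p>The slicing itself is just channel indexing. A minimal numpy sketch, with made-up champion indices and random stand-in detection maps:</p>

```python
import numpy as np

num_champions = 140             # classifier channels: all champions...
background_idx = num_champions  # ...plus one background channel
full_maps = np.random.default_rng(0).random((70, 70, num_champions + 1))

# Indices of the 10 champions known to be in the current game (made up here).
in_game = [3, 17, 22, 45, 58, 71, 88, 99, 104, 131]

# Keep those 10 channels plus background: the 70x70x11 sequence-model input.
detection_input = full_maps[:, :, in_game + [background_idx]]
print(detection_input.shape)  # (70, 70, 11)
```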
<h2 id="training-the-sequence-model">Training the sequence model</h2>
<p>Very often, when champions are close to one another, <strong>one champion’s icon on the minimap will cover that of another</strong>. This poses issues for our classifier from the previous step, which cannot detect the champion that is being covered. As our customers rely upon the accuracy of our data feeds, we needed to address this issue. To do so, we enlisted a sequence model.</p>
<p>The idea here is that a sequence model can keep some ‘memory’ of where each champion was last seen. If a champion suddenly disappears while another champion is nearby, the model can then ‘assume’ that the missing champion is probably just behind the nearby one.</p>
<p><img src="/images/lolmm/seq.png" alt="Diagram illustrating the sequence model architecture" /></p>
<p>The above diagram presents the architecture of our sequence model. We take the 11 detection maps (<strong>D_it</strong>) extracted as described in the previous section (ten champions + one background), and pass each independently through the same convnet, which reduces their resolution and extracts relevant information. A low resolution copy of the minimap crop itself (<strong>M_t</strong>) is also passed through a separate convnet, the idea being that some low-resolution features about what is going on in the game might also be useful (e.g. if there is a lot of action, then non-detected champions are likely just hidden among that action).</p>
<p>The minimap and detection map features extracted from these convnets are then stacked into a single tensor of shape 35x35xF, where F is the total number of features (the minimap and detection map inputs were of size 70x70, and our convnets halved this resolution). We call this tensor <strong>r_t</strong> in the above diagram, as we have one of these tensors at each time step. These <strong>r_t</strong> are then fed sequentially into a convolutional LSTM (see <a href="https://arxiv.org/abs/1506.04214">this paper</a> for conv-LSTM implementation details). We found switching from a regular LSTM to a convolutional LSTM to be hugely beneficial. Presumably, this was because the regular LSTM needed to learn the same ‘algorithm’ for each location on the minimap, whereas the conv-LSTM allowed this to be shared across locations.</p>
<p>At each time step, each of the convolutional LSTM’s 10 output channels (<strong>o_it</strong>, one <strong>i</strong> for each champion) is passed through the same dense (fully-connected) layer. This then outputs x and y coordinates for each champion. The mean squared error (MSE) between the output and target coordinates is then backpropagated to the weights of this network. The model converges after 6 or so hours of training on a single GPU (we trained on our own dataset of around 80 games, which was obtained in a similar way to that described by <a href="https://medium.com/@farzatv/deepleague-part-2-the-technical-details-374439e7e09a">Farza</a>).</p>
<h2 id="results">Results</h2>
<p>We are still evaluating our network more rigorously before moving it into production. However, results on our in-house test set suggest that more than <strong>95% of all detections are within a 20 pixel radius of the target</strong>. Out of interest, we also tested the necessity of the GAN augmentation, but found performance to be substantially degraded when using standard augmentation alone, as opposed to augmenting with the GAN-generated masks. So it seems <strong>all our GAN training was not for nothing :)</strong></p>
<p>This article is quite light on implementation details, and we’re sure some of our more technical readers will want to know more. If you have questions, please don’t hesitate to ask them here in the comments, or in the r/machinelearning thread.</p>Liam SchoneveldUsing a GAN and a ConvLSTM to go from minimap frames to champion coordinates: This post was originally published on Medium.In Raw Numpy: t-SNE2017-09-18T00:00:00+00:002017-09-18T00:00:00+00:00https://nlml.github.io/in-raw-numpy/in-raw-numpy-t-sne<p>This is the first post in the <em>In Raw Numpy</em> series. This series is an attempt to provide readers (and myself) with an understanding of some of the most frequently-used machine learning methods by going through the math and intuition, and implementing them using just python and numpy.</p>
<p>You can find the full code accompanying this post <a href="https://github.com/nlml/tsne_raw">here</a>.</p>
<h2 id="dimensionality-reduction">Dimensionality reduction</h2>
<p>t-SNE is an algorithm that lets us do <em>dimensionality reduction</em>. This means we can take some data that lives in a high-dimensional space (such as images, which usually consist of thousands of pixels), and visualise it in a lower-dimensional space. This is desirable, as humans are much better at understanding data when it is presented in a two- or three-dimensional space.</p>
<p>Take <a href="https://www.tensorflow.org/get_started/mnist/beginners">MNIST</a> for example, a classic dataset of images of handwritten digits from 0 to 9. MNIST images are 28x28 pixels, meaning they live in 784-dimensional space. With t-SNE, we can reduce this to just two dimensions, and get a picture like this:</p>
<p><img src="/images/tsne/tsne-mnist.png" alt="t-SNE fit to the MNIST dataset" />
<em>MNIST images visualised in two dimensions using t-SNE. Colours indicate the digit of each image. (<a href="https://bigsnarf.wordpress.com/2016/11/17/t-sne-attack-data/">via</a>)</em></p>
<p>From here on, this article is focused on the implementation of t-SNE. If you want to understand more about dimensionality reduction in general, I recommend <a href="http://colah.github.io/posts/2014-10-Visualizing-MNIST/">this great blog post from Chris Olah</a>. If you’re interested in learning how to use t-SNE effectively, then definitely <a href="https://distill.pub/2016/misread-tsne/">check this out</a>.</p>
<h2 id="before-t-sne-sne">Before t-SNE: SNE</h2>
<p><em>t-distributed Stochastic Neighbor Embedding</em>, or t-SNE, was developed by Geoffrey Hinton and Laurens van der Maaten. Their <a href="http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">paper introducing t-SNE</a> is very clear and easy to follow, and I more or less follow it in this post.</p>
<p>As suggested by the acronym, most of t-SNE is SNE, or the <em>Stochastic Neighbor Embedding</em> algorithm. We cover this first.</p>
<h3 id="sne-setup-and-overall-goal">SNE: setup and overall goal</h3>
<p>We have a dataset \(\mathbf{X}\), consisting of \(N\) data points. Each data point \(x_i\) has \(D\) dimensions. We wish to reduce this to \(d\) dimensions. Throughout this post we assume without loss of generality that \(d=2\).</p>
<p>SNE works by converting the euclidean distance between data points to conditional probabilities that represent similarities:</p>
<p><a name="eq1"></a></p>
\[p_{j|i} = \frac{\exp \left ( - || x_i - x_j || ^2 \big / 2 \sigma_i^2 \right ) }{\sum_{k \neq i} \exp \left ( - || x_i - x_k || ^2 \big / 2 \sigma_i^2 \right )} \hspace{2em} (1)\]
<p>Essentially this is saying that the probability of point \(x_j\) being picked as a neighbour of point \(x_i\) decays with the squared distance between the two points: the closer they are, the higher the probability (we’ll see where the \(\sigma_i\)’s come from a bit later).</p>
<p>One thing to note here is that we set \( p_{i|i} = 0 \) for all \(i\), as we are not interested in how much of a neighbour each point is with itself.</p>
<p>Let’s introduce matrix \(\mathbf{Y}\).</p>
<p>\(\mathbf{Y}\) is an \(N\)x\(2\) matrix that is our 2D representation of \(\mathbf{X}\).</p>
<p>Based on \(\mathbf{Y}\) we can construct distribution \(q\) as per our construction of \(p\) (but without the \(\sigma\)’s):</p>
\[q_{j|i} = \frac{\exp \left ( - || y_i - y_j || ^2 \right ) }{\sum_{k \neq i} \exp \left ( - || y_i - y_k || ^2 \right ) }\]
<p>Our overall goal is to pick the points in \(\mathbf{Y}\) such that this resulting conditional probability distribution \(q\) is similar to \(p\). This is achieved by minimising a cost: the KL-divergence between these two distributions. This is defined as follows:</p>
\[C = \sum_i KL(P_i || Q_i) = \sum_i \sum_j p_{j|i} \log \frac {p_{j|i}} {q_{j|i}}\]
<p>We want to minimise this cost. Since we’re going to use gradient descent, we’re only really interested in its gradient with respect to our 2D representation \(\mathbf{Y}\). But more on that later.</p>
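<p>Although only the gradient is needed for the optimisation, the cost value itself is handy for monitoring convergence. A direct numpy translation of the formula above (the small epsilon is just an assumption here, to keep the log finite where entries are zero):</p>

```python
import numpy as np

def sne_cost(P, Q):
    """KL-divergence cost C = sum_i sum_j p_j|i * log(p_j|i / q_j|i).
    P and Q are NxN conditional-probability matrices with zero diagonals."""
    eps = 1e-12  # avoids log(0); zero-probability terms contribute 0
    return np.sum(P * np.log((P + eps) / (Q + eps)))

# The KL divergence is zero when the two distributions match...
P = np.array([[0.0, 0.7, 0.3], [0.4, 0.0, 0.6], [0.5, 0.5, 0.0]])
print(sne_cost(P, P))  # ~0.0
# ...and positive otherwise.
Q = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
print(sne_cost(P, Q) > 0)  # True
```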
<h3 id="euclidean-distances-matrix-in-numpy">Euclidean distances matrix in numpy</h3>
<p>Let’s code something. Both the formulas for \(p_{j|i}\) and \(q_{j|i}\) require the negative squared euclidean distance (this part: \(- || x_i - x_j || ^2 \)) between all pairs of points in a matrix.</p>
<p>In numpy we can implement this as:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">neg_squared_euc_dists</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
<span class="s">"""Compute matrix containing negative squared euclidean
distance for all pairs of points in input matrix X
# Arguments:
X: matrix of size NxD
# Returns:
NxN matrix D, with entry D_ij = negative squared
euclidean distance between rows X_i and X_j
"""</span>
<span class="c1"># Math? See https://stackoverflow.com/questions/37009647
</span> <span class="n">sum_X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">D</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span><span class="p">),</span> <span class="n">sum_X</span><span class="p">).</span><span class="n">T</span><span class="p">,</span> <span class="n">sum_X</span><span class="p">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">D</span></code></pre></figure>
<p>This function uses a bit of linear algebra magic for efficiency, but it returns an \(N\)x\(N\) matrix whose \((i,j)\)’th entry is the negative squared euclidean distance between input points \(x_i\) and \(x_j\).</p>
<p>As someone who uses neural networks a lot, when I see \( \exp(\cdot) \big / \sum \exp(\cdot) \) like in <a href="#eq1">\((1)\)</a>, I think softmax. Here is the softmax function we will use:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">diag_zero</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
<span class="s">"""Take softmax of each row of matrix X."""</span>
<span class="c1"># Subtract max for numerical stability
</span> <span class="n">e_x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">X</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">reshape</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))</span>
<span class="c1"># We usually want diagonal probabilities to be 0.
</span> <span class="k">if</span> <span class="n">diag_zero</span><span class="p">:</span>
<span class="n">np</span><span class="p">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">e_x</span><span class="p">,</span> <span class="mf">0.</span><span class="p">)</span>
<span class="c1"># Add a tiny constant for stability of log we take later
</span> <span class="n">e_x</span> <span class="o">=</span> <span class="n">e_x</span> <span class="o">+</span> <span class="mf">1e-8</span> <span class="c1"># numerical stability
</span>
<span class="k">return</span> <span class="n">e_x</span> <span class="o">/</span> <span class="n">e_x</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">reshape</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span></code></pre></figure>
<p>Note that we have taken care of the need for \( p_{i|i} = 0 \) by replacing the diagonal entries of the exponentiated negative distances matrix with zeros (using <code class="language-plaintext highlighter-rouge">np.fill_diagonal</code>).</p>
<p>Putting these two functions together we can make a function that gives us a matrix \(P\), whose \((i,j)\)’th entry is \( p_{j|i} \) as defined in <a href="#eq1">\((1)\)</a>:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">calc_prob_matrix</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">sigmas</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="s">"""Convert a distances matrix to a matrix of probabilities."""</span>
<span class="k">if</span> <span class="n">sigmas</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">two_sig_sq</span> <span class="o">=</span> <span class="mf">2.</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">sigmas</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">softmax</span><span class="p">(</span><span class="n">distances</span> <span class="o">/</span> <span class="n">two_sig_sq</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">softmax</span><span class="p">(</span><span class="n">distances</span><span class="p">)</span></code></pre></figure>
<h3 id="perplexed">Perplexed?</h3>
<p>In the previous code snippet, the <code class="language-plaintext highlighter-rouge">sigmas</code> argument should be an \(N\)-length vector containing each of the \(\sigma_i\)’s. How do we get these \(\sigma_i\)’s? This is where <strong>perplexity</strong> comes into SNE. The perplexity of any of the rows of the conditional probabilities matrix \(P\) is defined as:</p>
\[Perp(P_i) = 2^{H(P_i)}\]
<p>Here \(H(P_i)\) is the Shannon entropy of \(P_i\) in bits:</p>
\[H(P_i) = - \sum_j p_{j|i} \log_2 p_{j|i}\]
<p>In SNE (and t-SNE) perplexity is a <em>parameter</em> that we set (usually between 5 and 50). We then set the \(\sigma_i\)’s such that for each row of \(P\), the perplexity of that row is equal to our <em>desired</em> perplexity – the parameter we set.</p>
<p>Let’s intuit about this for a moment. If a probability distribution has high entropy, it means that it is relatively flat – that is, the probabilities of most of the elements in the distribution are around the same.</p>
<p>Perplexity increases with entropy. Thus, if we desire higher perplexity, we want all of the \(p_{j|i}\) (for a given \(i\)) to be more similar to each other. In other words, we want the probability distribution \(P_i\) to be flatter. We can achieve this by increasing \(\sigma_i\) – this acts just like the <a href="https://en.wikipedia.org/wiki/Softmax_function#Reinforcement_learning">temperature parameter sometimes used in the softmax function</a>. The larger the \(\sigma_i\) we divide by, the closer the probability distribution gets to having all probabilities equal to just \(1/N\).</p>
<p>So, if we want higher perplexity it means we are going to set our \(\sigma_i\)’s to be larger, which will cause the conditional probability distributions to become flatter. This essentially increases the number of neighbours each point has (if we define \(x_i\) and \(x_j\) as neighbours if \(p_{j|i}\) is below a certain probability threshold). This is why you may hear people roughly equating the perplexity parameter to the number of neighbours we believe each point has.</p>
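<p>This temperature effect is easy to verify numerically. A toy illustration, with made-up negative squared distances from one point to three neighbours:</p>

```python
import numpy as np

def row_probs(neg_sq_dists, sigma):
    """Softmax of one row of negative squared distances, scaled by 2*sigma^2."""
    x = neg_sq_dists / (2. * sigma ** 2)
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def perp(p):
    """Perplexity = 2 ** Shannon entropy (in bits)."""
    return 2. ** -np.sum(p * np.log2(p))

row = np.array([-1., -4., -9.])  # made-up negative squared distances
for sigma in (0.5, 1., 5.):
    p = row_probs(row, sigma)
    print(sigma, np.round(p, 3), round(perp(p), 2))
```

<p>Larger \(\sigma_i\) flattens the distribution, pushing the perplexity up from near 1 towards the number of neighbours (here 3).</p>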
<h3 id="finding-the-sigma_is">Finding the \(\sigma_i\)’s</h3>
<p>To ensure the perplexity of each row of \(P\), \(Perp(P_i)\), is equal to our desired perplexity, we simply perform a binary search over each \(\sigma_i\) until \(Perp(P_i)=\) our desired perplexity.</p>
<p>This is possible because perplexity \(Perp(P_i)\) is a monotonically increasing function of \(\sigma_i\).</p>
<p>Here’s a basic binary search function in python:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">binary_search</span><span class="p">(</span><span class="n">eval_fn</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">1e-10</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span>
<span class="n">lower</span><span class="o">=</span><span class="mf">1e-20</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mf">1000.</span><span class="p">):</span>
<span class="s">"""Perform a binary search over input values to eval_fn.
# Arguments
eval_fn: Function that we are optimising over.
target: Target value we want the function to output.
tol: Float, once our guess is this close to target, stop.
max_iter: Integer, maximum num. iterations to search for.
lower: Float, lower bound of search range.
upper: Float, upper bound of search range.
# Returns:
Float, best input value to function found during search.
"""</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iter</span><span class="p">):</span>
<span class="n">guess</span> <span class="o">=</span> <span class="p">(</span><span class="n">lower</span> <span class="o">+</span> <span class="n">upper</span><span class="p">)</span> <span class="o">/</span> <span class="mf">2.</span>
<span class="n">val</span> <span class="o">=</span> <span class="n">eval_fn</span><span class="p">(</span><span class="n">guess</span><span class="p">)</span>
<span class="k">if</span> <span class="n">val</span> <span class="o">></span> <span class="n">target</span><span class="p">:</span>
<span class="n">upper</span> <span class="o">=</span> <span class="n">guess</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">lower</span> <span class="o">=</span> <span class="n">guess</span>
<span class="k">if</span> <span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">val</span> <span class="o">-</span> <span class="n">target</span><span class="p">)</span> <span class="o"><=</span> <span class="n">tol</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">return</span> <span class="n">guess</span></code></pre></figure>
<p>To find our \(\sigma_i\), we need to pass an <code class="language-plaintext highlighter-rouge">eval_fn</code> to this <code class="language-plaintext highlighter-rouge">binary_search</code> function that takes a given \(\sigma_i\) as its argument and returns the perplexity of \(P_i\) with that \(\sigma_i\).</p>
<p>The <code class="language-plaintext highlighter-rouge">find_optimal_sigmas</code> function below does exactly this to find all \(\sigma_i\)’s. It takes a matrix of negative euclidean distances and a target perplexity. For each row of the distances matrix, it performs a binary search over possible values of \(\sigma_i\) until finding that which results in the target perplexity. It then returns a numpy vector containing the optimal \(\sigma_i\)’s that were found.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">calc_perplexity</span><span class="p">(</span><span class="n">prob_matrix</span><span class="p">):</span>
<span class="s">"""Calculate the perplexity of each row
of a matrix of probabilities."""</span>
<span class="n">entropy</span> <span class="o">=</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">prob_matrix</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log2</span><span class="p">(</span><span class="n">prob_matrix</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">perplexity</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="n">entropy</span>
<span class="k">return</span> <span class="n">perplexity</span>
<span class="k">def</span> <span class="nf">perplexity</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">sigmas</span><span class="p">):</span>
<span class="s">"""Wrapper function for quick calculation of
perplexity over a distance matrix."""</span>
<span class="k">return</span> <span class="n">calc_perplexity</span><span class="p">(</span><span class="n">calc_prob_matrix</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">sigmas</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">find_optimal_sigmas</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">target_perplexity</span><span class="p">):</span>
<span class="s">"""For each row of distances matrix, find sigma that results
    in target perplexity for that row."""</span>
<span class="n">sigmas</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># For each row of the matrix (each point in our dataset)
</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">distances</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
<span class="c1"># Make fn that returns perplexity of this row given sigma
</span> <span class="n">eval_fn</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">sigma</span><span class="p">:</span> \
<span class="n">perplexity</span><span class="p">(</span><span class="n">distances</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="p">:],</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">sigma</span><span class="p">))</span>
<span class="c1"># Binary search over sigmas to achieve target perplexity
</span> <span class="n">correct_sigma</span> <span class="o">=</span> <span class="n">binary_search</span><span class="p">(</span><span class="n">eval_fn</span><span class="p">,</span> <span class="n">target_perplexity</span><span class="p">)</span>
<span class="c1"># Append the resulting sigma to our output array
</span> <span class="n">sigmas</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">correct_sigma</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">sigmas</span><span class="p">)</span></code></pre></figure>
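<p>As an aside, it is worth pinning down what a target perplexity of, say, 20 actually means. Perplexity is two to the power of the Shannon entropy of a distribution, so a uniform distribution over \(k\) outcomes has perplexity exactly \(k\) — it can be read as a smooth measure of the effective number of neighbours each point considers. A quick standalone illustration (the helper function here is mine, not part of this post’s code):</p>

```python
import numpy as np

def perplexity_of(probs):
    """Perplexity = 2 ** (Shannon entropy in bits) of a distribution."""
    probs = probs[probs > 0.]  # avoid log2(0)
    entropy = -np.sum(probs * np.log2(probs))
    return 2 ** entropy

# A uniform distribution over 10 outcomes has perplexity exactly 10,
# so a target perplexity of 20 roughly means "20 effective neighbours".
print(perplexity_of(np.full(10, 0.1)))  # ~10.0

# A peakier distribution has a much lower perplexity
print(perplexity_of(np.array([0.97, 0.01, 0.01, 0.01])))  # < 2
```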
<h2 id="actually-lets-do-symmetric-sne">Actually… Let’s do Symmetric SNE</h2>
<p>We now have everything we need to estimate SNE – we have \(q\) and \(p\). We <em>could</em> find a decent 2D representation \(\mathbf{Y}\) by descending the gradient of the cost \(C\) with respect to \(\mathbf{Y}\) until convergence.</p>
<p>Since the gradient of SNE is a little trickier to implement, however, let’s instead use Symmetric SNE, which is also introduced in the t-SNE paper as an alternative that is “just as good.”</p>
<p>In Symmetric SNE, we minimise a KL divergence over the joint probability distributions with entries \(p_{ij}\) and \(q_{ij}\), as opposed to conditional probabilities \(p_{i|j}\) and \(q_{i|j}\). Defining a joint distribution, each \(q_{ij}\) is given by:</p>
<p><a name="eq2"></a></p>
\[q_{ij} = \frac{\exp \left ( - || y_i - y_j || ^2 \right ) }{\sum_{k \neq l} \exp \left ( - || y_k - y_l || ^2 \right ) } \hspace{2em} (2)\]
<p>This is just like the softmax we had before, except now the normalising term in the denominator is summed over the entire matrix, rather than just the current row.</p>
<p>To avoid problems related to outlier \(x\) points, rather than using an analogous distribution for \(p_{ij}\), we simply set \(p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N}\).</p>
<p>We can easily obtain these newly-defined joint \(p\) and \(q\) distributions in python:</p>
<ul>
<li>the joint \(p\) is just \( \frac {P + P^T} {2N } \), where \(P\) is the conditional probabilities matrix with \((i,j)\)’th entry \(p_{j|i}\)</li>
    <li>to estimate the joint \(q\) we can calculate the negative squared euclidean distances matrix from \(\mathbf{Y}\), exponentiate it, then divide all entries by the total sum.</li>
</ul>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">q_joint</span><span class="p">(</span><span class="n">Y</span><span class="p">):</span>
<span class="s">"""Given low-dimensional representations Y, compute
matrix of joint probabilities with entries q_ij."""</span>
<span class="c1"># Get the distances from every point to every other
</span> <span class="n">distances</span> <span class="o">=</span> <span class="n">neg_squared_euc_dists</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span>
<span class="c1"># Take the elementwise exponent
</span> <span class="n">exp_distances</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">distances</span><span class="p">)</span>
<span class="c1"># Fill diagonal with zeroes so q_ii = 0
</span> <span class="n">np</span><span class="p">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">exp_distances</span><span class="p">,</span> <span class="mf">0.</span><span class="p">)</span>
<span class="c1"># Divide by the sum of the entire exponentiated matrix
</span> <span class="k">return</span> <span class="n">exp_distances</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">exp_distances</span><span class="p">),</span> <span class="bp">None</span>
<span class="k">def</span> <span class="nf">p_conditional_to_joint</span><span class="p">(</span><span class="n">P</span><span class="p">):</span>
<span class="s">"""Given conditional probabilities matrix P, return
approximation of joint distribution probabilities."""</span>
<span class="k">return</span> <span class="p">(</span><span class="n">P</span> <span class="o">+</span> <span class="n">P</span><span class="p">.</span><span class="n">T</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="mf">2.</span> <span class="o">*</span> <span class="n">P</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span></code></pre></figure>
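<p>A quick sanity check on this function: any valid conditional probabilities matrix (zero diagonal, rows summing to one) should map to a joint matrix that is symmetric and whose entries sum to one. Here is a small sketch, reusing <code class="language-plaintext highlighter-rouge">p_conditional_to_joint</code> exactly as defined above on a fake conditional matrix:</p>

```python
import numpy as np

def p_conditional_to_joint(P):
    """As defined above: symmetrise and normalise the conditional matrix."""
    return (P + P.T) / (2. * P.shape[0])

rng = np.random.RandomState(0)
# Fake conditional probabilities matrix: zero diagonal, rows sum to 1
P_cond = rng.rand(5, 5)
np.fill_diagonal(P_cond, 0.)
P_cond /= P_cond.sum(axis=1, keepdims=True)

P_joint = p_conditional_to_joint(P_cond)
print(np.allclose(P_joint, P_joint.T))  # True: joint matrix is symmetric
print(np.isclose(P_joint.sum(), 1.))    # True: entries sum to 1
```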
<p>Let’s also define a <code class="language-plaintext highlighter-rouge">p_joint</code> function that takes our data matrix \(\textbf{X}\) and returns the matrix of joint probabilities \(P\), estimating the required \(\sigma_i\)’s and conditional probabilities matrix along the way:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">p_joint</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">target_perplexity</span><span class="p">):</span>
<span class="s">"""Given a data matrix X, gives joint probabilities matrix.
# Arguments
X: Input data matrix.
# Returns:
P: Matrix with entries p_ij = joint probabilities.
"""</span>
<span class="c1"># Get the negative squared euclidean distances matrix for our data
</span> <span class="n">distances</span> <span class="o">=</span> <span class="n">neg_squared_euc_dists</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c1"># Find optimal sigma for each row of this distances matrix
</span> <span class="n">sigmas</span> <span class="o">=</span> <span class="n">find_optimal_sigmas</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">target_perplexity</span><span class="p">)</span>
<span class="c1"># Calculate the probabilities based on these optimal sigmas
</span> <span class="n">p_conditional</span> <span class="o">=</span> <span class="n">calc_prob_matrix</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span> <span class="n">sigmas</span><span class="p">)</span>
<span class="c1"># Go from conditional to joint probabilities matrix
</span> <span class="n">P</span> <span class="o">=</span> <span class="n">p_conditional_to_joint</span><span class="p">(</span><span class="n">p_conditional</span><span class="p">)</span>
<span class="k">return</span> <span class="n">P</span></code></pre></figure>
<p>So we have our joint distributions \(p\) and \(q\). If we calculate these, then we can use the following gradient to update the \(i\)’th row of our low-dimensional representation \(\mathbf{Y}\):</p>
\[\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j)\]
<p>In python, we can use the following function to estimate this gradient, given the joint probability matrices P and Q, and the current lower-dimensional representations Y.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">symmetric_sne_grad</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">Q</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">_</span><span class="p">):</span>
<span class="s">"""Estimate the gradient of the cost with respect to Y"""</span>
<span class="n">pq_diff</span> <span class="o">=</span> <span class="n">P</span> <span class="o">-</span> <span class="n">Q</span> <span class="c1"># NxN matrix
</span> <span class="n">pq_expanded</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">pq_diff</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="c1">#NxNx1
</span> <span class="n">y_diffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="c1">#NxNx2
</span> <span class="n">grad</span> <span class="o">=</span> <span class="mf">4.</span> <span class="o">*</span> <span class="p">(</span><span class="n">pq_expanded</span> <span class="o">*</span> <span class="n">y_diffs</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c1">#Nx2
</span> <span class="k">return</span> <span class="n">grad</span></code></pre></figure>
<p>To vectorise things, there is a bit of <code class="language-plaintext highlighter-rouge">np.expand_dims</code> trickery here. You’ll just have to trust me that <code class="language-plaintext highlighter-rouge">grad</code> is an \(N\)x\(2\) matrix whose \(i\)’th row is \(\frac{\partial C}{\partial y_i}\) (or you can check it yourself).</p>
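<p>You don’t actually have to trust me, though: a finite-difference check is a quick way to verify a gradient implementation numerically. Below is a self-contained sketch that re-declares the relevant functions from this post and compares the analytic gradient against central differences of the KL-divergence cost between \(P\) and \(Q\):</p>

```python
import numpy as np

def neg_squared_euc_dists(X):
    """Negative squared euclidean distances matrix, as earlier in the post."""
    sum_X = np.sum(np.square(X), 1)
    return -np.add(np.add(-2. * np.dot(X, X.T), sum_X).T, sum_X)

def q_joint(Y):
    """Joint Q matrix for Symmetric SNE, as defined above."""
    exp_distances = np.exp(neg_squared_euc_dists(Y))
    np.fill_diagonal(exp_distances, 0.)
    return exp_distances / np.sum(exp_distances), None

def symmetric_sne_grad(P, Q, Y, _):
    """Analytic gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)."""
    pq_expanded = np.expand_dims(P - Q, 2)
    y_diffs = np.expand_dims(Y, 1) - np.expand_dims(Y, 0)
    return 4. * (pq_expanded * y_diffs).sum(1)

def kl_cost(P, Y):
    """KL divergence between joint distributions P and Q(Y)."""
    Q, _ = q_joint(Y)
    mask = P > 0.  # p_ii = q_ii = 0, so skip the diagonal
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

rng = np.random.RandomState(0)
# Random symmetric joint P with zero diagonal, summing to 1
P_cond = rng.rand(6, 6)
np.fill_diagonal(P_cond, 0.)
P_cond /= P_cond.sum(1, keepdims=True)
P = (P_cond + P_cond.T) / (2. * P_cond.shape[0])

Y = rng.normal(0., 1., (6, 2))
analytic = symmetric_sne_grad(P, q_joint(Y)[0], Y, None)

# Central finite differences: perturb each coordinate of Y in turn
eps, numeric = 1e-6, np.zeros_like(Y)
for i in range(Y.shape[0]):
    for k in range(Y.shape[1]):
        Y_hi, Y_lo = Y.copy(), Y.copy()
        Y_hi[i, k] += eps
        Y_lo[i, k] -= eps
        numeric[i, k] = (kl_cost(P, Y_hi) - kl_cost(P, Y_lo)) / (2. * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```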
<p>Once we have the gradients, as we are doing gradient descent, we update \(y_i\) through the following update equation:</p>
\[y_i^{t} = y_i^{t-1} - \eta \frac{\partial C}{\partial y_i}\]
<h3 id="estimating-symmetric-sne">Estimating Symmetric SNE</h3>
<p>So now we have everything we need to estimate Symmetric SNE.</p>
<p>This training loop function will perform gradient descent:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">estimate_sne</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">rng</span><span class="p">,</span> <span class="n">num_iters</span><span class="p">,</span> <span class="n">q_fn</span><span class="p">,</span> <span class="n">grad_fn</span><span class="p">,</span> <span class="n">learning_rate</span><span class="p">,</span>
<span class="n">momentum</span><span class="p">,</span> <span class="n">plot</span><span class="p">):</span>
<span class="s">"""Estimates a SNE model.
# Arguments
X: Input data matrix.
y: Class labels for that matrix.
P: Matrix of joint probabilities.
rng: np.random.RandomState().
num_iters: Iterations to train for.
q_fn: Function that takes Y and gives Q prob matrix.
grad_fn: Function taking P, Q, Y, distances, giving grads.
learning_rate: Learning rate for gradient descent updates.
momentum: Momentum coefficient, or False to disable it.
plot: How many times to plot during training.
# Returns:
Y: Matrix, low-dimensional representation of X.
"""</span>
<span class="c1"># Initialise our 2D representation
</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">,</span> <span class="p">[</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">2</span><span class="p">])</span>
<span class="c1"># Initialise past values (used for momentum)
</span> <span class="k">if</span> <span class="n">momentum</span><span class="p">:</span>
<span class="n">Y_m2</span> <span class="o">=</span> <span class="n">Y</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">Y_m1</span> <span class="o">=</span> <span class="n">Y</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># Start gradient descent loop
</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iters</span><span class="p">):</span>
<span class="c1"># Get Q and distances (distances only used for t-SNE)
</span> <span class="n">Q</span><span class="p">,</span> <span class="n">distances</span> <span class="o">=</span> <span class="n">q_fn</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span>
<span class="c1"># Estimate gradients with respect to Y
</span> <span class="n">grads</span> <span class="o">=</span> <span class="n">grad_fn</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">Q</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">distances</span><span class="p">)</span>
<span class="c1"># Update Y
</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">Y</span> <span class="o">-</span> <span class="n">learning_rate</span> <span class="o">*</span> <span class="n">grads</span>
<span class="k">if</span> <span class="n">momentum</span><span class="p">:</span> <span class="c1"># Add momentum
</span> <span class="n">Y</span> <span class="o">+=</span> <span class="n">momentum</span> <span class="o">*</span> <span class="p">(</span><span class="n">Y_m1</span> <span class="o">-</span> <span class="n">Y_m2</span><span class="p">)</span>
<span class="c1"># Update previous Y's for momentum
</span> <span class="n">Y_m2</span> <span class="o">=</span> <span class="n">Y_m1</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">Y_m1</span> <span class="o">=</span> <span class="n">Y</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="c1"># Plot sometimes
</span> <span class="k">if</span> <span class="n">plot</span> <span class="ow">and</span> <span class="n">i</span> <span class="o">%</span> <span class="p">(</span><span class="n">num_iters</span> <span class="o">/</span> <span class="n">plot</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">categorical_scatter_2d</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">ms</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span>
<span class="n">show</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="k">return</span> <span class="n">Y</span></code></pre></figure>
<p>To keep things simple, we will fit Symmetric SNE to the first 200 0’s, 1’s and 8’s from MNIST. Here is a <code class="language-plaintext highlighter-rouge">main()</code> function to do so:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Set global parameters
</span><span class="n">NUM_POINTS</span> <span class="o">=</span> <span class="mi">200</span> <span class="c1"># Number of samples from MNIST
</span><span class="n">CLASSES_TO_USE</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">]</span> <span class="c1"># MNIST classes to use
</span><span class="n">PERPLEXITY</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">SEED</span> <span class="o">=</span> <span class="mi">1</span> <span class="c1"># Random seed
</span><span class="n">MOMENTUM</span> <span class="o">=</span> <span class="mf">0.9</span>
<span class="n">LEARNING_RATE</span> <span class="o">=</span> <span class="mf">10.</span>
<span class="n">NUM_ITERS</span> <span class="o">=</span> <span class="mi">500</span> <span class="c1"># Num iterations to train for
</span><span class="n">TSNE</span> <span class="o">=</span> <span class="bp">False</span> <span class="c1"># If False, Symmetric SNE
</span><span class="n">NUM_PLOTS</span> <span class="o">=</span> <span class="mi">5</span> <span class="c1"># Num. times to plot in training
</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="c1"># numpy RandomState for reproducibility
</span> <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">RandomState</span><span class="p">(</span><span class="n">SEED</span><span class="p">)</span>
<span class="c1"># Load the first NUM_POINTS 0's, 1's and 8's from MNIST
</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">load_mnist</span><span class="p">(</span><span class="s">'datasets/'</span><span class="p">,</span>
<span class="n">digits_to_keep</span><span class="o">=</span><span class="n">CLASSES_TO_USE</span><span class="p">,</span>
<span class="n">N</span><span class="o">=</span><span class="n">NUM_POINTS</span><span class="p">)</span>
<span class="c1"># Obtain matrix of joint probabilities p_ij
</span> <span class="n">P</span> <span class="o">=</span> <span class="n">p_joint</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">PERPLEXITY</span><span class="p">)</span>
<span class="c1"># Fit SNE or t-SNE
</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">estimate_sne</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">P</span><span class="p">,</span> <span class="n">rng</span><span class="p">,</span>
<span class="n">num_iters</span><span class="o">=</span><span class="n">NUM_ITERS</span><span class="p">,</span>
<span class="n">q_fn</span><span class="o">=</span><span class="n">q_tsne</span> <span class="k">if</span> <span class="n">TSNE</span> <span class="k">else</span> <span class="n">q_joint</span><span class="p">,</span>
<span class="n">grad_fn</span><span class="o">=</span><span class="n">tsne_grad</span> <span class="k">if</span> <span class="n">TSNE</span> <span class="k">else</span> <span class="n">symmetric_sne_grad</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="n">LEARNING_RATE</span><span class="p">,</span>
<span class="n">momentum</span><span class="o">=</span><span class="n">MOMENTUM</span><span class="p">,</span>
<span class="n">plot</span><span class="o">=</span><span class="n">NUM_PLOTS</span><span class="p">)</span></code></pre></figure>
<p>You can find the <code class="language-plaintext highlighter-rouge">load_mnist</code> function in the <a href="https://github.com/nlml/tsne_raw">repo</a>, which will prepare the dataset as specified.</p>
<h3 id="symmetric-sne-results">Symmetric SNE results</h3>
<p>Here’s what the results look like after running Symmetric SNE for 500 iterations:</p>
<p><img src="/images/tsne/symm-sne.png" alt="Symmetric SNE fit to two digits from the MNIST dataset" />
<em>Resulting two-dimensional representation of the first 200 0’s, 1’s and 8’s in the MNIST dataset, obtained via Symmetric SNE.</em></p>
<p>So we can see in this case Symmetric SNE is still quite capable of separating out the three different types of data that we have in our dataset.</p>
<h2 id="putting-the-t-in-t-sne">Putting the t in t-SNE</h2>
<p>Phew! That was a lot of effort. Fortunately, going from Symmetric SNE to t-SNE is simple. The only real difference is how we define the joint probability distribution matrix \(Q\), which has entries \(q_{ij}\). In t-SNE, this changes from <a href="#eq2">\((2)\)</a> to the following:</p>
<p><a name="eq3"></a></p>
\[q_{ij} = \frac{ \left ( 1 + || y_i - y_j || ^2 \right ) ^ {-1} }{\sum_{k \neq l} \left ( 1 + || y_k - y_l || ^2 \right ) ^ {-1} } \hspace{2em} (3)\]
<p>This is derived by assuming the \(q_{ij}\) follow a Student t-distribution with one degree of freedom. Van der Maaten and Hinton note that this has the nice property that the numerator approaches an inverse square law for large distances in the low-dimensional space. Essentially, this means the algorithm is almost invariant to the general scale of the low-dimensional mapping. Thus the optimisation works in the same way for points that are very far apart as it does for points that are closer together.</p>
<p>This addresses the so-called ‘crowding problem:’ when we try to represent a high-dimensional dataset in two or three dimensions, it becomes difficult to separate nearby data points from moderately-far-apart data points – everything becomes crowded together, and this prevents the natural clusters in the dataset from becoming separated.</p>
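<p>To see just how much heavier the Student-t tail is, here is a quick numerical comparison of the two (unnormalised) similarity kernels — the helper names are mine, not from this post’s code. The ratio between them grows rapidly with distance, which is what gives moderately-far-apart points room to spread out:</p>

```python
import numpy as np

# Unnormalised similarity kernels used by Symmetric SNE and t-SNE
gaussian = lambda d: np.exp(-d ** 2)
student_t = lambda d: 1. / (1. + d ** 2)

# The Student-t similarity decays polynomially, the Gaussian exponentially,
# so their ratio explodes as the distance d grows
for d in [1., 2., 3., 5.]:
    print(d, student_t(d) / gaussian(d))
```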
<p>We can implement this new \(q_{ij}\) in python as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">q_tsne</span><span class="p">(</span><span class="n">Y</span><span class="p">):</span>
<span class="s">"""t-SNE: Given low-dimensional representations Y, compute
matrix of joint probabilities with entries q_ij."""</span>
<span class="n">distances</span> <span class="o">=</span> <span class="n">neg_squared_euc_dists</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span>
<span class="n">inv_distances</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">power</span><span class="p">(</span><span class="mf">1.</span> <span class="o">-</span> <span class="n">distances</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">inv_distances</span><span class="p">,</span> <span class="mf">0.</span><span class="p">)</span>
<span class="k">return</span> <span class="n">inv_distances</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">inv_distances</span><span class="p">),</span> <span class="n">inv_distances</span></code></pre></figure>
<p>Note that we used <code class="language-plaintext highlighter-rouge">1. - distances</code> instead of <code class="language-plaintext highlighter-rouge">1. + distances</code> as our distance function returns negative distances.</p>
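<p>If you want to convince yourself of that sign flip, here is a tiny standalone check (re-declaring <code class="language-plaintext highlighter-rouge">neg_squared_euc_dists</code> from earlier in the post): for two points at euclidean distance 5, <code class="language-plaintext highlighter-rouge">1. - distances</code> gives \(1 + 5^2 = 26\) off the diagonal, exactly the \(1 + ||y_i - y_j||^2\) term we need.</p>

```python
import numpy as np

def neg_squared_euc_dists(X):
    """Negative squared euclidean distances, as earlier in the post."""
    sum_X = np.sum(np.square(X), 1)
    return -np.add(np.add(-2. * np.dot(X, X.T), sum_X).T, sum_X)

Y = np.array([[0., 0.], [3., 4.]])
# The two points are at euclidean distance 5, so the off-diagonal
# entries of 1. - distances equal 1 + 25 = 26 (diagonal entries are 1)
print(1. - neg_squared_euc_dists(Y))
```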
<p>The only thing left to do now is to re-estimate the gradient of the cost with respect to \(\mathbf{Y}\). This gradient is derived in the t-SNE paper as:</p>
\[\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) \left ( 1 + || y_i - y_j || ^2 \right ) ^ {-1}\]
<p>Basically, we have just multiplied the Symmetric SNE gradient by the <code class="language-plaintext highlighter-rouge">inv_distances</code> matrix we obtained halfway through the <code class="language-plaintext highlighter-rouge">q_tsne</code> function shown just above (this is why we also returned this matrix).</p>
<p>We can easily implement this by just extending our earlier Symmetric SNE gradient function:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">tsne_grad</span><span class="p">(</span><span class="n">P</span><span class="p">,</span> <span class="n">Q</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">inv_distances</span><span class="p">):</span>
<span class="s">"""Estimate the gradient of t-SNE cost with respect to Y."""</span>
<span class="n">pq_diff</span> <span class="o">=</span> <span class="n">P</span> <span class="o">-</span> <span class="n">Q</span>
<span class="n">pq_expanded</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">pq_diff</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">y_diffs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># Expand our inv_distances matrix so can multiply by y_diffs
</span> <span class="n">distances_expanded</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">inv_distances</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Multiply this by inverse distances matrix
</span> <span class="n">y_diffs_wt</span> <span class="o">=</span> <span class="n">y_diffs</span> <span class="o">*</span> <span class="n">distances_expanded</span>
<span class="c1"># Multiply then sum over j's
</span> <span class="n">grad</span> <span class="o">=</span> <span class="mf">4.</span> <span class="o">*</span> <span class="p">(</span><span class="n">pq_expanded</span> <span class="o">*</span> <span class="n">y_diffs_wt</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">grad</span></code></pre></figure>
<h3 id="estimating-t-sne">Estimating t-SNE</h3>
<p>We saw in the call to <code class="language-plaintext highlighter-rouge">estimate_sne</code> in our <code class="language-plaintext highlighter-rouge">main()</code> function above that these two functions (<code class="language-plaintext highlighter-rouge">q_tsne</code> and <code class="language-plaintext highlighter-rouge">tsne_grad</code>) will be automatically passed to the training loop if <code class="language-plaintext highlighter-rouge">TSNE = True</code>. Hence we just need to set this flag if we want TSNE instead of Symmetric SNE. Easy!</p>
<p>Setting this flag and running <code class="language-plaintext highlighter-rouge">main()</code> gives the following 2D representation:</p>
<p><img src="/images/tsne/tsne.png" alt="t-SNE fit to two digits from the MNIST dataset" />
<em>t-SNE representation of the first 200 0’s, 1’s and 8’s in the MNIST dataset after 500 iterations.</em></p>
<p>This looks a little better than the Symmetric SNE result above. When we scale up to more challenging cases, the advantages of t-SNE are clearer. Here are the results from Symmetric SNE versus t-SNE when we use the first 500 0’s, 1’s, 4’s, 7’s and 8’s from the MNIST dataset:</p>
<p><img src="/images/tsne/symm-sne-2.png" alt="Symmetric SNE fit to five digits from the MNIST dataset" />
<em>Symmetric SNE representation of the first 500 0’s, 1’s, 4’s, 7’s and 8’s in the MNIST dataset after 500 iterations.</em></p>
<p><img src="/images/tsne/tsne-2.png" alt="t-SNE fit to five digits from the MNIST dataset" />
<em>t-SNE representation of the first 500 0’s, 1’s, 4’s, 7’s and 8’s in the MNIST dataset after 500 iterations.</em></p>
<p>It looks like Symmetric SNE has had a harder time disentangling the classes than t-SNE in this case.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>Overall, the results look a tad lacklustre as, for simplicity, I’ve omitted a number of optimisation details from the original t-SNE paper (plus I used only 500 data points and barely tuned the hyperparameters).</p>
<p>Still, this exercise really helped me to properly understand how t-SNE works. I hope it had a similar effect for you.</p>
<p>Thanks for reading!</p>Liam SchoneveldThis is the first post in the In Raw Numpy series. This series is an attempt to provide readers (and myself) with an understanding of some of the most frequently-used machine learning methods by going through the math and intuition, and implementing it using just python and numpy.Using Tensorboard Embeddings Visualiser with Numpy Arrays2017-08-13T00:00:00+00:002017-08-13T00:00:00+00:00https://nlml.github.io/tensorflow/using-tensorboard-embeddings-visualiser-with-numpy-arrays<p>Tensorboard’s <a href="https://www.tensorflow.org/get_started/embedding_viz">embeddings visualiser</a> is great. You can use it to visualise and explore any set of high dimensional vectors (say, the activations of a hidden layer of a neural net) in a lower-dimensional space.</p>
<p><img src="/images/embs/embeddings-visualiser.png" alt="Tensorboard embedding visualiser in action" /></p>
<p>Often though, I’ve found it to be a bit of a pain to integrate saving the embeddings correctly into my model training code. Plus there are plenty of non-Tensorflow-based vectors that I’d like to be able to easily visualise through this tool.</p>
<p>So I decided to throw together a function <code class="language-plaintext highlighter-rouge">save_embeddings()</code> that takes the hassle out of this, allowing you to go straight from numpy arrays to Tensorboard-visualised embeddings. <a href="https://github.com/nlml/np-to-tf-embeddings-visualiser">You can find the code here</a>. Enjoy!</p>
<p>(Thanks to <a href="http://www.pinchofintelligence.com/simple-introduction-to-tensorboard-embedding-visualisation/">this Pinch of Intelligence post</a> for some useful code snippets that I re-used for this).</p>liam schoneveldAdversarial Neural Cryptography in Theano2016-11-05T00:00:00+00:002016-11-05T00:00:00+00:00https://nlml.github.io/neural-networks/adversarial-neural-cryptography<p>Last week I read Abadi and Andersen’s recent paper <a href="#cite1">[1]</a>, <a href="https://arxiv.org/pdf/1610.06918v1.pdf"><em>Learning to Protect Communications with Adversarial Neural Cryptography</em></a>. I thought the idea seemed pretty cool and that it wouldn’t be too tricky to implement, and would also serve as an ideal project to learn a bit more Theano. This post describes the paper, <a href="https://github.com/nlml/adversarial-neural-crypt">my implementation</a>, and the results.</p>
<h2 id="the-setup">The setup</h2>
<p>The authors set up their experiment as follows. We have three neural networks, named Alice, Bob, and Eve. Alice wishes to communicate an N bit message <strong><em>P</em></strong> to Bob. Alice and Bob also share a key (which you can think of as a password) of N bits.</p>
<p>Alice takes the message and the key, and encrypts the message, producing a communication <strong><em>C</em></strong> of N bits. Bob receives this communication, and then attempts to decrypt it, producing <strong><em>P<sub>Bob</sub></em></strong>.</p>
<p>Unfortunately for Bob and Alice, Eve intercepts Alice’s communication <strong><em>C</em></strong>. She then decrypts this message herself, producing her attempted recovery of <strong><em>P</em></strong>, which is called <strong><em>P<sub>Eve</sub></em></strong>.</p>
<p><img src="/images/crypt/fig1.png" alt="The adversarial network setup" />
<em>Figure 1: The adversarial network setup diagram given in <a href="#cite1">[1]</a>.</em></p>
<h3 id="neural-networks">Neural networks</h3>
<p>As mentioned, Alice, Bob and Eve are all neural networks. All three of these networks are quite similar.</p>
<p>Alice (Figure 2) takes as input the message and key vectors, concatenated into one long vector of length 2N. This then passes through a single fully-connected hidden layer of size 2N. It then passes through what I will refer to as the <em>standard convolutional setup</em>, which takes this 2N-length vector, passes it through a number of 1D convolution filters, and eventually outputs an N-length vector. This is the communication <strong><em>C</em></strong> that gets sent to Bob.</p>
<p><img src="/images/crypt/alice.png" alt="Alice's neural network configuration" /></p>
<p><em>Figure 2: Alice’s neural network setup.</em></p>
<p>Bob’s network (Figure 3) is identical to Alice’s, except that his input is the concatenation of the communication and the key.</p>
<p><img src="/images/crypt/bob.png" alt="Bob's neural network configuration" /></p>
<p><em>Figure 3: Bob’s neural network setup.</em></p>
<p>Eve’s network is also quite similar to Bob and Alice’s. However her input is just the communication <strong><em>C</em></strong>. She also has an additional fully-connected hidden layer of size 2N prior to the standard convolutional setup: the authors wanted to make Eve a bit more complex in order to give her a better chance of figuring out how to decrypt <strong><em>C</em></strong>.</p>
<p><img src="/images/crypt/eve.png" alt="Eve's neural network configuration" /></p>
<p><em>Figure 4: Eve’s neural network setup.</em></p>
<p>The following table gives the output dimensionality of each layer in each network:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Output shape</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice input</td>
<td>2N</td>
</tr>
<tr>
<td>Alice hidden</td>
<td>2N</td>
</tr>
<tr>
<td>Alice convolutional</td>
<td>(2, 2N+1) –> (4, N) –> (4, N) –> (1, N)</td>
</tr>
<tr>
<td>Bob input</td>
<td>2N</td>
</tr>
<tr>
<td>Bob hidden</td>
<td>2N</td>
</tr>
<tr>
<td>Bob convolutional</td>
<td>As per Alice</td>
</tr>
<tr>
<td>Eve hidden 1</td>
<td>2N</td>
</tr>
<tr>
<td>Eve hidden 2</td>
<td>2N</td>
</tr>
<tr>
<td>Eve convolutional</td>
<td>As per Alice</td>
</tr>
</tbody>
</table>
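<p>The convolutional output shapes in the table can be sanity-checked with the standard 1D output-length formula, floor((L + 2·pad − kernel) / stride) + 1, using the kernel, stride and padding values from the <code class="language-plaintext highlighter-rouge">StandardConvSetup</code> implementation later in this post (kernel 4 with padding 2, then kernel 2 with stride 2, then two size-1 convolutions). A quick sketch with N = 16:</p>

```python
def conv_out_len(length, kernel, stride=1, pad=0):
    # Standard 1D convolution output-length formula
    return (length + 2 * pad - kernel) // stride + 1

N = 16
L = 2 * N                      # input: concatenated message and key
L = conv_out_len(L, 4, 1, 2)   # conv1: kernel 4, pad 2  -> 2N + 1
assert L == 2 * N + 1
L = conv_out_len(L, 2, 2)      # conv2: kernel 2, stride 2 -> N
assert L == N
L = conv_out_len(L, 1)         # conv3: size-1 kernel -> N
L = conv_out_len(L, 1)         # conv4: size-1 kernel -> N
assert L == N
```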
<h3 id="loss-functions">Loss functions</h3>
<p>The next ingredients are the loss functions:</p>
<p>Eve’s loss function is simply the absolute difference between the message vector <strong><em>P</em></strong> and her attempt at decrypting it <strong><em>P<sub>Eve</sub></em></strong>. This loss is backpropagated through the weights in the Eve network to update her weights.</p>
<p><strong><em>Loss<sub>Eve</sub> = Eve L1 error = abs(P - P<sub>Eve</sub>)</em></strong></p>
<p>Bob and Alice’s loss function is also the absolute decryption error, <strong>plus</strong> an additional term that signifies how well Eve is currently decrypting the message. This additional term is:</p>
<p><strong><em>(N / 2 - Eve L1 error)<sup>2</sup> / (N / 2)<sup>2</sup></em></strong></p>
<p>This is minimised when Eve gets half the bits correct, on average, in her decryption, which indicates that she is doing no better than random guessing. The gradient of this loss function is used to update the weights in both the Bob and Alice networks. Putting these together, we have:</p>
<p><strong><em>Loss<sub>Bob</sub> = Bob L1 error + (N / 2 - Eve L1 error)<sup>2</sup> / (N / 2)<sup>2</sup></em></strong></p>
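<p>A quick numeric check of how the extra term behaves (assuming N = 16, so N/2 = 8): the penalty vanishes when Eve's error equals N/2 (chance level) and reaches its maximum of 1 when she decrypts perfectly — or gets every bit wrong, which would be just as informative:</p>

```python
N = 16

def eve_penalty(eve_l1_error):
    # (N/2 - Eve L1 error)^2 / (N/2)^2: the extra term in Alice and Bob's loss
    return (N / 2 - eve_l1_error) ** 2 / (N / 2) ** 2

assert eve_penalty(N / 2) == 0.0  # Eve at chance level: no penalty
assert eve_penalty(0) == 1.0      # Eve decrypts perfectly: maximal penalty
assert eve_penalty(N) == 1.0      # Eve perfectly wrong: equally bad for Alice
```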
<h3 id="adversarial-training">Adversarial training</h3>
<p>Training then proceeds in an adversarial way: we train Bob and Alice for a while until they are communicating effectively, then we train Eve for some time, until she learns to decrypt the message. Then we train Bob and Alice again, who figure out how to fool Eve’s current decryption method. Then we train Eve again, who cracks the improved encryption, and so on. The authors find that after a while, Bob and Alice’s encryption becomes too strong for Eve to be able to learn to crack it.</p>
<h2 id="theano-implementation">Theano implementation</h2>
<p>The implementation is fairly straightforward. I have used some custom classes, <code class="language-plaintext highlighter-rouge">HiddenLayer()</code> and <code class="language-plaintext highlighter-rouge">ConvLayer()</code> for adding standard NN layers. You can find these in the <code class="language-plaintext highlighter-rouge">layers.py</code> file.</p>
<p>We first need to define our batch size and N up front. I have added the possibility to have different lengths for the key, message and communication; however, I have not tested this, and changing these lengths might cause issues.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">512</span>
<span class="n">msg_len</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">key_len</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">comm_len</span> <span class="o">=</span> <span class="mi">16</span></code></pre></figure>
<h3 id="alice-and-bob">Alice and Bob</h3>
<p>For Alice and Bob we just create a fairly straightforward sequential NN:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Tensor variables for the message and key
</span><span class="n">msg_in</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'msg_in'</span><span class="p">)</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'key'</span><span class="p">)</span>
<span class="c1"># Alice's input is the concatenation of the message and the key
</span><span class="n">alice_in</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">msg_in</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Alice's hidden layer
</span><span class="n">alice_hid</span> <span class="o">=</span> <span class="n">HiddenLayer</span><span class="p">(</span><span class="n">alice_in</span><span class="p">,</span>
<span class="n">input_size</span><span class="o">=</span><span class="n">msg_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span>
<span class="n">hidden_size</span><span class="o">=</span><span class="n">msg_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s">'alice_to_hid'</span><span class="p">,</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span>
<span class="c1"># Reshape the output of Alice's hidden layer for convolution
</span><span class="n">alice_conv_in</span> <span class="o">=</span> <span class="n">alice_hid</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">msg_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="c1"># Alice's convolutional layers
</span><span class="n">alice_conv</span> <span class="o">=</span> <span class="n">StandardConvSetup</span><span class="p">(</span><span class="n">alice_conv_in</span><span class="p">,</span> <span class="s">'alice'</span><span class="p">)</span>
<span class="c1"># Get the output communication
</span><span class="n">alice_comm</span> <span class="o">=</span> <span class="n">alice_conv</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">msg_len</span><span class="p">))</span>
<span class="c1"># Bob's input is the concatenation of Alice's communication and the key
</span><span class="n">bob_in</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">alice_comm</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># He decrypts using a hidden layer and a conv net as per Alice
</span><span class="n">bob_hid</span> <span class="o">=</span> <span class="n">HiddenLayer</span><span class="p">(</span><span class="n">bob_in</span><span class="p">,</span>
<span class="n">input_size</span><span class="o">=</span><span class="n">comm_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span>
<span class="n">hidden_size</span><span class="o">=</span><span class="n">comm_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s">'bob_to_hid'</span><span class="p">,</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span>
<span class="n">bob_conv_in</span> <span class="o">=</span> <span class="n">bob_hid</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">comm_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">bob_conv</span> <span class="o">=</span> <span class="n">StandardConvSetup</span><span class="p">(</span><span class="n">bob_conv_in</span><span class="p">,</span> <span class="s">'bob'</span><span class="p">)</span>
<span class="n">bob_msg</span> <span class="o">=</span> <span class="n">bob_conv</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">msg_len</span><span class="p">))</span></code></pre></figure>
<h3 id="eve">Eve</h3>
<p>Eve is similarly implemented. We just need to use <code class="language-plaintext highlighter-rouge">alice_comm</code> in her inputs.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Eve sees Alice's communication to Bob, but not the key
# She gets an extra hidden layer to try and learn to decrypt the message
</span><span class="n">eve_hid1</span> <span class="o">=</span> <span class="n">HiddenLayer</span><span class="p">(</span><span class="n">alice_comm</span><span class="p">,</span>
<span class="n">input_size</span><span class="o">=</span><span class="n">comm_len</span><span class="p">,</span>
<span class="n">hidden_size</span><span class="o">=</span><span class="n">comm_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s">'eve_to_hid1'</span><span class="p">,</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span>
<span class="n">eve_hid2</span> <span class="o">=</span> <span class="n">HiddenLayer</span><span class="p">(</span><span class="n">eve_hid1</span><span class="p">,</span>
<span class="n">input_size</span><span class="o">=</span><span class="n">comm_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span>
<span class="n">hidden_size</span><span class="o">=</span><span class="n">comm_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s">'eve_to_hid2'</span><span class="p">,</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span>
<span class="n">eve_conv_in</span> <span class="o">=</span> <span class="n">eve_hid2</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">comm_len</span> <span class="o">+</span> <span class="n">key_len</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">eve_conv</span> <span class="o">=</span> <span class="n">StandardConvSetup</span><span class="p">(</span><span class="n">eve_conv_in</span><span class="p">,</span> <span class="s">'eve'</span><span class="p">)</span>
<span class="n">eve_msg</span> <span class="o">=</span> <span class="n">eve_conv</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">msg_len</span><span class="p">))</span></code></pre></figure>
<h3 id="loss-functions-1">Loss functions</h3>
<p>Here we just implement the loss equations from the previous section. Note that the additional term in Bob’s loss function is a bit simpler than the equation above: errors are measured as means rather than sums, and since bits are encoded as -1 or 1, a single wrong bit contributes an absolute error of 2. A mean error of 1 therefore corresponds to half the bits being wrong — i.e. random guessing — so the N/2 terms drop out of the implementation.</p>
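<p>To confirm this rescaling: with ±1 bits, each wrongly guessed bit contributes |(-1) - 1| = 2 to the absolute error and each correct bit contributes 0, so a random guesser lands at a mean error of about 1. A quick NumPy check:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
msg = rng.choice([-1.0, 1.0], size=(512, 16))    # true ±1 message bits
guess = rng.choice([-1.0, 1.0], size=(512, 16))  # Eve guessing at random

# Half the bits differ on average, and each differing bit contributes 2
mean_err = np.abs(msg - guess).mean()
assert abs(mean_err - 1.0) < 0.1
```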
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Eve's loss function is the L1 norm between true and recovered msg
</span><span class="n">decrypt_err_eve</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">abs_</span><span class="p">(</span><span class="n">msg_in</span> <span class="o">-</span> <span class="n">eve_msg</span><span class="p">))</span>
<span class="c1"># Bob's loss function is the L1 norm between true and recovered
</span><span class="n">decrypt_err_bob</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">T</span><span class="p">.</span><span class="n">abs_</span><span class="p">(</span><span class="n">msg_in</span> <span class="o">-</span> <span class="n">bob_msg</span><span class="p">))</span>
<span class="c1"># plus (N/2 - decrypt_err_eve) ** 2 / (N / 2) ** 2
# --> Bob wants Eve to do only as good as random guessing
</span><span class="n">loss_bob</span> <span class="o">=</span> <span class="n">decrypt_err_bob</span> <span class="o">+</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">-</span> <span class="n">decrypt_err_eve</span><span class="p">)</span> <span class="o">**</span> <span class="mf">2.</span></code></pre></figure>
<h3 id="training-functions">Training functions</h3>
<p>The only tricky-ish thing here is making sure that the training function for Alice and Bob updates all their parameters, while Eve’s only updates her parameters. I use <code class="language-plaintext highlighter-rouge">lasagne.adam</code> for an implementation of the Adam SGD optimiser. I put the functions in dictionaries for ease of use in adversarial training.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Get all the parameters for Bob and Alice, make updates, train and pred funcs
</span><span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">'bob'</span> <span class="p">:</span> <span class="n">get_all_params</span><span class="p">([</span><span class="n">bob_conv</span><span class="p">,</span> <span class="n">bob_hid</span><span class="p">,</span>
<span class="n">alice_conv</span><span class="p">,</span> <span class="n">alice_hid</span><span class="p">])}</span>
<span class="n">updates</span> <span class="o">=</span> <span class="p">{</span><span class="s">'bob'</span> <span class="p">:</span> <span class="n">adam</span><span class="p">(</span><span class="n">loss_bob</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">'bob'</span><span class="p">])}</span>
<span class="n">err_fn</span> <span class="o">=</span> <span class="p">{</span><span class="s">'bob'</span> <span class="p">:</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">msg_in</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span>
<span class="n">outputs</span><span class="o">=</span><span class="n">decrypt_err_bob</span><span class="p">)}</span>
<span class="n">train_fn</span> <span class="o">=</span> <span class="p">{</span><span class="s">'bob'</span> <span class="p">:</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">msg_in</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span>
<span class="n">outputs</span><span class="o">=</span><span class="n">loss_bob</span><span class="p">,</span>
<span class="n">updates</span><span class="o">=</span><span class="n">updates</span><span class="p">[</span><span class="s">'bob'</span><span class="p">])}</span>
<span class="n">pred_fn</span> <span class="o">=</span> <span class="p">{</span><span class="s">'bob'</span> <span class="p">:</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">msg_in</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="n">bob_msg</span><span class="p">)}</span>
<span class="c1"># Get all the parameters for Eve, make updates, train and pred funcs
</span><span class="n">params</span><span class="p">[</span><span class="s">'eve'</span><span class="p">]</span> <span class="o">=</span> <span class="n">get_all_params</span><span class="p">([</span><span class="n">eve_hid1</span><span class="p">,</span> <span class="n">eve_hid2</span><span class="p">,</span> <span class="n">eve_conv</span><span class="p">])</span>
<span class="n">updates</span><span class="p">[</span><span class="s">'eve'</span><span class="p">]</span> <span class="o">=</span> <span class="n">adam</span><span class="p">(</span><span class="n">decrypt_err_eve</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">'eve'</span><span class="p">])</span>
<span class="n">err_fn</span><span class="p">[</span><span class="s">'eve'</span><span class="p">]</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">msg_in</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span>
<span class="n">outputs</span><span class="o">=</span><span class="n">decrypt_err_eve</span><span class="p">)</span>
<span class="n">train_fn</span><span class="p">[</span><span class="s">'eve'</span><span class="p">]</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">msg_in</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span>
<span class="n">outputs</span><span class="o">=</span><span class="n">decrypt_err_eve</span><span class="p">,</span>
<span class="n">updates</span><span class="o">=</span><span class="n">updates</span><span class="p">[</span><span class="s">'eve'</span><span class="p">])</span>
<span class="n">pred_fn</span><span class="p">[</span><span class="s">'eve'</span><span class="p">]</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">msg_in</span><span class="p">,</span> <span class="n">key</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="n">eve_msg</span><span class="p">)</span></code></pre></figure>
<h3 id="convolution-layers">Convolution layers</h3>
<p>Since it is used in all three networks, I made a custom class for the <em>standard convolutional setup</em>. It stores all the parameters and tensors relevant to all of the convolutional layers in the model. I have tried to match the convolutional setup described in the paper:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">StandardConvSetup</span><span class="p">():</span>
<span class="s">'''
Standard convolutional layers setup used by Alice, Bob and Eve.
Input should be 4d tensor of shape (batch_size, 1, msg_len + key_len, 1)
Output is 4d tensor of shape (batch_size, 1, msg_len, 1)
'''</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">reshaped_input</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'unnamed'</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
<span class="bp">self</span><span class="p">.</span><span class="n">conv_layer1</span> <span class="o">=</span> <span class="n">ConvLayer</span><span class="p">(</span><span class="n">reshaped_input</span><span class="p">,</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="c1">#num outs, num ins, size
</span> <span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">stride</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span>
<span class="n">name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">'_conv1'</span><span class="p">,</span>
<span class="n">border_mode</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">conv_layer2</span> <span class="o">=</span> <span class="n">ConvLayer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">conv_layer1</span><span class="p">,</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">stride</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span>
<span class="n">name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">'_conv2'</span><span class="p">,</span>
<span class="n">border_mode</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">conv_layer3</span> <span class="o">=</span> <span class="n">ConvLayer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">conv_layer2</span><span class="p">,</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">stride</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span>
<span class="n">name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">'_conv3'</span><span class="p">,</span>
<span class="n">border_mode</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'relu'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">conv_layer4</span> <span class="o">=</span> <span class="n">ConvLayer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">conv_layer3</span><span class="p">,</span>
<span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">stride</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span>
<span class="n">name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">'_conv4'</span><span class="p">,</span>
<span class="n">border_mode</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">),</span>
<span class="n">act_fn</span><span class="o">=</span><span class="s">'tanh'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv_layer4</span><span class="p">.</span><span class="n">output</span>
<span class="bp">self</span><span class="p">.</span><span class="n">layers</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">conv_layer1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv_layer2</span><span class="p">,</span>
<span class="bp">self</span><span class="p">.</span><span class="n">conv_layer3</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv_layer4</span><span class="p">]</span>
<span class="bp">self</span><span class="p">.</span><span class="n">params</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">layers</span><span class="p">:</span>
<span class="bp">self</span><span class="p">.</span><span class="n">params</span> <span class="o">+=</span> <span class="n">l</span><span class="p">.</span><span class="n">params</span></code></pre></figure>
<h3 id="training">Training</h3>
<p>To perform the adversarial training, I made a <code class="language-plaintext highlighter-rouge">train()</code> function that trains either Alice and Bob, or Eve, for some time. We then alternate between calling this function for Alice and Bob, and for Eve. The <code class="language-plaintext highlighter-rouge">gen_data()</code> function generates <code class="language-plaintext highlighter-rouge">batch_size</code> random message and key pairs. We train according to the loss, but for plotting we just store the decryption error of whichever party is currently being trained.</p>
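<p>The <code class="language-plaintext highlighter-rouge">gen_data()</code> helper itself isn't reproduced here; a plausible sketch (the exact function in the repository may differ) simply draws random ±1 message and key batches as float32 arrays for Theano:</p>

```python
import numpy as np

batch_size, msg_len, key_len = 512, 16, 16

def gen_data():
    # Draw random ±1 bit vectors for the message and the key
    msg = (2 * np.random.randint(0, 2, (batch_size, msg_len)) - 1).astype(np.float32)
    key = (2 * np.random.randint(0, 2, (batch_size, key_len)) - 1).astype(np.float32)
    return msg, key

m, k = gen_data()
assert m.shape == (batch_size, msg_len) and k.shape == (batch_size, key_len)
assert set(np.unique(m).tolist()) <= {-1.0, 1.0}
```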
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Function for training either Bob+Alice or Eve for some time
</span><span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">bob_or_eve</span><span class="p">,</span> <span class="n">results</span><span class="p">,</span> <span class="n">max_iters</span><span class="p">,</span> <span class="n">print_every</span><span class="p">,</span> <span class="n">es</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">es_limit</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iters</span><span class="p">):</span>
<span class="c1"># Generate some data
</span> <span class="n">msg_in_val</span><span class="p">,</span> <span class="n">key_val</span> <span class="o">=</span> <span class="n">gen_data</span><span class="p">()</span>
<span class="c1"># Train on this batch and get loss
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">train_fn</span><span class="p">[</span><span class="n">bob_or_eve</span><span class="p">](</span><span class="n">msg_in_val</span><span class="p">,</span> <span class="n">key_val</span><span class="p">)</span>
<span class="c1"># Store absolute decryption error of the model on this batch
</span> <span class="n">results</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">results</span><span class="p">,</span>
<span class="n">err_fn</span><span class="p">[</span><span class="n">bob_or_eve</span><span class="p">](</span><span class="n">msg_in_val</span><span class="p">,</span> <span class="n">key_val</span><span class="p">).</span><span class="nb">sum</span><span class="p">()))</span>
<span class="c1"># Print loss now and then
</span> <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="n">print_every</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span> <span class="s">'training loss:'</span><span class="p">,</span> <span class="n">loss</span>
<span class="c1"># Early stopping if we see a low-enough decryption error enough times
</span> <span class="k">if</span> <span class="n">es</span> <span class="ow">and</span> <span class="n">loss</span> <span class="o"><</span> <span class="n">es</span><span class="p">:</span>
<span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">count</span> <span class="o">></span> <span class="n">es_limit</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">results</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">max_iters</span> <span class="o">-</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)))</span>
<span class="c1"># Initialise some empty results arrays
</span><span class="n">results_bob</span><span class="p">,</span> <span class="n">results_eve</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
<span class="n">adversarial_iterations</span> <span class="o">=</span> <span class="mi">60</span>
<span class="c1"># Perform adversarial training
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">adversarial_iterations</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">2000</span>
<span class="n">print_every</span> <span class="o">=</span> <span class="mi">100</span>
<span class="k">print</span> <span class="s">'training bob and alice, run:'</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">results_bob</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="s">'bob'</span><span class="p">,</span> <span class="n">results_bob</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">print_every</span><span class="p">,</span> <span class="n">es</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span>
<span class="k">print</span> <span class="s">'training eve, run:'</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">results_eve</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="s">'eve'</span><span class="p">,</span> <span class="n">results_eve</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">print_every</span><span class="p">,</span> <span class="n">es</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span></code></pre></figure>
<h2 id="results">Results</h2>
<p>I trained both Alice and Bob, and then Eve, for up to 2000 iterations at a time (early stopping occurred if the decryption error was below 0.01 for a while). I did 60 overall repetitions of this adversarial training setup. I then plotted the minimum decryption error achieved by Bob and by Eve in each of these 60 runs (Figure 5).</p>
<p><img src="/images/crypt/results_conv.png" alt="Bob and Eve's decryption errors over 60 adversarial training iterations" />
<em>Figure 5: Bob and Eve’s decryption errors over 60 adversarial training iterations.</em></p>
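<p>As a rough sketch (not the post’s actual plotting code), the per-run minimums plotted in Figure 5 can be recovered from the padded error arrays, since <code class="language-plaintext highlighter-rouge">train()</code> pads each run’s error curve out to <code class="language-plaintext highlighter-rouge">max_iters</code> entries:</p>

```python
import numpy as np

# Hypothetical sketch: after 60 adversarial rounds of up to 2000 iterations
# each, results_bob is a flat array of 60 * 2000 padded error values.
def min_error_per_run(results, runs=60, iters_per_run=2000):
    # One minimum decryption error per adversarial round
    return results.reshape(runs, iters_per_run).min(axis=1)

fake_results = np.abs(np.random.randn(60 * 2000))  # stand-in for results_bob
mins = min_error_per_run(fake_results)
assert mins.shape == (60,)
```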
<p>So, it seems to work. After a few adversarial rounds, Bob and Alice figure out a way to effectively scramble the communication such that Eve cannot learn how to decrypt it.</p>
<p>I also tested the setup without the four convolutional layers, instead replacing this with an additional 2N in, 1N out hidden layer (Figure 6).</p>
<p><img src="/images/crypt/results_noconv.png" alt="Bob and Eve's decryption errors over 60 adversarial training iterations, with the convolutional phase of the network excluded" />
<em>Figure 6: Bob and Eve’s decryption errors over 60 adversarial training iterations, with the convolutional phase of the network excluded.</em></p>
<p>This seems to suggest that the convolution layers help, but perhaps it is still possible to achieve the goals of this experiment without them: Eve still isn’t able to perfectly recover the message in this setup either.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>I should note that this paper didn’t receive much love when it was <a href="https://www.reddit.com/r/MachineLearning/comments/59v9ua/r_161006918_learning_to_protect_communications/">posted</a> on the Reddit MachineLearning forum. And I have to say I kind of agree with the points made in that discussion: really the fact that this works doesn’t mean it has created good encryption. Rather it more just speaks to the weakness of the Eve network in its ability to decrypt the message. This is sort of reflected by the fact that this setup still seems to work without the convolution layers (Figure 6). Still, it is an interesting idea, and I don’t think I’m in a position to judge its academic merit.</p>
<p>Thanks for reading - thoughts, comments or questions are welcome!</p>
<h4 id="references">References</h4>
<p><a name="cite1">1</a>: Abadi, M. &amp; Andersen, D. <a href="https://arxiv.org/abs/1610.06918">Learning to Protect Communications with Adversarial Neural Cryptography</a>. October 24, 2016. Google Brain.</p>
<h1 id="detecting-music-bpm-using-neural-networks-update">Detecting Music BPM using Neural Networks - Update</h1>
<p><em>2016-09-25, by liam schoneveld</em></p>
<p>This post is a brief update to my <a href="https://nlml.github.io/neural-networks/detecting-bpm-neural-networks/">previous post</a> about using a neural network to detect the beats per minute (BPM) in short sections of audio.</p>
<p>This post is also accompanied by a new, more complete and commented version of the code. I have also decided to upload the training and validation data I used.<sup><a href="#footnote1">1</a></sup></p>
<p>If you just want to see or run the code yourself, then feel free to <a href="#howtorunthecode">skip ahead</a>.</p>
<h2 id="key-changes">Key changes</h2>
<p>The key differences between this and the last version are:</p>
<ul>
<li>
<p>I now use the raw audio (downsampled to 11kHz) as the input to the neural network, rather than a spectrogram</p>
</li>
<li>
<p>The raw audio vector is reshaped into a matrix of audio chunks, which are all processed in the same way. By default, 4 seconds of audio is represented by a 44100-length vector, which is reshaped into a matrix of size 441x100 (441 ‘timesteps’ with 100 ‘features’)</p>
</li>
<li>
<p>This input matrix is then fed into a neural network architecture that is much simpler: just a Convolution1D layer followed by max pooling</p>
</li>
<li>
<p>Training samples are now extracted randomly during training from the full-length audio files. This is achieved through the DataGen class. This means a much larger number of training samples can be synthesised without using additional memory</p>
</li>
</ul>
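<p>The reshaping described in the list above can be sketched in a couple of lines of numpy (sample counts assumed from the post: 4 seconds at 11.025 kHz):</p>

```python
import numpy as np

# 4 seconds of 11.025 kHz audio -> 44100 samples -> 441 'timesteps' x 100 'features'
audio = np.random.randn(4 * 11025)      # stand-in for a 4-second clip
chunks = audio.reshape(441, 100)        # each row is 100 consecutive samples
assert chunks.shape == (441, 100)
assert np.array_equal(chunks[0], audio[:100])
```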
<p>The Keras neural network specification now looks like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Specify and compile the neural network model
</span><span class="n">max_pool</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Convolution1D</span><span class="p">(</span><span class="mi">4</span> <span class="o">*</span> <span class="n">max_pool</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">border_mode</span><span class="o">=</span><span class="s">'same'</span><span class="p">,</span>
<span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">gen_train</span><span class="p">.</span><span class="n">num_chunks</span><span class="p">,</span>
<span class="n">gen_train</span><span class="p">.</span><span class="n">num_features</span><span class="p">)))</span>
<span class="k">if</span> <span class="n">max_pool</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Reshape</span><span class="p">((</span><span class="mi">1764</span> <span class="o">*</span> <span class="n">max_pool</span><span class="p">,</span> <span class="mi">1</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">MaxPooling1D</span><span class="p">(</span><span class="n">pool_length</span><span class="o">=</span><span class="n">max_pool</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Flatten</span><span class="p">())</span>
<span class="n">model</span><span class="p">.</span><span class="n">summary</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'mse'</span><span class="p">,</span> <span class="n">optimizer</span><span class="o">=</span><span class="n">Adam</span><span class="p">())</span></code></pre></figure>
<p>Let’s break this network topology down. We have:</p>
<ol>
<li>An input of size (4 seconds * 44100 hz / 4 (downsampling rate)) = 44100 length vector (actually this is reshaped to 441x100 prior to input)</li>
<li>This is fed into a Convolution1D layer, with 4 * <code class="language-plaintext highlighter-rouge">max_pool</code> = 16 filters (since <code class="language-plaintext highlighter-rouge">max_pool</code> defaults to 4), and a 1D convolution filter size of 3</li>
<li>A Reshape() layer just flattens the output of all these filters into one long vector</li>
<li>Max pooling takes the max of every 4 numbers in this vector</li>
<li>Then this goes through a ReLU activation</li>
</ol>
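<p>The shape bookkeeping implied by the steps above can be checked with some quick arithmetic (numbers taken from the post; this is not the actual Keras graph):</p>

```python
# Shape arithmetic for the layer stack described above
max_pool = 4
timesteps, features = 441, 100
n_filters = 4 * max_pool                 # 16 convolution filters
flat = timesteps * n_filters             # Reshape() output: 7056 = 1764 * max_pool
assert flat == 1764 * max_pool
pooled = flat // max_pool                # MaxPooling1D(pool_length=4) output length
assert pooled == 1764
```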
<h3 id="figuring-out-what-keras-convolution1d-layer-is-actually-doing">Figuring out what Keras’ Convolution1D layer is actually doing</h3>
<p>The only tricky thing here is the Convolution1D layer. It took me a while to figure out exactly what this is doing, and I couldn’t find it explained that clearly anywhere else on the web, so I’ll try to explain it here:</p>
<ul>
<li>
<p>The Convolution1D layer takes an input matrix of shape Time Steps x Features</p>
</li>
<li>
<p>It then reduces this to something of shape Time Steps x 1. So essentially it reduces all of the features to just a single number</p>
</li>
<li>
<p>However, it does this for each filter, so if we have 16 filters, we end up with an output matrix of size Time Steps x 16</p>
</li>
<li>
<p>But what is each filter actually doing? Each filter is a small kernel of shape (filter length x Features) - here 3 x 100. This kernel slides along the time axis, and at each time step it takes a weighted combination of all the features in a three-time-step window centred on that step (with <code class="language-plaintext highlighter-rouge">border_mode='same'</code>, the input is zero-padded so the output keeps the same number of time steps)</p>
</li>
<li>
<p>That weighted combination collapses each window to a single number, so each filter contributes one value per time step. The same kernel weights are reused at every time step</p>
</li>
<li>
<p>Stacking the outputs of all the filters, we end up with an output matrix of size Time Steps x Num. Filters</p>
</li>
</ul>
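<p>A small numpy sketch may help make this concrete (shapes assumed from the post; this approximates the Convolution1D operation with explicit loops rather than reproducing Keras’ implementation):</p>

```python
import numpy as np

# Convolution1D with border_mode='same': each filter is a
# (filter_length x features) kernel slid along the time axis,
# giving one number per time step per filter.
rng = np.random.default_rng(0)
x = rng.standard_normal((441, 100))          # timesteps x features
filters = rng.standard_normal((16, 3, 100))  # 16 filters, length 3, 100 channels

padded = np.pad(x, ((1, 1), (0, 0)))         # zero-pad the time axis for 'same'
windows = np.stack([padded[t:t + 3] for t in range(441)])  # (441, 3, 100)
out = np.einsum('tlf,klf->tk', windows, filters)
assert out.shape == (441, 16)                # Time Steps x Num. Filters
```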
<h2 id="results-and-performance">Results and performance</h2>
<h3 id="training-set-performance">Training set performance</h3>
<p>Here is the performance on three random four second sections of songs in the training set. You can see the audio in blue, the actual beats in black, and the predicted beats in red.</p>
<p><img src="/images/bpm2/train_1.png" alt="Performance on the training set" />
<img src="/images/bpm2/train_3.png" alt="Performance on the training set" />
<img src="/images/bpm2/train_5.png" alt="Performance on the training set" />
<em>Figures: performance on some random four second clips of audio from the training set.</em></p>
<p>The predictions (red) might seem a bit noisy, but they’re still doing a pretty good job of picking up the onset of drum hits in the music.</p>
<p>The last figure shows how this is working more as an onset detector, rather than predicting where the actual beats of the bar are. In other words, it is working more as a ‘drum hits detector’ rather than a ‘beats in a bar detector.’</p>
<p>This is still useful however, and in some circumstances desirable. We can’t really expect our program to detect where the beats are by looking at just four seconds of audio. However by applying this predictor to larger sections of audio, the autocorrelation function of the predicted onsets can be used to infer the BPM (I experimented with this, and found it to be quite an accurate way of detecting the BPM: it was usually within 1 or 2 BPM of the actual song BPM in about 90% of tracks).</p>
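<p>The autocorrelation idea can be sketched as follows (a hypothetical helper, not the code I actually used; the frame rate and BPM search range here are assumptions):</p>

```python
import numpy as np

# Infer BPM from a predicted onset/pulse vector via autocorrelation:
# the lag with the strongest autocorrelation is the beat period.
def bpm_from_onsets(onsets, fps, min_bpm=60, max_bpm=180):
    x = onsets - onsets.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]  # lags 0..len-1
    lo, hi = int(60 * fps / max_bpm), int(60 * fps / min_bpm)
    best_lag = lo + np.argmax(ac[lo:hi + 1])
    return 60.0 * fps / best_lag

fps = 100                    # onset vector resolution, frames per second
beats = np.zeros(1600)       # 16 seconds of predicted onsets
beats[::50] = 1.0            # a beat every 0.5 s, i.e. 120 BPM
assert round(bpm_from_onsets(beats, fps)) == 120
```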
<h3 id="validation-set-performance">Validation set performance</h3>
<p>Another nice outcome is that performance on the validation data is pretty much the same as that on training data, meaning that the patterns the model has learned generalise well (at least to similar tracks, since I’ve only trained with electronic music):</p>
<p><img src="/images/bpm2/vali_1.png" alt="Performance on the validation set" />
<img src="/images/bpm2/vali_4.png" alt="Performance on the validation set" />
<img src="/images/bpm2/vali_5.png" alt="Performance on the validation set" />
<em>Figures: performance on some random four second clips of audio from the validation set.</em></p>
<p>The last figure here also shows that the predictions are somewhat invariant to the amplitude (loudness) of the music being analysed.</p>
<p><a name="howtorunthecode"></a></p>
<h2 id="how-to-run-the-code">How to run the code</h2>
<p>To run the code yourself:</p>
<ol>
<li>Clone the <a href="https://github.com/nlml/bpm2">repo</a></li>
<li>Extract the <a href="https://mega.nz/#!yZZWQJwK!eBAXyY2_Qxi6McaJ3Mnj-BdQ0zfVGdYFw0xmsv8Lc4o">training and validation data</a> to the root of that repo</li>
<li>Run <code class="language-plaintext highlighter-rouge">fit_nn.py</code> to train the neural network and visualise the results</li>
<li>Alternatively or additionally, you could create your own training data by placing 44.1kHz .wavs in the <code class="language-plaintext highlighter-rouge">wavs</code> subdirectory, and then running <code class="language-plaintext highlighter-rouge">wavs_to_features.py</code>. These .wavs need to start exactly on the first beat, and have their true integer BPM as the first thing in the filename, followed by a space (see comments in code for further explanation).</li>
</ol>
<h2 id="final-thoughts">Final thoughts</h2>
<p>I played around with other, more complex network topologies, but found that this simple Convolution1D structure worked much better than anything else. Its simplicity also makes it very fast to train (at least on the quite old GPU in my laptop).</p>
<p>It is very curious that this network structure works so well. The output of each filter only has 441 timesteps, but it is clear that the predictions from the model are much more granular than this. It seems that certain filters are specialising in particular sections of each of the 441 time ‘chunks.’</p>
<p>In future it would be very interesting to drill down into the weights and see how this model is actually working. If anyone else wants to look into this then please do, and please share your findings!</p>
<p><strong>Thanks for reading - would love to hear any thoughts!</strong></p>
<h4 id="footnotes">Footnotes</h4>
<p><a name="footnote1">1</a>: I figure sharing 11kHz audio embedded in Python objects isn’t too bad a violation of copyright - if any of the material owners find this and disagree please feel free to get in contact with me and I will take it down.</p>
<h1 id="detecting-music-bpm-using-neural-networks">Detecting Music BPM using Neural Networks</h1>
<p><em>2016-08-30, by liam schoneveld</em></p>
<p>I have always wondered whether it would be possible to detect the tempo (or beats per minute, or BPM) of a piece of music using a neural network-based approach. After a small experiment a while back, I decided to make a more serious second attempt. Here’s how it went.</p>
<h2 id="approach">Approach</h2>
<p>Initially I had to throw around a few ideas regarding the best way to represent the input audio, the BPM, and what would be an ideal neural network architecture.</p>
<h3 id="input-data-format">Input data format</h3>
<p>One of the first decisions to make here is what general form the network’s input should take. I don’t know a whole lot about the physics side of audio, or frequency data more generally, but I am familiar with <a href="https://en.wikipedia.org/wiki/Fourier_analysis">Fourier analysis</a> and spectrograms.</p>
<p>I figured a frequency spectrogram would serve as an appropriate input to whatever network I was planning on training. These basically have time on the x-axis, and frequency bins on the y-axis. The values (pixel colour) then indicate the intensity of the audio signal at each frequency and time step.</p>
<p><img src="/images/freq_spectogram.png" alt="Example of Frequency Spectrogram" title="Frequency Spectrogram" /></p>
<p><em>An example frequency spectrogram from a few seconds of electronic music. Note the kick drum on each beat in the lowest frequency bin.</em></p>
<h3 id="output-data-format-to-be-predicted-by-the-network">Output data format (to be predicted by the network)</h3>
<p>I had a few different ideas here. First I thought I might try predicting the BPM directly. Then I decided I could save the network some trouble by having it try to predict the location of the beats in time. The BPM could then be inferred from this. I achieved this by constructing what I call a ‘pulse vector’ as follows:</p>
<ul>
<li>
<p>Say we had a two second audio clip. We might represent this by a vector of zeroes of length 200 - a resolution of 100 frames per second.</p>
</li>
<li>
<p>Then say the tempo was 120 BPM, and the first beat was at the start of the clip. We could then create our target vector by setting (zero-indexed) elements [0, 50, 100, 150] of this vector to 1 (as 120 BPM implies 2 beats per second).</p>
</li>
</ul>
<p>We can relatively easily infer BPM from this vector (though its resolution will determine how accurately). As a bonus, the network will also (hopefully) tell us <em>where</em> the beats are, in addition to just how often they occur. This might be useful, for instance if we wanted to synchronise two tracks together.</p>
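<p>In code, the pulse vector construction might look something like this (a sketch; the actual helper in my code may differ):</p>

```python
import numpy as np

# Build a 'pulse vector': 1.0 at each beat position, 0.0 elsewhere
def pulse_vector(length_secs, bpm, fps=100, first_beat_sec=0.0):
    target = np.zeros(int(length_secs * fps))
    step = 60.0 * fps / bpm          # frames between beats
    beat = first_beat_sec * fps
    while beat < len(target):
        target[int(round(beat))] = 1.0
        beat += step
    return target

v = pulse_vector(2, 120)             # the example above: 2 s at 120 BPM
assert list(np.nonzero(v)[0]) == [0, 50, 100, 150]
```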
<p><img src="/images/input_spectogram_and_output_pulses.png" alt="Input Spectrogram and Output Pulses" title="Input Spectrogram and Output Pulses" /></p>
<p><em>This image overlays the target output pulse vector (black) over the input frequency spectrogram of a clip of audio.</em></p>
<h3 id="neural-network-architecture">Neural network architecture</h3>
<p>My initial architecture involved just dense layers. I was working in Lasagne. I soon discovered the magic of Keras however, when looking for a way to apply the same dense layer to every time step. After switching to Keras, I also added a convolutional layer. So the current architecture is essentially a convolutional neural network. My intuition behind the inclusion and order of specific network layers is covered further below.</p>
<h2 id="creating-the-training-data">Creating the training data</h2>
<p>The main training data was obtained from my Traktor collection. Traktor is a DJing program, which is quite capable of detecting the BPM of the tracks you give it, particularly for electronic music. I have not had Traktor installed for a while, but a lot of the mp3 files in my music collection still have the Traktor-detected BPM stored with the file.</p>
<p>I copied around 30 of these mp3s to a folder, but later realised that they still needed a bit more auditing - files needed to start exactly on the first beat, and needed to not get out of time throughout the song under the assumed BPM. Therefore I opened each in <a href="https://www.reaper.fm">Reaper</a> (a digital audio workstation), chopped each song to start on exactly the first beat, ensured they didn’t go out of time, and then exported them to wav.</p>
<p><strong>Going from mp3/wav files to training data is all performed by the</strong> <code class="language-plaintext highlighter-rouge">mp3s_to_fft_features.py</code> <strong>script.</strong></p>
<p><del>I then converted<sup><a href="#footnote1">1</a></sup> these to wav and read them into Python (using <a href="https://pypi.python.org/pypi/wavio">wavio</a>). I also read the BPM from each mp3 into Python (using <a href="https://pypi.python.org/pypi/id3reader">id3reader</a>).</del></p>
<p>Update: I now already have the songs in wav format, and the BPMs were read from the filenames, which I manually entered.</p>
<p>The wav is then converted to a spectrogram. This was achieved by:</p>
<ol>
<li>Taking a sample of length <code class="language-plaintext highlighter-rouge">fft_sample_length</code> (default 768) every <code class="language-plaintext highlighter-rouge">fft_step_size</code> (default 512) samples</li>
<li>Performing a fast fourier transform (FFT) on each of these samples</li>
</ol>
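<p>A minimal version of this spectrogram step might look like the following (window sizes are the post’s defaults; the actual script may differ, e.g. in how frequency bins are downsampled):</p>

```python
import numpy as np

# Window the waveform and FFT each window (defaults from the post)
def spectrogram(wav, fft_sample_length=768, fft_step_size=512):
    starts = range(0, len(wav) - fft_sample_length, fft_step_size)
    frames = np.stack([wav[s:s + fft_sample_length] for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))  # time steps x frequency bins

wav = np.random.randn(44100)                    # one second at 44.1 kHz
spec = spectrogram(wav)
assert spec.shape == (85, 385)                  # 85 windows, 768 // 2 + 1 bins
```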
<p>The target pulse vector matching the wav’s BPM is then created using the function <code class="language-plaintext highlighter-rouge">get_target_vector</code>.</p>
<p>Then random subsets of length <code class="language-plaintext highlighter-rouge">desired_X_time_dim</code> are taken in pairs from both the spectrogram and target pulse vector. In this way, we generate lots of training inputs and outputs that are a more manageable length from just the one set of training inputs. Each sample represents about 6 seconds of audio, with different offsets for where the beats are placed (so our model has to predict where the beats are, as well as how often they occur).</p>
<p>For each ~6 second sample, we now have a 512x32 matrix as training input - 512 time frames and 32 frequency bins (the number of frequency bins can be reduced by increasing the <code class="language-plaintext highlighter-rouge">downsample</code> argument) - and a 512x1 pulse vector as training output.</p>
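<p>The random windowing can be sketched like this (a hypothetical helper; window length and shapes taken from the post):</p>

```python
import numpy as np

# Cut matched random windows from the spectrogram and its pulse vector
def sample_pair(spec, pulse, desired_X_time_dim=512, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(pulse) - desired_X_time_dim + 1)
    window = slice(start, start + desired_X_time_dim)
    return spec[window], pulse[window]

spec = np.random.randn(5000, 32)   # a full song's spectrogram (time x freq bins)
pulse = np.zeros(5000)
pulse[::50] = 1.0                  # its target pulse vector
x, y = sample_pair(spec, pulse)
assert x.shape == (512, 32) and y.shape == (512,)
```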
<p>In the latest version of the model, I have 18 songs to sample from. I create a training set by sampling from the first 13 songs, and validation and test sets by sampling from the last 5 songs. The training set contained 28800 samples.</p>
<h2 id="specifying-and-training-the-neural-network">Specifying and training the neural network</h2>
<h3 id="network-architecture---overview">Network architecture - overview</h3>
<p>As described above, I decided to go with a convolutional neural network architecture. It looked something like this:</p>
<p><img src="/images/convnet_diagram.png" alt="Diagram Depicting the Convolutional Neural Network's Architecture" title="Diagram Depicting the Convolutional Neural Network's Architecture" /></p>
<p><em>An overview of the neural network architecture.</em></p>
<p>In words, the diagram/architecture can be described as follows:</p>
<ul>
<li>
<p>The input spectrogram is passed through two sequential convolutional layers</p>
</li>
<li>
<p>The output is then reshaped into a ‘time by other’ representation</p>
</li>
<li>
<p>Keras’ TimeDistributed Dense layers are then used (in these layers, each time step is passed through the same dense layer; this substantially reduces the number of parameters needed to be estimated)</p>
</li>
<li>
<p>Finally, the output is reduced to one dimension, and passed through some additional dense layers before producing the output</p>
</li>
</ul>
<h3 id="network-architecture---details">Network architecture - details</h3>
<p>The below code snippets give specific details as to the network architecture and its implementation in Keras.</p>
<p>First, we have two convolution layers:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Convolution2D</span><span class="p">(</span><span class="n">num_filters</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">border_mode</span><span class="o">=</span><span class="s">'same'</span><span class="p">,</span>
<span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">input_time_dim</span><span class="p">,</span> <span class="n">input_freq_dim</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">MaxPooling2D</span><span class="p">(</span><span class="n">pool_size</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Convolution2D</span><span class="p">(</span><span class="n">num_filters</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">border_mode</span><span class="o">=</span><span class="s">'same'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span></code></pre></figure>
<p>I limited the amount of max-pooling. Max-pooling over the first dimension would reduce the time granularity, which I feel is important in our case, and in the second dimension we don’t have much granularity as it is (just the 32 frequency bins). Hence I only performed max pooling over the frequency dimension, and only once. I am still experimenting with the convolutional layers’ setup, but the current configuration seems to produce decent results.</p>
<p>I then reshape the output of the convolution filters so that we again have a ‘time by other stuff’ representation. This allows us to add some <code class="language-plaintext highlighter-rouge">TimeDistributed</code> layers. We have a matrix input of something like 512x1024 here, with the 1024 representing the outputs of all the convolutions. The <code class="language-plaintext highlighter-rouge">TimeDistributed</code> layers allow us to go down to something like 512x256, but with only one (1024x256) weight matrix. This dense layer is then used at all time steps. In other words, these layers densely connect the outputs at each time step to the inputs in the corresponding time steps of the following layer. The overall benefit of this is that far fewer parameters need to be learned.</p>
<p>The intuition behind this is that if we have a 1024-length vector representing each time step, then we can probably learn a useful representation at a lower dimension of that time step, which will get us to a matrix size that will actually fit in memory when we try to add some dense layers afterwards.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Reshape</span><span class="p">((</span><span class="n">input_time_dim</span><span class="p">,</span> <span class="n">input_freq_dim</span> <span class="o">*</span> <span class="n">num_filters</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">TimeDistributed</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">256</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">TimeDistributed</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">8</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span></code></pre></figure>
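<p>The parameter saving from sharing the dense layer across time steps is easy to quantify (a back-of-the-envelope check using the 512x1024 to 512x256 example above):</p>

```python
# One dense layer shared across all time steps vs. a separate one per step
timesteps, in_dim, out_dim = 512, 1024, 256
shared = in_dim * out_dim + out_dim    # TimeDistributed(Dense(256)): weights + biases
unshared = timesteps * shared          # a separate dense layer at every time step
assert shared == 262400
assert unshared == 512 * shared
```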
<p>Finally, we flatten everything and add a few dense layers. These simultaneously take into account both the time and frequency dimensions. This should be important, as the model can try to incorporate things like the fact that beats should be evenly spaced over time.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Flatten</span><span class="p">())</span>
<span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">dense_widths</span><span class="p">:</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="n">w</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="n">drop_hid</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="n">output_length</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">))</span></code></pre></figure>
<h2 id="results">Results</h2>
<p>The model usually reached a point where validation error stopped decreasing after nine or so epochs.</p>
<p>With the current configuration, the model appears to be able to detect beats in the music to some extent. Note that I’ve actually switched to inputs and outputs of length 160 (in the time dimension), though I was able to achieve similar results on the original 512-length data.</p>
<p>This first plot shows typical performance on audio clips within the training set:</p>
<p><img src="/images/train_pulses_pred.png" alt="Pulse Prediction in Training Set" title="Pulse Prediction in Training Set" /></p>
<p><em>Predicted (blue) vs actual (green) pulses - typical performance over the training set.</em></p>
<p>Performance is not as good when trying to predict pulse vectors derived from songs that were not in the training data. That said, on some songs the network still gets it (nearly) right. It also often gets the frequency of the beats correct, even though those beats are not in the correct position:</p>
<p><img src="/images/val_pulse_pred.png" alt="Pulse Prediction in Validation Set" title="Pulse Prediction in Validation Set" /></p>
<p><em>Predicted (blue) vs actual (green) pulses - typical performance over the validation set.</em></p>
<p>If we plot these predictions/actuals over the input training data, we can compare our own intuition to that of the neural network:</p>
<p><img src="/images/train_spect_pred.png" alt="Pulse Prediction in Training Set Over Spectrogram" title="Pulse Prediction in Training Set Over Spectrogram" /></p>
<p><em>Predicted (black) vs actual (white) pulses plotted over spectrogram - typical performance over the training set.</em></p>
<p>Take this one validation set example: I would find it hard to tell where the beats are by looking at this image, yet the neural net manages to figure it out at least semi-accurately.</p>
<p><img src="/images/val_spect_pred.png" alt="Pulse Prediction in Validation Set Over Spectrogram" title="Pulse Prediction in Validation Set Over Spectrogram" /></p>
<p><em>Predicted (black) vs actual (white) pulses plotted over spectrogram - typical performance over the validation set.</em></p>
<h2 id="next-steps">Next steps</h2>
<p>This is still a work in progress, but I think the results so far show that this approach has potential. From here I’ll be looking to:</p>
<ul>
<li>
<p>Use far more training data - I think many more songs are needed for the neural network to learn the general patterns that indicate beats in music</p>
</li>
<li>
<p>Read up on convolutional architectures to better understand what might work best for this particular situation</p>
</li>
<li>
<p>An approach I’ve been thinking might work better: adjust the network architecture to do ‘beat detection’ on shorter chunks of audio, then combine these outputs over a longer duration. The longer output sequence can then serve as the input to a second network that ‘cleans up’ the beat predictions using the context of the longer duration</p>
</li>
</ul>
<p>I still need to clean up the code a bit, but you can get a feel for it <a href="https://github.com/nlml/bpm">here</a>.</p>
<h2 id="random-other-thoughts">Random other thoughts</h2>
<ul>
<li>
<p>I first thought of approaching this problem using a long short-term memory (LSTM) network. The audio signal would be fed in frame-by-frame as a frequency spectrogram, and at each step the network would output whether or not that time step represents the start of a beat. This is still an appealing prospect; however, I decided to try a network architecture that I was more familiar with</p>
</li>
<li>
<p>I tried a few different methods for producing audio training data for the network. For the proof-of-concept phase, I created a bunch of WAV files containing just sine tones at varying pitches, decaying quickly and played only on the beat, at various BPMs. It was quite easy to get the network to learn to recognise the BPM from these. A step up from this was taking various tempo-synced break beats and saving them down at different tempos. These actually proved difficult to learn from - just as hard as real audio files</p>
</li>
<li>
<p>It might be also interesting to try working with the raw wav data as the input</p>
</li>
</ul>
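<p>The sine-tone proof-of-concept data described above can be sketched roughly like this. This is a minimal pure-Python version for illustration only; the function name, sample rate and decay constant are my own choices, not taken from the original code:</p>

```python
import math

def make_click_track(bpm, sr=22050, seconds=8.0, freq=440.0, decay=30.0):
    """Decaying sine tone on every beat; returns audio plus per-sample beat labels."""
    n = int(sr * seconds)
    audio = [0.0] * n
    beats = [0.0] * n
    samples_per_beat = int(sr * 60.0 / bpm)
    # One decaying sine tone, at most one beat long
    tone = [math.sin(2 * math.pi * freq * i / sr) * math.exp(-decay * i / sr)
            for i in range(samples_per_beat)]
    for start in range(0, n, samples_per_beat):
        for i, s in enumerate(tone[: n - start]):
            audio[start + i] += s
        beats[start] = 1.0  # beat-onset label for the network to predict
    return audio, beats

audio, beats = make_click_track(bpm=120)
```

<p>Varying <code>freq</code> and <code>bpm</code> per clip gives a cheap supervised dataset where the target pulse positions are known exactly.</p>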
<h4 id="footnotes">Footnotes</h4>
<p><a name="footnote1">1</a>: In the code, the function <code class="language-plaintext highlighter-rouge">convert_an_mp3_to_wav(mp3_path, wav_path)</code> tells Linux to use mpg123 to convert the input <code class="language-plaintext highlighter-rouge">mp3_path</code> to the output <code class="language-plaintext highlighter-rouge">wav_path</code>. If you are on Linux, you may need to install mpg123. If you are using a different operating system, you may need to replace this with your own function that converts the input mp3 to the output wav.</p>
liam schoneveld
I have always wondered whether it would be possible to detect the tempo (or beats per minute, or BPM) of a piece of music using a neural network-based approach. After a small experiment a while back, I decided to make a more serious second attempt. Here’s how it went.
Facebook Recruiting IV
2015-06-10T00:00:00+00:00
https://nlml.github.io/kaggle/fb-recruiting-iv
<p>The <a href="https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot">‘Facebook Recruiting IV: Human or bot?’</a> competition just ended on Kaggle. For those unfamiliar with the competition, participants downloaded a table of about 7 million bids, which corresponded to another table of around 6,000 bidders. For 4,000 of those bidders, you had to estimate the probability that they were a human or bot, based on the remaining 2,000 bidders, whose bot status was given.</p>
<p>I was first on the public leaderboard for some time, but ended up coming in 17th on the private leaderboard. Still, given I’m a beginner, I was pretty happy with the outcome, and I learned a lot. In this post I’ll give an overview of my approach to feature engineering and modelling, and share some of the lessons learned. Everything was done in R.</p>
<h2 id="feature-extraction">Feature extraction</h2>
<p>Each subheading below describes a general group of features that I extracted from the data and used in modelling. Features were estimated for each bidder_id in the training and test sets, and combined into a matrix called ‘bidderCharacteristics’.</p>
<h3 id="reading-in-the-data">Reading in the data</h3>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bids</span><span class="o"><-</span><span class="n">fread</span><span class="p">(</span><span class="s2">"Downloaded/bids.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">train</span><span class="o"><-</span><span class="n">fread</span><span class="p">(</span><span class="s2">"Downloaded/train.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">test</span><span class="o"><-</span><span class="n">fread</span><span class="p">(</span><span class="s2">"Downloaded/test.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
<h3 id="total-number-of-unique-bids-countries-devices-ips-urls-merch-categories-for-each-bidder">Total number of unique bids, countries, devices, IPs, URLs, merch categories for each bidder</h3>
<p>The first set of features were just simple sums of the number of unique(x) where x is one of the variables in the bids table. The below code shows how I calculated the number of unique countries each bidder bid from.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#How many countries for each bidder?</span><span class="w">
</span><span class="n">nCountries</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">numCountries</span><span class="o">=</span><span class="n">with</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">country</span><span class="p">,</span><span class="w"> </span><span class="n">bidder_id</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">x</span><span class="p">)))))</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">nCountries</span><span class="p">,</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">nCountries</span><span class="p">)),</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span></code></pre></figure>
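<p>For readers less familiar with R’s <code class="language-plaintext highlighter-rouge">tapply</code>, the per-bidder unique count works like this Python sketch (the bids table here is toy data invented for illustration):</p>

```python
from collections import defaultdict

# Toy (bidder_id, country) pairs -- illustrative only
bids = [("a", "us"), ("a", "us"), ("a", "nl"), ("b", "de")]

# Collect the set of countries seen per bidder, then count them
countries = defaultdict(set)
for bidder, country in bids:
    countries[bidder].add(country)
n_countries = {b: len(c) for b, c in countries.items()}
# n_countries == {"a": 2, "b": 1}
```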
<h3 id="proportion-of-bids-in-each-country-device-ip-url-merchandise-category">Proportion of bids in each country, device, IP, URL, merchandise category</h3>
<p>These features proved to be useful, particularly country. The below code shows how this was calculated for the country feature – a similar process was used for the other variables.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#Country proportions?</span><span class="w">
</span><span class="n">bidderIDsByCountry</span><span class="o"><-</span><span class="nf">round</span><span class="p">(</span><span class="n">with</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="n">table</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">country</span><span class="p">)),</span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">bidderIDsByCountry</span><span class="o"><-</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="o">/</span><span class="n">rowSums</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="p">))</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">unique</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">bidder_id</span><span class="p">))</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">cty</span><span class="o">=</span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="nf">length</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">),</span><span class="n">ncol</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="p">)))</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">bidderCharacteristics</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">bidder_id</span><span class="p">),]</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="p">(</span><span class="n">ncol</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="p">)</span><span class="m">+1</span><span class="p">)]</span><span class="o"><-</span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="p">[,</span><span class="m">1</span><span class="o">:</span><span class="n">ncol</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="p">)]))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">)[</span><span class="m">2</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">))]</span><span class="o"><-</span><span class="nf">c</span><span class="p">(</span><span class="s1">'cty.none'</span><span class="p">,</span><span class="n">paste</span><span class="p">(</span><span class="s1">'cty.'</span><span class="p">,</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="p">)[</span><span class="m">2</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidderIDsByCountry</span><span class="p">))],</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">))</span></code></pre></figure>
<p>Given the sheer number of IPs and URLs, I limited these lists to only those IPs and URLs that were used for at least 1000 bids. This still left me with about 600 IP variables and 300 URL variables. Correlation plots showed highly-correlated clusters of some URLs and IPs, so to reduce dimensionality I tried principal components analysis (PCA) on these variables. It seemed to help with the URLs; in my final model I included their top 50 principal components. PCA helped less with the IPs – there I ended up using randomForest importance scores on the full set to decide which ones to include, which wasn’t many in the end. Here’s the code used to perform PCA on the URLs:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">urls</span><span class="o"><-</span><span class="n">bidderCharacteristics</span><span class="p">[,</span><span class="n">grep</span><span class="p">(</span><span class="s2">"url\\."</span><span class="p">,</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">))]</span><span class="w">
</span><span class="n">urls</span><span class="o"><-</span><span class="n">removeLowVarianceCols</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="m">4</span><span class="p">)</span><span class="w">
</span><span class="n">url.pca</span><span class="o"><-</span><span class="n">prcomp</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="n">scale.</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">urls</span><span class="o"><-</span><span class="n">predict</span><span class="p">(</span><span class="n">url.pca</span><span class="p">,</span><span class="n">urls</span><span class="p">)[,</span><span class="m">1</span><span class="o">:</span><span class="m">50</span><span class="p">]</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">cbind</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">url</span><span class="o">=</span><span class="n">urls</span><span class="p">)</span></code></pre></figure>
<h3 id="mean-popularity-of-country-device-ip-url-merchandise-categories-used">Mean popularity of country, device, IP, URL, merchandise categories used</h3>
<p>I defined the ‘popularity’ of a particular country, device, or etc., as the number of unique bidder_id’s that bid from that variable. For each bidder, I then took the mean of these popularity scores given the countries, devices, etc., that they bid from. Here’s the code snippet used to calculate mean IP popularity:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#Mean popularity of IPs used</span><span class="w">
</span><span class="n">nBidderIDsPerIP</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">numBidderIDsPerIP</span><span class="o">=</span><span class="n">with</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="w"> </span><span class="n">ip</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">x</span><span class="p">)))))</span><span class="w">
</span><span class="n">ipPopularity</span><span class="o"><-</span><span class="n">subset</span><span class="p">(</span><span class="n">bids</span><span class="p">[</span><span class="o">!</span><span class="n">duplicated</span><span class="p">(</span><span class="n">subset</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">select</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">ip</span><span class="p">)))],</span><span class="n">select</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">ip</span><span class="p">))</span><span class="w">
</span><span class="n">ipPopularity</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">ipPopularity</span><span class="p">,</span><span class="n">cbind</span><span class="p">(</span><span class="n">as.data.table</span><span class="p">(</span><span class="n">nBidderIDsPerIP</span><span class="p">),</span><span class="n">ip</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">nBidderIDsPerIP</span><span class="p">)),</span><span class="n">by</span><span class="o">=</span><span class="s2">"ip"</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">ipPopularity</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">ipPop</span><span class="o">=</span><span class="n">with</span><span class="p">(</span><span class="n">ipPopularity</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">numBidderIDsPerIP</span><span class="p">,</span><span class="w"> </span><span class="n">bidder_id</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="n">nBidderIDsPerIP</span><span class="p">)</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">ipPopularity</span><span class="p">,</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">ipPopularity</span><span class="p">)),</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span></code></pre></figure>
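<p>That R snippet is fairly dense, so here is the same logic sketched in Python on a made-up bids table (toy data and my own variable names):</p>

```python
from collections import defaultdict

# Toy (bidder_id, ip) pairs -- illustrative only
bids = [("a", "ip1"), ("b", "ip1"), ("a", "ip2"), ("c", "ip3"), ("a", "ip3")]

# Popularity of an IP = number of unique bidders seen on it
bidders_per_ip = defaultdict(set)
for bidder, ip in bids:
    bidders_per_ip[ip].add(bidder)
popularity = {ip: len(b) for ip, b in bidders_per_ip.items()}

# Each bidder's feature = mean popularity over the IPs they used
ips_used = defaultdict(set)
for bidder, ip in bids:
    ips_used[bidder].add(ip)
mean_ip_pop = {b: sum(popularity[ip] for ip in ips) / len(ips)
               for b, ips in ips_used.items()}
# Bidder "a" used ip1 (2 bidders), ip2 (1) and ip3 (2): mean 5/3
```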
<h3 id="mean-number-of-bids-from-countries-devices-ips-etc-bidded-from">Mean number of bids from countries, devices, IPs, etc., bid from</h3>
<p>Very similar to the previous feature: this looked at how many bids were made from each country, device, etc., and then gave each bidder the mean of these counts across the countries, devices, etc., that they bid from.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#Mean number of bids for Countrys bidded from</span><span class="w">
</span><span class="n">nBidsPerCountry</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">numBidsEachCountry</span><span class="o">=</span><span class="n">with</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">bid_id</span><span class="p">,</span><span class="w"> </span><span class="n">country</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="n">biddersAndCountrys</span><span class="o"><-</span><span class="n">subset</span><span class="p">(</span><span class="n">bids</span><span class="p">[</span><span class="o">!</span><span class="n">duplicated</span><span class="p">(</span><span class="n">subset</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">select</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">country</span><span class="p">)))],</span><span class="n">select</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">country</span><span class="p">))</span><span class="w">
</span><span class="n">biddersAndCountrys</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">country</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">nBidsPerCountry</span><span class="p">),</span><span class="n">nBidsPerCountry</span><span class="p">),</span><span class="n">biddersAndCountrys</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'country'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'country'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">biddersAndCountrys</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">meanCountryPopularity</span><span class="o">=</span><span class="n">with</span><span class="p">(</span><span class="n">biddersAndCountrys</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">numBidsEachCountry</span><span class="p">,</span><span class="w"> </span><span class="n">bidder_id</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="n">nBidsPerCountry</span><span class="p">)</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">biddersAndCountrys</span><span class="p">,</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">biddersAndCountrys</span><span class="p">)),</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span></code></pre></figure>
<h3 id="time-domain">Time domain</h3>
<p>As it appears many others in this competition also realised, it became clear to me fairly early on that the bids came from three distinct three-day time periods, and that the time between the first and last bid was probably very close to exactly 31 days. Based on this information I could convert the obfuscated ‘time’ field into more meaningful units.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">day</span><span class="o"><-</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">time</span><span class="p">)</span><span class="o">-</span><span class="nf">min</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">time</span><span class="p">))</span><span class="o">/</span><span class="m">31</span><span class="w">
</span><span class="n">hour</span><span class="o"><-</span><span class="n">day</span><span class="o">/</span><span class="m">24</span><span class="w">
</span><span class="n">bids</span><span class="o">$</span><span class="n">hour24</span><span class="o"><-</span><span class="nf">floor</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">time</span><span class="o">/</span><span class="n">hour</span><span class="p">)</span><span class="o">%%</span><span class="m">24</span></code></pre></figure>
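<p>A quick sanity check of that conversion, in Python with fabricated raw timestamps (chosen so that one raw unit happens to equal one second):</p>

```python
SPAN_DAYS = 31                            # inferred span of the dataset, per the text
t_min, t_max = 0, SPAN_DAYS * 24 * 3600   # made-up timestamps covering 31 days

day = (t_max - t_min) / SPAN_DAYS         # length of one 'day' in raw units
hour = day / 24

def hour24(t):
    # Hour-of-day index, mirroring floor(time / hour) %% 24 in the R code
    return int(t // hour) % 24
```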
<p>A number of other features stemmed from having this info on hand…</p>
<h3 id="proportion-of-bids-in-each-hour-of-the-day">Proportion of bids in each hour of the day</h3>
<p>A plot of bot density by hour of the day showed bots were more common during the ‘off-peak’ bidding periods. This suggested that taking the total proportion of a user’s bids in each hour of the day was likely to be a useful feature. The below code shows how I did this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bidsPerTimeSlotPerBidder</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">with</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">bid_id</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">hour24</span><span class="p">),</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="n">bidsPerTimeSlotPerBidder</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">bidsPerTimeSlotPerBidder</span><span class="p">)]</span><span class="o"><</span><span class="m">-0</span><span class="w">
</span><span class="n">bidsPerTimeSlotPerBidder</span><span class="o"><-</span><span class="n">bidsPerTimeSlotPerBidder</span><span class="o">/</span><span class="n">rowSums</span><span class="p">(</span><span class="n">bidsPerTimeSlotPerBidder</span><span class="p">)</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">bidsPerTimeSlotPerBidder</span><span class="p">,</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">bidsPerTimeSlotPerBidder</span><span class="p">)),</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span></code></pre></figure>
<h3 id="bids-per-time-mean-time-between-bids-and-time-active">‘Bids per time’, mean time between bids, and ‘time active’</h3>
<p>Bids per time and mean time between bids are self-explanatory. Time active I defined as the time between a bidder’s first and last bid.</p>
<p>I originally extracted these three features using the entire bids table at once. I later realised, however, that they could be skewed by the fact that there were three distinct time chunks. For instance, mean time between bids was calculated by averaging the gaps between a user’s consecutive bids; if a user had bids in two separate time chunks, this metric would be artificially inflated by the data-free gap between the chunks.</p>
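<p>The bias is easy to see in a toy Python sketch with fabricated timestamps: averaging gaps across the whole series is dominated by the empty stretch between chunks, while averaging within each chunk is not.</p>

```python
# Two bursts of bids separated by a long quiet period (made-up timestamps)
chunk1 = [0, 1, 2, 3]
chunk2 = [1000, 1001, 1002]

def mean_gap(times):
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

naive = mean_gap(chunk1 + chunk2)                      # inflated by the gap
per_chunk = (mean_gap(chunk1) + mean_gap(chunk2)) / 2  # the fairer estimate
```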
<p>Thus I figured out the ‘cut off times’ for each chunk’s start and end, divided the bids into three chunks, extracted the features from each, and then averaged each feature across the three chunks:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#This section calculates the 'bid response time', how long between bids, and the total time active (in each 3 day 'chunk')</span><span class="w">
</span><span class="n">bidsO</span><span class="o"><-</span><span class="n">bids</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">auction</span><span class="p">,</span><span class="n">time</span><span class="p">)]</span><span class="w">
</span><span class="c1">#Reduce unnecessary granularity of the time field</span><span class="w">
</span><span class="n">bidsO</span><span class="o">$</span><span class="n">time</span><span class="o"><-</span><span class="nf">floor</span><span class="p">(</span><span class="n">bidsO</span><span class="o">$</span><span class="n">time</span><span class="o">/</span><span class="m">1e6</span><span class="p">)</span><span class="w">
</span><span class="n">cutoffTime1</span><span class="o"><</span><span class="m">-9670569</span><span class="o">*</span><span class="m">1e9</span><span class="o">/</span><span class="m">1e6</span><span class="w">
</span><span class="n">cutoffTime2</span><span class="o"><</span><span class="m">-9734233</span><span class="o">*</span><span class="m">1e9</span><span class="o">/</span><span class="m">1e6</span><span class="w">
</span><span class="c1">#Split the bids into three chunks according to the cut off times</span><span class="w">
</span><span class="n">bidsTimeChunk</span><span class="o"><-</span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">time1</span><span class="o"><-</span><span class="n">bidsO</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">bidsO</span><span class="o">$</span><span class="n">time</span><span class="o"><=</span><span class="n">cutoffTime1</span><span class="p">),]</span><span class="w">
</span><span class="p">,</span><span class="n">time2</span><span class="o"><-</span><span class="n">bidsO</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">bidsO</span><span class="o">$</span><span class="n">time</span><span class="o">></span><span class="n">cutoffTime1</span><span class="o">&</span><span class="n">bidsO</span><span class="o">$</span><span class="n">time</span><span class="o"><</span><span class="n">cutoffTime2</span><span class="p">),]</span><span class="w">
</span><span class="p">,</span><span class="n">time3</span><span class="o"><-</span><span class="n">bidsO</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">bidsO</span><span class="o">$</span><span class="n">time</span><span class="o">>=</span><span class="n">cutoffTime2</span><span class="p">),]</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1">#Initialisation</span><span class="w">
</span><span class="n">meanTimeDiffByBidder</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">firstLastBid</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">numBids</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">overallMean</span><span class="o"><</span><span class="m">-0</span><span class="w">
</span><span class="c1">#Calculate mean difference in time between bids for each bidder.</span><span class="w">
</span><span class="c1">#Do this by lagging the bids table by one bid, then subtracting the lagged bid time from the original. Then take the average of this for each bidder.</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">){</span><span class="w">
</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">auctionL1</span><span class="o"><-</span><span class="nf">c</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">auction</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">auction</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]])</span><span class="m">-1</span><span class="p">)])</span><span class="w">
</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">timeDiff</span><span class="o"><-</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">time</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">time</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">time</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]])</span><span class="m">-1</span><span class="p">)])</span><span class="w">
</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">timeDiff</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="o"><-</span><span class="kc">NA</span><span class="w">
</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">timeDiff</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">auction</span><span class="o">!=</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">auctionL1</span><span class="p">)]</span><span class="o"><-</span><span class="kc">NA</span><span class="w">
</span><span class="n">meanTimeDiffByBidder</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o"><-</span><span class="n">ddply</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="o">~</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">summarise</span><span class="p">,</span><span class="n">mean</span><span class="o">=</span><span class="n">mean</span><span class="p">(</span><span class="n">timeDiff</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="n">overallMean</span><span class="o"><-</span><span class="n">overallMean</span><span class="o">+</span><span class="n">mean</span><span class="p">(</span><span class="n">meanTimeDiffByBidder</span><span class="p">[[</span><span class="n">i</span><span class="p">]][,</span><span class="m">2</span><span class="p">],</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="o">*</span><span class="n">nrow</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]])</span><span class="o">/</span><span class="n">nrow</span><span class="p">(</span><span class="n">bidsO</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">#Replace any NAs with the overall mean</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">){</span><span class="w">
</span><span class="n">meanTimeDiffByBidder</span><span class="p">[[</span><span class="n">i</span><span class="p">]][</span><span class="n">which</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">meanTimeDiffByBidder</span><span class="p">[[</span><span class="n">i</span><span class="p">]][,</span><span class="m">2</span><span class="p">])),</span><span class="m">2</span><span class="p">]</span><span class="o"><-</span><span class="n">overallMean</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">#Calculated ‘bids per time’ and ‘time active’</span><span class="w">
</span><span class="n">bidsPerTime</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">timeActive</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">overallMean</span><span class="o"><</span><span class="m">-0</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">){</span><span class="w">
</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o"><-</span><span class="n">ddply</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="o">~</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">summarise</span><span class="p">,</span><span class="n">firstBid</span><span class="o">=</span><span class="nf">min</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">),</span><span class="n">lastBid</span><span class="o">=</span><span class="nf">max</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">timeActive</span><span class="o"><-</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">lastBid</span><span class="o">-</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">firstBid</span><span class="w">
</span><span class="n">numBids</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">numBids</span><span class="o">=</span><span class="n">with</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">bid_id</span><span class="p">,</span><span class="w"> </span><span class="n">bidder_id</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">bidsPerTime</span><span class="o"><-</span><span class="n">ifelse</span><span class="p">(</span><span class="n">numBids</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">numBids</span><span class="o">></span><span class="m">1</span><span class="p">,</span><span class="n">numBids</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">numBids</span><span class="o">/</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">timeActive</span><span class="p">,</span><span class="kc">NA</span><span class="p">)</span><span class="w">
</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">bidsPerTime</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">bidsPerTime</span><span class="o">==</span><span class="kc">Inf</span><span class="p">)]</span><span class="o"><-</span><span class="kc">NA</span><span class="w">
</span><span class="n">overallMean</span><span class="o"><-</span><span class="n">overallMean</span><span class="o">+</span><span class="n">mean</span><span class="p">(</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">bidsPerTime</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="o">*</span><span class="n">nrow</span><span class="p">(</span><span class="n">bidsTimeChunk</span><span class="p">[[</span><span class="n">i</span><span class="p">]])</span><span class="o">/</span><span class="n">nrow</span><span class="p">(</span><span class="n">bidsO</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">){</span><span class="w">
</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">bidsPerTime</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">$</span><span class="n">bidsPerTime</span><span class="p">))]</span><span class="o"><-</span><span class="n">overallMean</span><span class="w">
</span><span class="n">bidsPerTime</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o"><-</span><span class="n">subset</span><span class="p">(</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="n">select</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">bidsPerTime</span><span class="p">))</span><span class="w">
</span><span class="n">timeActive</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o"><-</span><span class="n">subset</span><span class="p">(</span><span class="n">firstLastBid</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="n">select</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">timeActive</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">#Take the average 'bid response time' for each bidder over the three time chunks</span><span class="w">
</span><span class="n">meanTimeDiffByBidder</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">merge</span><span class="p">(</span><span class="n">meanTimeDiffByBidder</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="n">meanTimeDiffByBidder</span><span class="p">[[</span><span class="m">2</span><span class="p">]],</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="p">,</span><span class="n">meanTimeDiffByBidder</span><span class="p">[[</span><span class="m">3</span><span class="p">]],</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">meanTimeDiffByBidder</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">meanTimeDiffByBidder</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="n">meanTimeBwBids</span><span class="o">=</span><span class="n">rowMeans</span><span class="p">(</span><span class="n">meanTimeDiffByBidder</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">],</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="c1">#Take the average of nBids/(last bid - first bid) for each bidder over the three time chunks</span><span class="w">
</span><span class="n">bidsPerTime</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">merge</span><span class="p">(</span><span class="n">bidsPerTime</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="n">bidsPerTime</span><span class="p">[[</span><span class="m">2</span><span class="p">]],</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="p">,</span><span class="n">bidsPerTime</span><span class="p">[[</span><span class="m">3</span><span class="p">]],</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">bidsPerTime</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">bidsPerTime</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="n">bidsPerTime</span><span class="o">=</span><span class="n">rowMeans</span><span class="p">(</span><span class="n">bidsPerTime</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">],</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="c1">#Take the sum of (last bid - first bid) for each bidder over the three time chunks</span><span class="w">
</span><span class="n">timeActive</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">merge</span><span class="p">(</span><span class="n">timeActive</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="n">timeActive</span><span class="p">[[</span><span class="m">2</span><span class="p">]],</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="p">,</span><span class="n">timeActive</span><span class="p">[[</span><span class="m">3</span><span class="p">]],</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">timeActive</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">timeActive</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="n">timeActive</span><span class="o">=</span><span class="n">rowSums</span><span class="p">(</span><span class="n">timeActive</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">],</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="c1">#Add to bidder characteristics matrix</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">meanTimeDiffByBidder</span><span class="p">,</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">bidsPerTime</span><span class="p">,</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">timeActive</span><span class="p">,</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="s1">'bidsPerHour'</span><span class="p">,</span><span class="s1">'meanBidsPerHour'</span><span class="p">,</span><span class="s1">'varInBidsPerHour'</span><span class="p">,</span><span class="s1">'countriesPerHour'</span><span class="p">,</span><span class="s1">'meanCountriesPerHour'</span><span class="p">,</span><span class="s1">'varInCountriesPerHour'</span><span class="p">,</span><span class="s1">'auctionsPerHour'</span><span class="p">,</span><span class="s1">'meanAuctionsPerHour'</span><span class="p">,</span><span class="s1">'varInAuctionsPerHour'</span><span class="p">,</span><span class="s1">'devicesPerHour'</span><span class="p">,</span><span class="s1">'meanDevicesPerHour'</span><span class="p">,</span><span class="s1">'varInDevicesPerHour'</span><span class="p">,</span><span class="s1">'firstLastBid'</span><span class="p">,</span><span class="s1">'numBids'</span><span class="p">,</span><span class="s1">'overallMean'</span><span class="p">,</span><span class="s1">'bidsTimeChunk'</span><span class="p">,</span><span class="s1">'meanTimeDiffByBidder'</span><span class="p">,</span><span class="s1">'time1'</span><span class="p">,</span><span class="s1">'time2'</span><span class="p">,</span><span class="s1">'time3'</span><span class="p">,</span><span class="s1">'bidsO'</span><span class="p">,</span><span class="s1">'bidderIDsByCountry'</span><span class="p">)</span></code></pre></figure>
<h3 id="proportion-of-auctions-where-a-bidder-was-the-last-bidder">Proportion of auctions where a bidder was the last bidder</h3>
<p>I included this feature because being the final bidder on an auction potentially means that bidder won it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#Propn of auctions where they were the final bidder..</span><span class="w">
</span><span class="n">lastBidsOnAuction</span><span class="o"><-</span><span class="n">ddply</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="o">~</span><span class="n">auction</span><span class="p">,</span><span class="n">summarise</span><span class="p">,</span><span class="n">time</span><span class="o">=</span><span class="nf">max</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="n">lastBidsOnAuction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">lastBidsOnAuction</span><span class="p">,</span><span class="w"> </span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">by.x</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"auction"</span><span class="p">,</span><span class="s2">"time"</span><span class="p">),</span><span class="w"> </span><span class="n">by.y</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"auction"</span><span class="p">,</span><span class="s2">"time"</span><span class="p">))</span><span class="w">
</span><span class="n">nLastBids</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">numLastBids</span><span class="o">=</span><span class="n">with</span><span class="p">(</span><span class="n">lastBidsOnAuction</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">bid_id</span><span class="p">,</span><span class="w"> </span><span class="n">bidder_id</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">nLastBids</span><span class="p">,</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">nLastBids</span><span class="p">)),</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">numLastBids</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">numLastBids</span><span class="p">))]</span><span class="o"><</span><span class="m">-0</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">finalBidRate</span><span class="o"><-</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">numLastBids</span><span class="o">/</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">numAuctions</span></code></pre></figure>
<h3 id="mean-duration-of-auctions-a-bidder-participated-in">Mean duration of auctions a bidder participated in</h3>
<p>This didn’t turn out to be particularly useful:</p>
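<p>The core of this feature (auction duration = last bid time minus first bid time, averaged over each bidder’s bids) can be sketched compactly; here in Python/pandas with made-up data rather than the competition’s bids table:</p>

```python
import pandas as pd

# Toy bids table (hypothetical data) with the relevant columns.
bids = pd.DataFrame({
    "bidder_id": ["a", "a", "b", "b"],
    "auction":   ["x", "y", "x", "x"],
    "time":      [0, 10, 2, 8],
})

# Duration of each auction = time of its last bid minus its first bid.
durations = bids.groupby("auction")["time"].agg(lambda t: t.max() - t.min())

# Attach each auction's duration to its bids, then average per bidder.
mean_dur = (bids.assign(dur=bids["auction"].map(durations))
                .groupby("bidder_id")["dur"].mean())

print(mean_dur["a"])  # bids in auctions x (dur 8) and y (dur 0) -> 4.0
print(mean_dur["b"])  # both bids in auction x -> 8.0
```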
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#Mean duration of auctions participated in</span><span class="w">
</span><span class="n">auctionDurations</span><span class="o"><-</span><span class="n">ddply</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="o">~</span><span class="n">auction</span><span class="p">,</span><span class="n">summarise</span><span class="p">,</span><span class="n">firstBid</span><span class="o">=</span><span class="nf">min</span><span class="p">(</span><span class="n">time</span><span class="o">/</span><span class="m">1e6</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">),</span><span class="n">lastBid</span><span class="o">=</span><span class="nf">max</span><span class="p">(</span><span class="n">time</span><span class="o">/</span><span class="m">1e6</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="n">auctionDurations</span><span class="o">$</span><span class="n">dur</span><span class="o"><-</span><span class="n">auctionDurations</span><span class="o">$</span><span class="n">lastBid</span><span class="o">-</span><span class="n">auctionDurations</span><span class="o">$</span><span class="n">firstBid</span><span class="w">
</span><span class="n">auctionDurations</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">]</span><span class="o"><-</span><span class="n">auctionDurations</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">4</span><span class="p">]</span><span class="o">/</span><span class="p">(</span><span class="n">hour</span><span class="o">/</span><span class="m">1e6</span><span class="p">)</span><span class="w">
</span><span class="n">auctionDurations</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="o"><-</span><span class="n">auctionDurations</span><span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="o">-</span><span class="nf">min</span><span class="p">(</span><span class="n">auctionDurations</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">auctionDurations</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">with</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="n">dur</span><span class="o">=</span><span class="n">auctionDurations</span><span class="o">$</span><span class="n">dur</span><span class="p">[</span><span class="n">match</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">auction</span><span class="p">,</span><span class="n">auctionDurations</span><span class="o">$</span><span class="n">auction</span><span class="p">)]),</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">dur</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">),</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">auctionDurations</span><span class="o"><-</span><span class="n">auctionDurations</span><span class="p">[</span><span class="n">match</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">rownames</span><span class="p">(</span><span class="n">auctionDurations</span><span class="p">)),</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="n">auctionDurations</span><span class="p">)</span></code></pre></figure>
<h3 id="variance-in-proportion-of-bids-in-each-hour">Variance in proportion of bids in each hour</h3>
<p>The idea here was that a human might be more varied in the hours of the day they bid in, or perhaps the opposite. For each of the nine days in the data, I calculated the proportion of each bidder’s bids that fell in each hour of the day, then took the variance of these proportions across days:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#Variance in proportion of bids in each hour...</span><span class="w">
</span><span class="n">bids</span><span class="o">$</span><span class="n">hour</span><span class="o"><-</span><span class="nf">floor</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">time</span><span class="o">/</span><span class="n">hour</span><span class="p">)</span><span class="w">
</span><span class="n">bids</span><span class="o">$</span><span class="n">hour24</span><span class="o"><-</span><span class="nf">floor</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">time</span><span class="o">/</span><span class="n">hour</span><span class="p">)</span><span class="o">%%</span><span class="m">24</span><span class="w">
</span><span class="n">bids</span><span class="o">$</span><span class="n">day</span><span class="o"><-</span><span class="nf">floor</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">time</span><span class="o">/</span><span class="n">day</span><span class="p">)</span><span class="w">
</span><span class="n">bids</span><span class="o">$</span><span class="n">hour</span><span class="o"><-</span><span class="n">bids</span><span class="o">$</span><span class="n">hour</span><span class="o">-</span><span class="nf">min</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">hour</span><span class="p">)</span><span class="w">
</span><span class="n">bids</span><span class="o">$</span><span class="n">day</span><span class="o"><-</span><span class="n">bids</span><span class="o">$</span><span class="n">day</span><span class="o">-</span><span class="nf">min</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">day</span><span class="p">)</span><span class="w">
</span><span class="n">bidsInEachHour</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">with</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">bid_id</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">day</span><span class="p">,</span><span class="n">hour24</span><span class="p">),</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">))))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">d</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">day</span><span class="p">)){</span><span class="w">
</span><span class="n">bidsInEachHour</span><span class="p">[,</span><span class="n">grep</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"X"</span><span class="p">,</span><span class="n">d</span><span class="p">,</span><span class="s2">"\\."</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s1">''</span><span class="p">),</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">))]</span><span class="o"><-</span><span class="w">
</span><span class="n">bidsInEachHour</span><span class="p">[,</span><span class="n">grep</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"X"</span><span class="p">,</span><span class="n">d</span><span class="p">,</span><span class="s2">"\\."</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s1">''</span><span class="p">),</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">))]</span><span class="o">/</span><span class="w">
</span><span class="n">rowSums</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">[,</span><span class="n">grep</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"X"</span><span class="p">,</span><span class="n">d</span><span class="p">,</span><span class="s2">"\\."</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s1">''</span><span class="p">),</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">))],</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bidsInEachHour</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">)]</span><span class="o"><</span><span class="m">-0</span><span class="w">
</span><span class="n">propnBids</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">varPropnBids</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">0</span><span class="o">:</span><span class="m">23</span><span class="p">){</span><span class="w">
</span><span class="n">propnBids</span><span class="p">[[</span><span class="n">n</span><span class="m">+1</span><span class="p">]]</span><span class="o"><-</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">[,</span><span class="n">grep</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"\\."</span><span class="p">,</span><span class="n">n</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s1">''</span><span class="p">),</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">))],</span><span class="w">
</span><span class="n">bidder_id</span><span class="o">=</span><span class="n">row.names</span><span class="p">(</span><span class="n">bidsInEachHour</span><span class="p">))</span><span class="w">
</span><span class="n">propnBids</span><span class="p">[[</span><span class="n">n</span><span class="m">+1</span><span class="p">]]</span><span class="o"><-</span><span class="n">apply</span><span class="p">(</span><span class="n">propnBids</span><span class="p">[[</span><span class="n">n</span><span class="m">+1</span><span class="p">]],</span><span class="m">1</span><span class="p">,</span><span class="k">function</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">))</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">cbind</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">propnBids</span><span class="p">[[</span><span class="n">n</span><span class="m">+1</span><span class="p">]][</span><span class="w">
</span><span class="n">match</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">propnBids</span><span class="p">[[</span><span class="n">n</span><span class="m">+1</span><span class="p">]]),</span><span class="n">bidderCharacteristics</span><span class="o">$</span><span class="n">bidder_id</span><span class="p">)</span><span class="w">
</span><span class="p">])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">)[(</span><span class="n">ncol</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">)</span><span class="m">-23</span><span class="p">)</span><span class="o">:</span><span class="n">ncol</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">)]</span><span class="o"><-</span><span class="n">paste</span><span class="p">(</span><span class="s2">"hrVar"</span><span class="p">,</span><span class="m">0</span><span class="o">:</span><span class="m">23</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span></code></pre></figure>
<h3 id="mean-variance-skewness-and-kurtosis-of-bids-per-auction-bids-per-device-auctions-per-device-auctions-per-county-and-so-on">Mean, variance, skewness and kurtosis of bids per auction, bids per device… auctions per device, auctions per country… and so on</h3>
<p>Using the example of auctions per country, this feature was extracted by creating a table of bidder_ids by countries, with the number of unique auctions for each country/bidder_id combination placed in the table cells. Row-wise mean, variance, skewness and kurtosis were then obtained. This was repeated for many possible combinations of IPs, URLs, bids, auctions, devices, countries and hours:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">meanVarSkewKurt</span><span class="o"><-</span><span class="k">function</span><span class="p">(</span><span class="n">inData</span><span class="p">){</span><span class="w">
</span><span class="n">mean</span><span class="o"><-</span><span class="n">apply</span><span class="p">(</span><span class="n">inData</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">var</span><span class="o"><-</span><span class="n">apply</span><span class="p">(</span><span class="n">inData</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">mean</span><span class="p">)]</span><span class="o"><-</span><span class="n">mean</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">var</span><span class="o"><-</span><span class="n">var</span><span class="o">/</span><span class="n">mean</span><span class="w">
</span><span class="n">var</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">var</span><span class="p">)]</span><span class="o"><-</span><span class="n">mean</span><span class="p">(</span><span class="n">var</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">skewness</span><span class="o"><-</span><span class="n">apply</span><span class="p">(</span><span class="n">inData</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">skewness</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">kurtosis</span><span class="o"><-</span><span class="n">apply</span><span class="p">(</span><span class="n">inData</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">kurtosis</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">skewness</span><span class="o"><-</span><span class="n">skewness</span><span class="o">/</span><span class="n">mean</span><span class="w">
</span><span class="n">kurtosis</span><span class="o"><-</span><span class="n">kurtosis</span><span class="o">/</span><span class="n">mean</span><span class="w">
</span><span class="n">skewness</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">skewness</span><span class="p">)]</span><span class="o"><-</span><span class="n">mean</span><span class="p">(</span><span class="n">skewness</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">kurtosis</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">kurtosis</span><span class="p">)]</span><span class="o"><-</span><span class="n">mean</span><span class="p">(</span><span class="n">kurtosis</span><span class="p">,</span><span class="n">na.rm</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">mean</span><span class="p">)</span><span class="o">==</span><span class="nf">names</span><span class="p">(</span><span class="n">skewness</span><span class="p">))</span><span class="o">==</span><span class="m">6614</span><span class="o">&</span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">mean</span><span class="p">)</span><span class="o">==</span><span class="nf">names</span><span class="p">(</span><span class="n">var</span><span class="p">))</span><span class="o">==</span><span class="m">6614</span><span class="p">)</span><span class="o">&</span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">mean</span><span class="p">)</span><span class="o">==</span><span class="nf">names</span><span class="p">(</span><span class="n">kurtosis</span><span class="p">))</span><span class="o">==</span><span class="m">6614</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">row.names</span><span class="o">=</span><span class="nf">names</span><span class="p">(</span><span class="n">mean</span><span class="p">),</span><span class="n">mean</span><span class="p">,</span><span class="n">var</span><span class="p">,</span><span class="n">skewness</span><span class="p">,</span><span class="n">kurtosis</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="s2">"ERR"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bids</span><span class="o">$</span><span class="n">hour</span><span class="o"><-</span><span class="nf">floor</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">time</span><span class="o">/</span><span class="n">hour</span><span class="p">)</span><span class="w">
</span><span class="n">names</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">big</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">row.names</span><span class="o">=</span><span class="n">unique</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">bidder_id</span><span class="p">)[</span><span class="n">order</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">bids</span><span class="o">$</span><span class="n">bidder_id</span><span class="p">))])</span><span class="w">
</span><span class="n">system.time</span><span class="p">({</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">xPer</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ip'</span><span class="p">,</span><span class="s1">'url'</span><span class="p">,</span><span class="s1">'bid_id'</span><span class="p">,</span><span class="s1">'auction'</span><span class="p">,</span><span class="s1">'device'</span><span class="p">,</span><span class="s1">'country'</span><span class="p">)){</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">yy</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'auction'</span><span class="p">,</span><span class="s1">'device'</span><span class="p">,</span><span class="s1">'hour'</span><span class="p">,</span><span class="s1">'country'</span><span class="p">)){</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">xPer</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">yy</span><span class="p">){</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"_id"</span><span class="p">,</span><span class="s2">""</span><span class="p">,</span><span class="n">xPer</span><span class="p">),</span><span class="s2">"sPer"</span><span class="p">,</span><span class="n">yy</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="n">big</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">big</span><span class="p">,</span><span class="w">
</span><span class="n">meanVarSkewKurt</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">with</span><span class="p">(</span><span class="n">bids</span><span class="p">,</span><span class="w"> </span><span class="n">tapply</span><span class="p">(</span><span class="n">get</span><span class="p">(</span><span class="n">xPer</span><span class="p">),</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">get</span><span class="p">(</span><span class="n">yy</span><span class="p">)),</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">x</span><span class="p">))))))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">ncol</span><span class="p">(</span><span class="n">big</span><span class="p">)</span><span class="o">==</span><span class="m">4</span><span class="p">){</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">big</span><span class="p">)</span><span class="o"><-</span><span class="n">paste</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"_id"</span><span class="p">,</span><span class="s2">""</span><span class="p">,</span><span class="n">xPer</span><span class="p">),</span><span class="s2">"sPer"</span><span class="p">,</span><span class="n">yy</span><span class="p">,</span><span class="s2">"."</span><span class="p">,</span><span class="nf">c</span><span class="p">(</span><span class="s1">'m'</span><span class="p">,</span><span class="s1">'v'</span><span class="p">,</span><span class="s1">'s'</span><span class="p">,</span><span class="s1">'k'</span><span class="p">),</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">big</span><span class="p">)</span><span class="o"><-</span><span class="nf">c</span><span class="p">(</span><span class="n">colnames</span><span class="p">(</span><span class="n">big</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">colnames</span><span class="p">(</span><span class="n">big</span><span class="p">))</span><span class="m">-4</span><span class="p">)],</span><span class="n">paste</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s2">"_id"</span><span class="p">,</span><span class="s2">""</span><span class="p">,</span><span class="n">xPer</span><span class="p">),</span><span class="s2">"sPer"</span><span class="p">,</span><span class="n">yy</span><span class="p">,</span><span class="s2">"."</span><span class="p">,</span><span class="nf">c</span><span class="p">(</span><span class="s1">'m'</span><span class="p">,</span><span class="s1">'v'</span><span class="p">,</span><span class="s1">'s'</span><span class="p">,</span><span class="s1">'k'</span><span class="p">),</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">bidderCharacteristics</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">bidderCharacteristics</span><span class="p">,</span><span class="n">big</span><span class="p">,</span><span class="n">by.x</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">by.y</span><span class="o">=</span><span class="s1">'row.names'</span><span class="p">,</span><span class="n">all.x</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span></code></pre></figure>
<h3 id="clean-up">Clean up</h3>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">rm</span><span class="p">(</span><span class="n">list</span><span class="o">=</span><span class="n">ls</span><span class="p">(</span><span class="n">all</span><span class="o">=</span><span class="nb">T</span><span class="p">)[</span><span class="o">!</span><span class="p">(</span><span class="n">ls</span><span class="p">(</span><span class="n">all</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="o">%in%</span><span class="nf">c</span><span class="p">(</span><span class="s1">'bidderCharacteristics'</span><span class="p">,</span><span class="s1">'oversampleOnes'</span><span class="p">,</span><span class="s1">'removeLowVarianceCols'</span><span class="p">,</span><span class="s1">'removeZeroVarianceCols'</span><span class="p">,</span><span class="s1">'wtc'</span><span class="p">,</span><span class="s1">'test'</span><span class="p">,</span><span class="s1">'train'</span><span class="p">,</span><span class="s1">'.Random.seed'</span><span class="p">))])</span></code></pre></figure>
<h2 id="modelling">Modelling</h2>
<h3 id="choice-of-model">Choice of model</h3>
<p>To set a benchmark, I first tried modelling the bot probability using logistic regression. As expected, this wasn’t particularly effective. RandomForest was the next model I tried. I also experimented with adaboost and extraTrees from the caret package, as well as xgboost, but none of these was able to outperform RandomForest.</p>
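<p>As a concrete illustration of the benchmark step, fitting a logistic regression in base R looks like this (a minimal sketch on toy data; the feature names and counts are illustrative, not the engineered features from the post):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy stand-in for the bidder-level feature table: 200 bidders, 3 features,
# binary outcome (1 = bot). Purely illustrative.
set.seed(1)
toy = data.frame(f1 = rnorm(200), f2 = rnorm(200), f3 = rnorm(200))
toy$outcome = rbinom(200, 1, 0.05)

# Logistic regression benchmark
logit = glm(outcome ~ ., data = toy, family = binomial)

# Predicted bot probabilities, one per bidder
p = predict(logit, type = "response")</code></pre></figure>
<p>The tree-based models were fit through their respective packages’ analogous formula interfaces.</p>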
<h3 id="addressing-class-imbalance">Addressing class imbalance</h3>
<p>In the training set of roughly 2,000 bidders, only about 100 were bots. I found I was able to improve both local cross-validation (CV) and public leaderboard scores by over-sampling the bots prior to training the model. I achieved this with a simple R function:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">oversampleOnes</span><span class="o"><-</span><span class="k">function</span><span class="p">(</span><span class="n">dataIn</span><span class="p">,</span><span class="n">m</span><span class="p">){</span><span class="w">
</span><span class="n">out</span><span class="o"><-</span><span class="n">dataIn</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">z</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">m</span><span class="p">){</span><span class="w">
</span><span class="n">out</span><span class="o"><-</span><span class="n">rbind</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="n">dataIn</span><span class="p">[</span><span class="n">dataIn</span><span class="o">$</span><span class="n">outcome</span><span class="o">==</span><span class="m">1</span><span class="p">,])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">out</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
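<p>Applied to a toy training set, the function appends the bot rows <code>m</code> extra times while leaving the humans untouched (the function is repeated here so the snippet runs standalone; the counts are illustrative):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">oversampleOnes = function(dataIn, m) {
  out = dataIn
  for (z in 1:m) {
    out = rbind(out, dataIn[dataIn$outcome == 1, ])
  }
  return(out)
}

# Toy set: 5 bots among 100 bidders
toy = data.frame(outcome = c(rep(1, 5), rep(0, 95)))
os = oversampleOnes(toy, 8)
nrow(os)         # 100 + 8 * 5 = 140 rows
sum(os$outcome)  # 45 bots after oversampling</code></pre></figure>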
<h3 id="local-training-and-testing---cross-validation">Local training and testing - cross validation</h3>
<p>While I did experiment with the cross-validation features packaged with caret, I preferred the flexibility of my own CV routine. I used 5- or 10-fold CV, depending on how long I was willing to wait for results (I usually used 10-fold).</p>
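<p>The fold assignment at the heart of such a hand-rolled routine is simple in base R (a minimal sketch; the bidder count is illustrative):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Randomly assign each of n bidders to one of k folds of near-equal size
n = 2000; k = 10
set.seed(42)
folds = sample(rep(1:k, length.out = n))

# For each i in 1:k: train on rows where folds != i, score rows where folds == i
table(folds)  # 200 bidders per fold</code></pre></figure>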
<p>I found my public leaderboard scores were usually higher than my CV scores, which I thought was a bit strange. I was probably overfitting the public leaderboard to some extent, or just getting lucky, because my final score on the private leaderboard ended up being much closer to my typical CV performance.</p>
<p>The below loop gives the general gist of how I trained, tested and tuned my RF model using the training set:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1">#Create a list of 'evals' to store the evaluation and parameters</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="n">exists</span><span class="p">(</span><span class="s1">'i'</span><span class="p">)){</span><span class="n">evals</span><span class="o">=</span><span class="nf">list</span><span class="p">();</span><span class="n">i</span><span class="o">=</span><span class="m">0</span><span class="p">}</span><span class="w">
</span><span class="c1">#Use all 8 cpu cores</span><span class="w">
</span><span class="n">cores</span><span class="o">=</span><span class="m">8</span><span class="w">
</span><span class="n">num.chunk</span><span class="o">=</span><span class="m">8</span><span class="w">
</span><span class="c1">#os sets how many times to oversample the bots. os = 8 seemed to give best performance - this meant the entire training set went from having 100 to 900 bots.</span><span class="w">
</span><span class="n">os</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="w">
</span><span class="n">total.tree</span><span class="o">=</span><span class="m">3200</span><span class="p">;</span><span class="n">avg.tree</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">total.tree</span><span class="o">/</span><span class="n">num.chunk</span><span class="p">)</span><span class="w">
</span><span class="c1">#Iterations is how many CV repeats to do... usually would just set high and stop the model at some point.</span><span class="w">
</span><span class="n">iterations</span><span class="o">=</span><span class="m">1000</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">iterAtion</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">3</span><span class="o">:</span><span class="n">iterations</span><span class="p">){</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="n">iterAtion</span><span class="p">)</span><span class="w">
</span><span class="c1">#Initialise samples for 10-fold cross validation</span><span class="w">
</span><span class="n">cv</span><span class="o">=</span><span class="m">10</span><span class="w">
</span><span class="n">trainIdx</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">testIdx</span><span class="o"><-</span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">tmp</span><span class="o"><-</span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">trainChar</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">cv</span><span class="p">){</span><span class="w">
</span><span class="n">testIdx</span><span class="p">[[</span><span class="n">j</span><span class="p">]]</span><span class="o"><-</span><span class="n">tmp</span><span class="p">[</span><span class="nf">round</span><span class="p">((</span><span class="n">j</span><span class="m">-1</span><span class="p">)</span><span class="o">*</span><span class="nf">floor</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">trainChar</span><span class="p">)</span><span class="o">/</span><span class="n">cv</span><span class="p">)</span><span class="m">+1</span><span class="p">,</span><span class="m">0</span><span class="p">)</span><span class="o">:</span><span class="nf">min</span><span class="p">(</span><span class="nf">round</span><span class="p">(</span><span class="n">j</span><span class="o">*</span><span class="nf">floor</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">trainChar</span><span class="p">)</span><span class="o">/</span><span class="n">cv</span><span class="p">)</span><span class="m">+1</span><span class="p">,</span><span class="m">0</span><span class="p">),</span><span class="nf">length</span><span class="p">(</span><span class="n">tmp</span><span class="p">))]</span><span class="w">
</span><span class="n">trainIdx</span><span class="p">[[</span><span class="n">j</span><span class="p">]]</span><span class="o"><-</span><span class="n">tmp</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="o">!</span><span class="n">tmp</span><span class="o">%in%</span><span class="n">testIdx</span><span class="p">[[</span><span class="n">j</span><span class="p">]])]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">#Initialise multicore:</span><span class="w">
</span><span class="n">cl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">makeCluster</span><span class="p">(</span><span class="n">cores</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SOCK"</span><span class="p">);</span><span class="n">registerDoSNOW</span><span class="p">(</span><span class="n">cl</span><span class="p">)</span><span class="w">
</span><span class="c1">#These for loops were used for tuning RF parameters like mtry.</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">mtry</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">18</span><span class="p">)){</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">cvIdx</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">cv</span><span class="p">){</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">cvIdx</span><span class="p">)</span><span class="w">
</span><span class="n">rf_fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">foreach</span><span class="p">(</span><span class="n">ntree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">avg.tree</span><span class="p">,</span><span class="w"> </span><span class="n">num.chunk</span><span class="p">),</span><span class="w"> </span><span class="n">.combine</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">combine</span><span class="p">,</span><span class="w">
</span><span class="n">.packages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"randomForest"</span><span class="p">))</span><span class="w"> </span><span class="o">%dopar%</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">randomForest</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">oversampleOnes</span><span class="p">(</span><span class="n">trainChar</span><span class="p">[</span><span class="n">trainIdx</span><span class="p">[[</span><span class="n">cvIdx</span><span class="p">]],</span><span class="n">allVars</span><span class="p">],</span><span class="n">os</span><span class="p">)[,</span><span class="m">-1</span><span class="p">],</span><span class="w">
</span><span class="n">y</span><span class="o">=</span><span class="n">oversampleOnes</span><span class="p">(</span><span class="n">trainChar</span><span class="p">[</span><span class="n">trainIdx</span><span class="p">[[</span><span class="n">cvIdx</span><span class="p">]],</span><span class="n">allVars</span><span class="p">],</span><span class="n">os</span><span class="p">)[,</span><span class="m">1</span><span class="p">],</span><span class="w">
</span><span class="n">ntree</span><span class="o">=</span><span class="n">ntree</span><span class="p">,</span><span class="w"> </span><span class="n">mtry</span><span class="o">=</span><span class="n">mtry</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="c1">#Make and store predictions and variable importance vector</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">cvIdx</span><span class="o">==</span><span class="m">1</span><span class="p">){</span><span class="w">
</span><span class="n">imps</span><span class="o"><-</span><span class="n">importance</span><span class="p">(</span><span class="n">rf_fit</span><span class="p">,</span><span class="n">class</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">trainCharPredictions</span><span class="o"><-</span><span class="n">subset</span><span class="p">(</span><span class="n">trainChar</span><span class="p">,</span><span class="n">select</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="n">bidder_id</span><span class="p">,</span><span class="n">outcome</span><span class="p">))</span><span class="w">
</span><span class="n">trainCharPredictions</span><span class="o">$</span><span class="n">prediction.rf</span><span class="o"><-</span><span class="kc">NA</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">imp</span><span class="o"><-</span><span class="n">importance</span><span class="p">(</span><span class="n">rf_fit</span><span class="p">,</span><span class="n">class</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">imps</span><span class="o"><-</span><span class="n">imps</span><span class="o">+</span><span class="n">imp</span><span class="p">[</span><span class="n">match</span><span class="p">(</span><span class="n">rownames</span><span class="p">(</span><span class="n">imps</span><span class="p">),</span><span class="n">rownames</span><span class="p">(</span><span class="n">imp</span><span class="p">))]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">trainCharPredictions</span><span class="o">$</span><span class="n">prediction.rf</span><span class="p">[</span><span class="n">testIdx</span><span class="p">[[</span><span class="n">cvIdx</span><span class="p">]]]</span><span class="o"><-</span><span class="n">predict</span><span class="p">(</span><span class="n">rf_fit</span><span class="p">,</span><span class="w"> </span><span class="n">trainChar</span><span class="p">[</span><span class="n">testIdx</span><span class="p">[[</span><span class="n">cvIdx</span><span class="p">]],</span><span class="n">allVars</span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)[,</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"RF performance: "</span><span class="p">,</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">slot</span><span class="p">(</span><span class="n">performance</span><span class="p">(</span><span class="n">prediction</span><span class="p">(</span><span class="n">trainCharPredictions</span><span class="o">$</span><span class="n">prediction.rf</span><span class="p">,</span><span class="n">trainChar</span><span class="o">$</span><span class="n">outcome</span><span class="p">),</span><span class="w"> </span><span class="s2">"auc"</span><span class="p">),</span><span class="w"> </span><span class="s2">"y.values"</span><span class="p">)[[</span><span class="m">1</span><span class="p">]],</span><span class="m">3</span><span class="p">),</span><span class="w">
</span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">imps</span><span class="o"><-</span><span class="n">imps</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">imps</span><span class="p">[,</span><span class="m">1</span><span class="p">]),]</span><span class="w">
</span><span class="n">eval</span><span class="o"><-</span><span class="n">cbind</span><span class="p">(</span><span class="w">
</span><span class="n">os</span><span class="p">,</span><span class="n">mtry</span><span class="p">,</span><span class="n">cv</span><span class="p">,</span><span class="n">iterAtion</span><span class="p">,</span><span class="w">
</span><span class="n">slot</span><span class="p">(</span><span class="n">performance</span><span class="p">(</span><span class="n">prediction</span><span class="p">(</span><span class="n">trainCharPredictions</span><span class="o">$</span><span class="n">prediction.rf</span><span class="p">,</span><span class="n">trainCharPredictions</span><span class="o">$</span><span class="n">outcome</span><span class="p">),</span><span class="w"> </span><span class="s2">"auc"</span><span class="p">),</span><span class="w"> </span><span class="s2">"y.values"</span><span class="p">)[[</span><span class="m">1</span><span class="p">]],</span><span class="w">
</span><span class="n">total.tree</span><span class="p">,</span><span class="w">
</span><span class="nf">length</span><span class="p">(</span><span class="n">allVars</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">eval</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">eval</span><span class="p">)</span><span class="o"><-</span><span class="nf">c</span><span class="p">(</span><span class="s1">'os'</span><span class="p">,</span><span class="s1">'mtry'</span><span class="p">,</span><span class="s1">'cv folds'</span><span class="p">,</span><span class="s1">'seed'</span><span class="p">,</span><span class="s1">'RFAUC'</span><span class="p">,</span><span class="s1">'ntrees'</span><span class="p">,</span><span class="s1">'nvars'</span><span class="p">)</span><span class="w">
</span><span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="m">+1</span><span class="w">
</span><span class="n">evals</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">eval</span><span class="p">,</span><span class="n">imps</span><span class="p">,</span><span class="n">paste</span><span class="p">(</span><span class="n">allVars</span><span class="p">,</span><span class="n">collapse</span><span class="o">=</span><span class="s2">","</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">stopCluster</span><span class="p">(</span><span class="n">cl</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<h3 id="fitting-the-final-model">Fitting the final model</h3>
<p>After testing models out via cross validation, the below general code snippet was used to make submissions:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">mtry</span><span class="o">=</span><span class="m">18</span><span class="w">
</span><span class="n">total.tree</span><span class="o">=</span><span class="m">8000</span><span class="p">;</span><span class="n">avg.tree</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">total.tree</span><span class="o">/</span><span class="n">num.chunk</span><span class="p">)</span><span class="w">
</span><span class="n">os</span><span class="o">=</span><span class="m">8</span><span class="w">
</span><span class="n">cl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">makeCluster</span><span class="p">(</span><span class="n">cores</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SOCK"</span><span class="p">);</span><span class="n">registerDoSNOW</span><span class="p">(</span><span class="n">cl</span><span class="p">)</span><span class="w">
</span><span class="n">rf_fit_full</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">foreach</span><span class="p">(</span><span class="n">ntree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">avg.tree</span><span class="p">,</span><span class="w"> </span><span class="n">num.chunk</span><span class="p">),</span><span class="w"> </span><span class="n">.combine</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">combine</span><span class="p">,</span><span class="w">
</span><span class="n">.packages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"randomForest"</span><span class="p">))</span><span class="w"> </span><span class="o">%dopar%</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">randomForest</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">oversampleOnes</span><span class="p">(</span><span class="n">trainChar</span><span class="p">[,</span><span class="n">allVars</span><span class="p">],</span><span class="n">os</span><span class="p">)[,</span><span class="m">-1</span><span class="p">],</span><span class="w">
</span><span class="n">y</span><span class="o">=</span><span class="n">oversampleOnes</span><span class="p">(</span><span class="n">trainChar</span><span class="p">[,</span><span class="n">allVars</span><span class="p">],</span><span class="n">os</span><span class="p">)[,</span><span class="m">1</span><span class="p">],</span><span class="w">
</span><span class="n">ntree</span><span class="o">=</span><span class="n">ntree</span><span class="p">,</span><span class="w"> </span><span class="n">mtry</span><span class="o">=</span><span class="n">mtry</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="c1">#Change to just trainCharRestricted to use entire training set.</span><span class="w">
</span><span class="n">stopCluster</span><span class="p">(</span><span class="n">cl</span><span class="p">)</span><span class="w">
</span><span class="n">testChar</span><span class="o">$</span><span class="n">prediction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">rf_fit_full</span><span class="p">,</span><span class="w"> </span><span class="n">testChar</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)[,</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="c1">#Give bidders not in the bids dataset the average probability of being a bot.</span><span class="w">
</span><span class="n">prob</span><span class="o"><-</span><span class="nf">sum</span><span class="p">(</span><span class="n">train</span><span class="o">$</span><span class="n">outcome</span><span class="p">)</span><span class="o">/</span><span class="n">nrow</span><span class="p">(</span><span class="n">train</span><span class="p">)</span><span class="w">
</span><span class="n">outPred</span><span class="o"><-</span><span class="n">merge</span><span class="p">(</span><span class="n">testChar</span><span class="p">,</span><span class="n">test</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="n">all.y</span><span class="o">=</span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">outPred</span><span class="o"><-</span><span class="n">outPred</span><span class="p">[,</span><span class="nf">c</span><span class="p">(</span><span class="s1">'bidder_id'</span><span class="p">,</span><span class="s1">'prediction'</span><span class="p">)]</span><span class="w">
</span><span class="n">outPred</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">outPred</span><span class="p">[,</span><span class="m">2</span><span class="p">])),</span><span class="m">2</span><span class="p">]</span><span class="o"><-</span><span class="n">prob</span><span class="w">
</span><span class="n">write.csv</span><span class="p">(</span><span class="n">outPred</span><span class="p">,</span><span class="n">file</span><span class="o">=</span><span class="s1">'submission.csv'</span><span class="p">,</span><span class="n">row.names</span><span class="o">=</span><span class="nb">F</span><span class="p">)</span></code></pre></figure>
<h3 id="variable-selection-for-the-final-model">Variable selection for the final model</h3>
<p>I didn’t end up using all of the variables generated during the feature engineering stage in my final model (there were some 1400 in total), though some of my best-scoring models included as many as 1200 features. The ‘core’ model had around 315 predictor variables. These particular 315 came out of various tests using RF importance, balanced with my findings on what simply seemed to work. When I added the mean/variance/skewness/kurtosis set of features, performance seemed to degrade, so a number of these features ended up being excluded. I tried to address the high dimensionality problem in various ways - reducing sets of highly-correlated variables, and removing variables with low RF importance scores - but none of these seemed to really improve performance. The takeaway for me was that random forests seem to be very effective at extracting the relevant information from the variables you give them, without being confounded by superfluous or barely-relevant ones. I’m not sure if this is always the case, but it seemed to be so here - removing variables that seemed like they should be useless in a statistical sense usually reduced model accuracy.</p>
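<p>For reference, the correlated-variable pruning I experimented with looked roughly like the following sketch (toy data frame and 0.95 cutoff are illustrative, not the actual competition setup; caret’s <code>findCorrelation</code> does something similar):</p>

```r
# Drop one variable out of each highly-correlated pair (illustrative sketch)
set.seed(1)
x1 <- rnorm(100)
feats <- data.frame(
  a = x1,
  b = x1 + rnorm(100, sd = 0.01),  # nearly identical to 'a'
  c = rnorm(100)                   # independent
)

corMat <- abs(cor(feats))
corMat[upper.tri(corMat, diag = TRUE)] <- 0  # consider each pair only once
toDrop <- names(which(apply(corMat, 2, max) > 0.95))
reduced <- feats[, setdiff(names(feats), toDrop)]
ncol(reduced)  # 2: one of 'a'/'b' has been removed, 'c' is kept
```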
<p>If you’re curious, here is the vector of ‘best’ variables that I used in the final model (50 URL principal components variables are all that’s excluded from this list):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">allVars</span><span class="o"><-</span><span class="nf">c</span><span class="p">(</span><span class="s2">"outcome"</span><span class="p">,</span><span class="s2">"numLastBids"</span><span class="p">,</span><span class="s2">"timeActive"</span><span class="p">,</span><span class="s2">"bidsPerTime"</span><span class="p">,</span><span class="s2">"meanTimeBwBids"</span><span class="p">,</span><span class="s2">"bidsPerhour.m"</span><span class="p">,</span><span class="s2">"bidsPerhour.v"</span><span class="p">,</span><span class="s2">"bidsPerhour.s"</span><span class="p">,</span><span class="s2">"bidsPerhour.k"</span><span class="p">,</span><span class="s2">"auctionsPerhour.m"</span><span class="p">,</span><span class="s2">"auctionsPerhour.v"</span><span class="p">,</span><span class="s2">"auctionsPerhour.s"</span><span class="p">,</span><span class="s2">"auctionsPerhour.k"</span><span class="p">,</span><span class="s2">"urlsPerhour.m"</span><span class="p">,</span><span class="s2">"urlsPerhour.v"</span><span class="p">,</span><span class="s2">"urlsPerhour.s"</span><span class="p">,</span><span class="s2">"urlsPerhour.k"</span><span class="p">,</span><span class="s2">"ipsPerhour.m"</span><span class="p">,</span><span class="s2">"ipsPerhour.v"</span><span class="p">,</span><span class="s2">"ipsPerhour.s"</span><span class="p">,</span><span class="s2">"ipsPerhour.k"</span><span class="p">,</span><span class="s2">"X0"</span><span class="p">,</span><span class="s2">"X1"</span><span class="p">,</span><span class="s2">"X2"</span><span class="p">,</span><span class="s2">"X3"</span><span class="p">,</span><span class="s2">"X4"</span><span class="p">,</span><span class="s2">"X5"</span><span class="p">,</span><span class="s2">"X6"</span><span class="p">,</span><span class="s2">"X7"</span><span class="p">,</span><span class="s2">"X8"</span><span class="p">,</span><span class="s2">"X9"</span><span 
class="p">,</span><span class="s2">"X10"</span><span class="p">,</span><span class="s2">"X11"</span><span class="p">,</span><span class="s2">"X12"</span><span class="p">,</span><span class="s2">"X13"</span><span class="p">,</span><span class="s2">"X14"</span><span class="p">,</span><span class="s2">"X15"</span><span class="p">,</span><span class="s2">"X16"</span><span class="p">,</span><span class="s2">"X17"</span><span class="p">,</span><span class="s2">"X18"</span><span class="p">,</span><span class="s2">"X19"</span><span class="p">,</span><span class="s2">"X20"</span><span class="p">,</span><span class="s2">"X21"</span><span class="p">,</span><span class="s2">"X22"</span><span class="p">,</span><span class="s2">"X23"</span><span class="p">,</span><span class="s2">"numURLs"</span><span class="p">,</span><span class="s2">"numDevices"</span><span class="p">,</span><span class="s2">"numBids"</span><span class="p">,</span><span class="s2">"numAuctions"</span><span class="p">,</span><span class="s2">"numIPs"</span><span class="p">,</span><span class="s2">"numCountries"</span><span class="p">,</span><span class="s2">"ipPop"</span><span class="p">,</span><span class="s2">"dvc.1"</span><span class="p">,</span><span class="s2">"dvc.2"</span><span class="p">,</span><span class="s2">"dvc.3"</span><span class="p">,</span><span class="s2">"dvc.4"</span><span class="p">,</span><span class="s2">"dvc.5"</span><span class="p">,</span><span class="s2">"dvc.6"</span><span class="p">,</span><span class="s2">"dvc.7"</span><span class="p">,</span><span class="s2">"dvc.8"</span><span class="p">,</span><span class="s2">"dvc.9"</span><span class="p">,</span><span class="s2">"dvc.10"</span><span class="p">,</span><span class="s2">"dvc.11"</span><span class="p">,</span><span class="s2">"dvc.12"</span><span class="p">,</span><span class="s2">"dvc.13"</span><span class="p">,</span><span class="s2">"dvc.14"</span><span class="p">,</span><span class="s2">"dvc.15"</span><span 
class="p">,</span><span class="s2">"dvc.16"</span><span class="p">,</span><span class="s2">"dvc.17"</span><span class="p">,</span><span class="s2">"dvc.18"</span><span class="p">,</span><span class="s2">"dvc.19"</span><span class="p">,</span><span class="s2">"dvc.20"</span><span class="p">,</span><span class="s2">"dvc.21"</span><span class="p">,</span><span class="s2">"dvc.22"</span><span class="p">,</span><span class="s2">"dvc.23"</span><span class="p">,</span><span class="s2">"dvc.24"</span><span class="p">,</span><span class="s2">"dvc.25"</span><span class="p">,</span><span class="s2">"dvc.26"</span><span class="p">,</span><span class="s2">"dvc.27"</span><span class="p">,</span><span class="s2">"dvc.28"</span><span class="p">,</span><span class="s2">"dvc.29"</span><span class="p">,</span><span class="s2">"dvc.30"</span><span class="p">,</span><span class="s2">"dvc.31"</span><span class="p">,</span><span class="s2">"dvc.32"</span><span class="p">,</span><span class="s2">"dvc.33"</span><span class="p">,</span><span class="s2">"dvc.34"</span><span class="p">,</span><span class="s2">"dvc.35"</span><span class="p">,</span><span class="s2">"dvc.36"</span><span class="p">,</span><span class="s2">"dvc.37"</span><span class="p">,</span><span class="s2">"dvc.38"</span><span class="p">,</span><span class="s2">"dvc.39"</span><span class="p">,</span><span class="s2">"dvc.41"</span><span class="p">,</span><span class="s2">"dvc.42"</span><span class="p">,</span><span class="s2">"dvc.43"</span><span class="p">,</span><span class="s2">"dvc.45"</span><span class="p">,</span><span class="s2">"dvc.46"</span><span class="p">,</span><span class="s2">"dvc.47"</span><span class="p">,</span><span class="s2">"dvc.49"</span><span class="p">,</span><span class="s2">"dvc.50"</span><span class="p">,</span><span class="s2">"dvc.51"</span><span class="p">,</span><span class="s2">"dvc.52"</span><span class="p">,</span><span class="s2">"dvc.53"</span><span class="p">,</span><span 
class="s2">"dvc.56"</span><span class="p">,</span><span class="s2">"dvc.57"</span><span class="p">,</span><span class="s2">"dvc.58"</span><span class="p">,</span><span class="s2">"dvc.59"</span><span class="p">,</span><span class="s2">"dvc.60"</span><span class="p">,</span><span class="s2">"dvc.61"</span><span class="p">,</span><span class="s2">"dvc.62"</span><span class="p">,</span><span class="s2">"dvc.63"</span><span class="p">,</span><span class="s2">"dvc.64"</span><span class="p">,</span><span class="s2">"dvc.65"</span><span class="p">,</span><span class="s2">"dvc.67"</span><span class="p">,</span><span class="s2">"dvc.69"</span><span class="p">,</span><span class="s2">"dvc.70"</span><span class="p">,</span><span class="s2">"dvc.71"</span><span class="p">,</span><span class="s2">"dvc.72"</span><span class="p">,</span><span class="s2">"dvc.73"</span><span class="p">,</span><span class="s2">"dvc.74"</span><span class="p">,</span><span class="s2">"dvc.75"</span><span class="p">,</span><span class="s2">"dvc.76"</span><span class="p">,</span><span class="s2">"dvc.77"</span><span class="p">,</span><span class="s2">"dvc.78"</span><span class="p">,</span><span class="s2">"dvc.79"</span><span class="p">,</span><span class="s2">"dvc.80"</span><span class="p">,</span><span class="s2">"dvc.81"</span><span class="p">,</span><span class="s2">"dvc.82"</span><span class="p">,</span><span class="s2">"dvc.83"</span><span class="p">,</span><span class="s2">"dvc.84"</span><span class="p">,</span><span class="s2">"dvc.85"</span><span class="p">,</span><span class="s2">"dvc.86"</span><span class="p">,</span><span class="s2">"dvc.87"</span><span class="p">,</span><span class="s2">"dvc.88"</span><span class="p">,</span><span class="s2">"dvc.90"</span><span class="p">,</span><span class="s2">"dvc.91"</span><span class="p">,</span><span class="s2">"dvc.92"</span><span class="p">,</span><span class="s2">"dvc.93"</span><span class="p">,</span><span class="s2">"dvc.94"</span><span 
class="p">,</span><span class="s2">"dvc.95"</span><span class="p">,</span><span class="s2">"dvc.96"</span><span class="p">,</span><span class="s2">"dvc.98"</span><span class="p">,</span><span class="s2">"dvc.99"</span><span class="p">,</span><span class="s2">"dvc.100"</span><span class="p">,</span><span class="s2">"dvc.101"</span><span class="p">,</span><span class="s2">"dvc.102"</span><span class="p">,</span><span class="s2">"dvc.104"</span><span class="p">,</span><span class="s2">"dvc.105"</span><span class="p">,</span><span class="s2">"dvc.107"</span><span class="p">,</span><span class="s2">"dvc.108"</span><span class="p">,</span><span class="s2">"dvc.109"</span><span class="p">,</span><span class="s2">"dvc.110"</span><span class="p">,</span><span class="s2">"dvc.111"</span><span class="p">,</span><span class="s2">"dvc.112"</span><span class="p">,</span><span class="s2">"dvc.113"</span><span class="p">,</span><span class="s2">"dvc.114"</span><span class="p">,</span><span class="s2">"dvc.116"</span><span class="p">,</span><span class="s2">"dvc.117"</span><span class="p">,</span><span class="s2">"dvc.118"</span><span class="p">,</span><span class="s2">"dvc.119"</span><span class="p">,</span><span class="s2">"dvc.120"</span><span class="p">,</span><span class="s2">"dvc.122"</span><span class="p">,</span><span class="s2">"dvc.123"</span><span class="p">,</span><span class="s2">"dvc.124"</span><span class="p">,</span><span class="s2">"dvc.125"</span><span class="p">,</span><span class="s2">"dvc.126"</span><span class="p">,</span><span class="s2">"dvc.127"</span><span class="p">,</span><span class="s2">"dvc.128"</span><span class="p">,</span><span class="s2">"dvc.129"</span><span class="p">,</span><span class="s2">"dvc.130"</span><span class="p">,</span><span class="s2">"dvc.131"</span><span class="p">,</span><span class="s2">"dvc.132"</span><span class="p">,</span><span class="s2">"dvc.133"</span><span class="p">,</span><span class="s2">"dvc.134"</span><span 
class="p">,</span><span class="s2">"dvc.135"</span><span class="p">,</span><span class="s2">"dvc.137"</span><span class="p">,</span><span class="s2">"dvc.138"</span><span class="p">,</span><span class="s2">"dvc.139"</span><span class="p">,</span><span class="s2">"dvc.140"</span><span class="p">,</span><span class="s2">"dvc.141"</span><span class="p">,</span><span class="s2">"dvc.142"</span><span class="p">,</span><span class="s2">"dvc.143"</span><span class="p">,</span><span class="s2">"dvc.144"</span><span class="p">,</span><span class="s2">"dvc.146"</span><span class="p">,</span><span class="s2">"dvc.147"</span><span class="p">,</span><span class="s2">"dvc.148"</span><span class="p">,</span><span class="s2">"dvc.150"</span><span class="p">,</span><span class="s2">"dvc.153"</span><span class="p">,</span><span class="s2">"dvc.154"</span><span class="p">,</span><span class="s2">"dvc.155"</span><span class="p">,</span><span class="s2">"dvc.157"</span><span class="p">,</span><span class="s2">"dvc.159"</span><span class="p">,</span><span class="s2">"dvc.162"</span><span class="p">,</span><span class="s2">"dvc.163"</span><span class="p">,</span><span class="s2">"dvc.164"</span><span class="p">,</span><span class="s2">"dvc.166"</span><span class="p">,</span><span class="s2">"dvc.168"</span><span class="p">,</span><span class="s2">"dvc.169"</span><span class="p">,</span><span class="s2">"dvc.170"</span><span class="p">,</span><span class="s2">"dvc.171"</span><span class="p">,</span><span class="s2">"dvc.173"</span><span class="p">,</span><span class="s2">"dvc.174"</span><span class="p">,</span><span class="s2">"dvc.175"</span><span class="p">,</span><span class="s2">"dvc.176"</span><span class="p">,</span><span class="s2">"dvc.177"</span><span class="p">,</span><span class="s2">"dvc.179"</span><span class="p">,</span><span class="s2">"dvc.181"</span><span class="p">,</span><span class="s2">"dvc.182"</span><span class="p">,</span><span class="s2">"dvc.183"</span><span 
class="p">,</span><span class="s2">"dvc.184"</span><span class="p">,</span><span class="s2">"dvc.185"</span><span class="p">,</span><span class="s2">"dvc.186"</span><span class="p">,</span><span class="s2">"dvc.187"</span><span class="p">,</span><span class="s2">"dvc.189"</span><span class="p">,</span><span class="s2">"dvc.190"</span><span class="p">,</span><span class="s2">"dvc.191"</span><span class="p">,</span><span class="s2">"dvc.192"</span><span class="p">,</span><span class="s2">"dvc.194"</span><span class="p">,</span><span class="s2">"dvc.195"</span><span class="p">,</span><span class="s2">"dvc.196"</span><span class="p">,</span><span class="s2">"dvc.197"</span><span class="p">,</span><span class="s2">"dvc.198"</span><span class="p">,</span><span class="s2">"dvc.199"</span><span class="p">,</span><span class="s2">"dvc.200"</span><span class="p">,</span><span class="s2">"dvc.201"</span><span class="p">,</span><span class="s2">"dvc.202"</span><span class="p">,</span><span class="s2">"dvc.203"</span><span class="p">,</span><span class="s2">"dvc.204"</span><span class="p">,</span><span class="s2">"dvc.205"</span><span class="p">,</span><span class="s2">"dvc.206"</span><span class="p">,</span><span class="s2">"dvc.207"</span><span class="p">,</span><span class="s2">"dvc.208"</span><span class="p">,</span><span class="s2">"dvc.209"</span><span class="p">,</span><span class="s2">"dvc.210"</span><span class="p">,</span><span class="s2">"dvc.211"</span><span class="p">,</span><span class="s2">"dvc.212"</span><span class="p">,</span><span class="s2">"dvc.213"</span><span class="p">,</span><span class="s2">"dvc.214"</span><span class="p">,</span><span class="s2">"dvc.215"</span><span class="p">,</span><span class="s2">"dvc.216"</span><span class="p">,</span><span class="s2">"dvc.217"</span><span class="p">,</span><span class="s2">"dvc.219"</span><span class="p">,</span><span class="s2">"dvc.220"</span><span class="p">,</span><span class="s2">"dvc.221"</span><span 
class="p">,</span><span class="s2">"dvc.222"</span><span class="p">,</span><span class="s2">"dvc.224"</span><span class="p">,</span><span class="s2">"dvc.225"</span><span class="p">,</span><span class="s2">"dvc.226"</span><span class="p">,</span><span class="s2">"dvc.227"</span><span class="p">,</span><span class="s2">"dvc.228"</span><span class="p">,</span><span class="s2">"dvc.229"</span><span class="p">,</span><span class="s2">"dvc.230"</span><span class="p">,</span><span class="s2">"dvc.231"</span><span class="p">,</span><span class="s2">"dvc.232"</span><span class="p">,</span><span class="s2">"dvc.233"</span><span class="p">,</span><span class="s2">"dvc.234"</span><span class="p">,</span><span class="s2">"dvc.235"</span><span class="p">,</span><span class="s2">"dvc.236"</span><span class="p">,</span><span class="s2">"dvc.237"</span><span class="p">,</span><span class="s2">"dvc.238"</span><span class="p">,</span><span class="s2">"finalBidRate"</span><span class="p">,</span><span class="s2">"cty.ae"</span><span class="p">,</span><span class="s2">"cty.ar"</span><span class="p">,</span><span class="s2">"cty.au"</span><span class="p">,</span><span class="s2">"cty.az"</span><span class="p">,</span><span class="s2">"cty.bd"</span><span class="p">,</span><span class="s2">"cty.bf"</span><span class="p">,</span><span class="s2">"cty.bn"</span><span class="p">,</span><span class="s2">"cty.br"</span><span class="p">,</span><span class="s2">"cty.ca"</span><span class="p">,</span><span class="s2">"cty.ch"</span><span class="p">,</span><span class="s2">"cty.cn"</span><span class="p">,</span><span class="s2">"cty.de"</span><span class="p">,</span><span class="s2">"cty.dj"</span><span class="p">,</span><span class="s2">"cty.ec"</span><span class="p">,</span><span class="s2">"cty.es"</span><span class="p">,</span><span class="s2">"cty.et"</span><span class="p">,</span><span class="s2">"cty.eu"</span><span class="p">,</span><span class="s2">"cty.fr"</span><span 
class="p">,</span><span class="s2">"cty.gt"</span><span class="p">,</span><span class="s2">"cty.id"</span><span class="p">,</span><span class="s2">"cty.in"</span><span class="p">,</span><span class="s2">"cty.it"</span><span class="p">,</span><span class="s2">"cty.ke"</span><span class="p">,</span><span class="s2">"cty.lk"</span><span class="p">,</span><span class="s2">"cty.lt"</span><span class="p">,</span><span class="s2">"cty.lv"</span><span class="p">,</span><span class="s2">"cty.mr"</span><span class="p">,</span><span class="s2">"cty.mx"</span><span class="p">,</span><span class="s2">"cty.my"</span><span class="p">,</span><span class="s2">"cty.ng"</span><span class="p">,</span><span class="s2">"cty.no"</span><span class="p">,</span><span class="s2">"cty.np"</span><span class="p">,</span><span class="s2">"cty.ph"</span><span class="p">,</span><span class="s2">"cty.pk"</span><span class="p">,</span><span class="s2">"cty.qa"</span><span class="p">,</span><span class="s2">"cty.ro"</span><span class="p">,</span><span class="s2">"cty.rs"</span><span class="p">,</span><span class="s2">"cty.ru"</span><span class="p">,</span><span class="s2">"cty.sa"</span><span class="p">,</span><span class="s2">"cty.sd"</span><span class="p">,</span><span class="s2">"cty.sg"</span><span class="p">,</span><span class="s2">"cty.th"</span><span class="p">,</span><span class="s2">"cty.tr"</span><span class="p">,</span><span class="s2">"cty.ua"</span><span class="p">,</span><span class="s2">"cty.uk"</span><span class="p">,</span><span class="s2">"cty.us"</span><span class="p">,</span><span class="s2">"cty.vn"</span><span class="p">,</span><span class="s2">"cty.za"</span><span class="p">,</span><span class="s2">"sumHrVar"</span><span class="p">,</span><span class="s2">"url.150"</span><span class="p">,</span><span class="s2">"ip.557"</span><span class="p">,</span><span class="s2">"ip.283"</span><span class="p">,</span><span class="s2">"ip.549"</span><span class="p">,</span><span 
class="s2">"urlPop"</span><span class="p">,</span><span class="s2">"auctionDurations"</span><span class="p">,</span><span class="s2">"meanURLPopularity"</span><span class="p">,</span><span class="s2">"meanIPPopularity"</span><span class="p">,</span><span class="s2">"meanCountryPopularity"</span><span class="p">,</span><span class="s2">"meanDevicePopularity"</span><span class="p">,</span><span class="s2">"countryPop"</span><span class="p">,</span><span class="s2">"auctionPop"</span><span class="p">,</span><span class="s2">"devicePop"</span><span class="p">,</span><span class="s2">"meanAuctionPopularity"</span><span class="p">,</span><span class="s2">"ipsPerdevice.m"</span><span class="p">,</span><span class="s2">"auctionsPerdevice.m"</span><span class="p">,</span><span class="s2">"auctionsPercountry.m"</span><span class="p">,</span><span class="s2">"urlsPerdevice.m"</span><span class="p">)</span></code></pre></figure>
<h2 id="lessons-learned-and-things-to-try-next-time">Lessons learned and things to try next time</h2>
<p>Here are some of the key things I learned from this competition, or things I might do differently next time:</p>
<ul>
<li>The public leaderboard can be misleading; next time I will conduct more rigorous testing using local cross-validation rather than ‘trusting’ the public leaderboard.</li>
<li>Feature engineering is far more important than model selection or parameter tuning (beyond a certain point). Next time I’ll focus more on feature extraction, with a clearer structure around my feature extraction and variable selection process.</li>
<li>Looking at some of the better-scoring participants’ solutions, it’s easy to see why I came 17th and not higher: their features were just a bit more logical and clever at picking out the bots, and the structure of their feature extraction was also clearer.</li>
<li>Oversampling to address class imbalances can improve accuracy (at least in an ROC AUC sense).</li>
<li>Next time I’ll save all my plots as I go so I can include some more pretty pictures in a write-up like this!</li>
</ul>
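<p>On the oversampling point above: this was not the exact code I used in the competition, but a minimal sketch of the idea — randomly duplicating minority-class rows until the classes are balanced — looks something like this (the helper name <code>oversample_minority</code> is just for illustration):</p>

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate rows (with replacement) so every class
    ends up with as many samples as the majority class."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.where(y == c)[0]
        # sample each class up to the majority-class count
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    rng.shuffle(idx)
    return X[idx], y[idx]
```

<p>Training on the rebalanced <code>(X, y)</code> stops the classifier from simply predicting the majority class, which is one plausible reason it helped in an ROC AUC sense here.</p>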
<h3 id="thanks-for-reading">Thanks for reading!</h3>liam schoneveldThe ‘Facebook Recruiting IV: Human or bot?’ competition just ended on Kaggle. For those unfamiliar with the competition, participants downloaded a table of about 7 million bids, which corresponded to another table of around 6,000 bidders. For 4,000 of those bidders, you had to estimate the probability that they were a human or bot, based on the remaining 2,000 bidders, whose bot status was given.