The BlogThe blog of Niels-Ole Kühl
https://www.niels-ole.com/
Sun, 28 Oct 2018 17:24:40 +0100Sun, 28 Oct 2018 17:24:40 +0100Jekyll v3.8.4100ms in additional latency cost you 1 % revenue, don't they?<p>When writing my master thesis about resource allocation in containers I wanted to show the relevance of performance
by citing something I always knew to be true: <strong>100ms in additional latency costs you 1 % revenue.</strong> Time to find a source for the references!</p>
<p>All sources I could find eventually lead back to Greg Linden. He worked at Amazon for 5 years, from 1997 to 2002, on the recommendation system.
He recounts this stat in a <a href="http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html">blog post</a> from 2006:
<strong>In A/B tests, we tried delaying the page in increments of 100 milliseconds and found that even very small delays would result in substantial and costly drops in revenue.</strong></p>
<p>Later that month he repeated this in <a href="https://web.archive.org/web/20081117195303/http://home.blarg.net/~glinden/StanfordDataMining.2006-11-29.ppt">a presentation</a>(PPT):</p>
<p><img src="/assets/greg-linden-slide.png" alt="+100ms latency leads to -1% sales at amazon" /></p>
<p>This A/B test must have happened sometime before 2002.
People run statistically insignificant A/B tests or draw wrong conclusions all the time, so a bare claim is not good enough as a source.</p>
<p>Greg Linden referenced a <a href="http://conferences.oreillynet.com/presentations/web2con06/mayer.ppt">presentation by Marissa Mayer</a> (PPT)
from the Web 2.0 conference in 2006 in which
she apparently talked about the A/B test from the slide above. Her main message: Instant feedback loops matter for product engagement.</p>
<p>I am rather surprised to see that the 1% claim has made it into the common knowledge of so many people as a fact, considering it is so weakly supported and outdated.</p>
<p>Discuss: <a href="https://news.ycombinator.com/item?id=18317170">Hacker News</a> <a href="https://www.reddit.com/r/perfmatters/comments/9rws7x/100ms_in_additional_latency_cost_you_1_revenue/">Reddit</a></p>
Sat, 27 Oct 2018 14:40:00 +0200
https://www.niels-ole.com/amazon/performance/2018/10/27/100ms-latency-1percent-revenue.html
https://www.niels-ole.com/amazon/performance/2018/10/27/100ms-latency-1percent-revenue.htmlamazonperformanceQuickStart for Let's Encrypt on Kubernetes<p>This post will show you how to use <a href="https://github.com/jetstack/cert-manager">cert-manager</a> to automatically create and use certificates with Let’s Encrypt on Kubernetes.
This is especially useful if you are looking for a successor to kube-lego, which is no longer maintained.
Take a look at the <a href="https://cert-manager.readthedocs.io/en/latest/getting-started/index.html">official docs</a> if you want more information about how each component works.</p>
<p>Prerequisites for this guide:</p>
<ul>
<li>Running Kubernetes cluster
<ul>
<li><a href="https://kubernetes.github.io/ingress-nginx/deploy/">Nginx Ingress Controller installed</a></li>
<li><a href="https://docs.helm.sh/using_helm/">Helm installed</a></li>
</ul>
</li>
<li>DNS entries pointed towards the node running your ingress controller</li>
</ul>
<p>First we need to install cert-manager with Helm:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm <span class="nb">install</span> <span class="nt">--name</span> cert-manager <span class="nt">--namespace</span> kube-system stable/cert-manager
</code></pre></div></div>
<p>Install a ClusterIssuer, which instructs cert-manager to use Let’s Encrypt. Replace the email address with yours:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">certmanager.k8s.io/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ClusterIssuer</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">letsencrypt-prod</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">default</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">acme</span><span class="pi">:</span>
    <span class="na">email</span><span class="pi">:</span> <span class="s">youremail@example.com</span>
    <span class="na">server</span><span class="pi">:</span> <span class="s">https://acme-v01.api.letsencrypt.org/directory</span>
    <span class="na">privateKeySecretRef</span><span class="pi">:</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">letsencrypt-prod</span>
    <span class="na">http01</span><span class="pi">:</span> <span class="pi">{}</span>
</code></pre></div></div>
<p>Enable usage of the issuer we just created:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm upgrade cert-manager stable/cert-manager <span class="nt">--namespace</span> kube-system <span class="nt">--set</span> ingressShim.defaultIssuerName<span class="o">=</span>letsencrypt-prod <span class="nt">--set</span> ingressShim.defaultIssuerKind<span class="o">=</span>ClusterIssuer
</code></pre></div></div>
<p>Create ingress resources for your services with <code class="highlighter-rouge">kubectl apply -f</code> e.g.:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">extensions/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Ingress</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">www-ingress</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">default</span>
  <span class="na">annotations</span><span class="pi">:</span>
    <span class="c1"># This needs to be set to enable automatic certificates</span>
    <span class="s">kubernetes.io/tls-acme</span><span class="pi">:</span> <span class="s2">"</span><span class="s">true"</span>
    <span class="s">kubernetes.io/ingress.class</span><span class="pi">:</span> <span class="s2">"</span><span class="s">nginx"</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">tls</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">www.example.com</span>
    <span class="c1"># The secret will be created automatically</span>
    <span class="na">secretName</span><span class="pi">:</span> <span class="s">www-tls</span>
  <span class="na">rules</span><span class="pi">:</span>
  <span class="c1"># The host must be identical to the above one</span>
  <span class="pi">-</span> <span class="na">host</span><span class="pi">:</span> <span class="s">www.example.com</span>
    <span class="na">http</span><span class="pi">:</span>
      <span class="na">paths</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">path</span><span class="pi">:</span> <span class="s">/</span>
        <span class="na">backend</span><span class="pi">:</span>
          <span class="c1"># The name of your service</span>
          <span class="na">serviceName</span><span class="pi">:</span> <span class="s">www-service</span>
          <span class="na">servicePort</span><span class="pi">:</span> <span class="s">80</span>
</code></pre></div></div>
<p>This is the minimal configuration that gives you SSL. Your website should now be accessible via https at <code class="highlighter-rouge">www.example.com</code>.</p>
<p><strong>Working with authentication</strong></p>
<p>If you use <code class="highlighter-rouge">nginx.ingress.kubernetes.io/auth-*</code> annotations you will need to whitelist the ACME challenge location, otherwise you cannot prove
to Let’s Encrypt that you operate the website. If you set up the Nginx ingress correctly there should be a ConfigMap for configuring nginx.
Modify it (e.g. by running <code class="highlighter-rouge">EDITOR=nano kubectl edit cm -n ingress-nginx nginx-configuration</code>) and add the key <code class="highlighter-rouge">no-auth-locations</code> so that it looks similar to this:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">data</span><span class="pi">:</span>
  <span class="na">no-auth-locations</span><span class="pi">:</span> <span class="s">/.well-known/acme-challenge</span>
  <span class="c1"># ...</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ConfigMap</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">annotations</span><span class="pi">:</span>
    <span class="c1"># ...</span>
  <span class="na">labels</span><span class="pi">:</span>
    <span class="na">app</span><span class="pi">:</span> <span class="s">ingress-nginx</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">nginx-configuration</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">ingress-nginx</span>
  <span class="c1"># ...</span>
</code></pre></div></div>
Thu, 17 May 2018 14:40:00 +0200
https://www.niels-ole.com/letsencrypt/cert-manager/nginx-ingress/kubernetes/2018/05/17/letsencrypt-kubernetes.html
https://www.niels-ole.com/letsencrypt/cert-manager/nginx-ingress/kubernetes/2018/05/17/letsencrypt-kubernetes.htmlletsencryptcert-managernginx-ingresskubernetesA fork on Github is no fork<p><strong>Github may block access to your repos and there is nothing you can do about it</strong></p>
<p>A few years ago I made a project with a friend and we collaborated on Github in his private repo. After we finished the project I forked it, to be able to still access it independently from him.</p>
<p>While I still have unlimited private repos, my friend let his Premium Account (Student) expire. The original repo is now inaccessible. This is expected.</p>
<p>When I recently needed a code snippet from that project I visited my fork and was greeted with:</p>
<p><img src="/assets/github-blocked.png" alt="Screenshot: My fork is inaccessible" /></p>
<p><strong>My</strong> fork is inaccessible to me because the <strong>upstream</strong> repo was disabled. Judging from the wording “root repository” this is no mistake.
If I wanted to access the repo, I needed to convince my friend to buy premium.</p>
<p>Maybe I just misinterpreted how a fork works? Let’s look at the <a href="https://help.github.com/articles/fork-a-repo/">documentation</a>:</p>
<blockquote>
<p>A <em>fork</em> is a copy of a repository.</p>
</blockquote>
<p>No mistake there, I should have access to my fork, because it is a <strong>copy</strong>, not just a reference. So maybe I have recourse against Github, as I legitimately should have access to that repo?</p>
<p>From their <a href="https://help.github.com/articles/github-terms-of-service/#3-github-may-terminate">terms of service</a>:</p>
<blockquote>
<p>GitHub has the right to suspend or terminate your access to all or any part of the Website at any time, with or without cause, with or without notice, effective immediately. GitHub reserves the right to refuse service to anyone for any reason at any time.</p>
</blockquote>
<p>Not so much.</p>
<p>Luckily I found an old local copy of my project, but this taught me not to rely on Github as the only storage for code.
I haven’t tried contacting customer support, but as this appears to be official policy I would not expect a change there.</p>
<p><strong>UPDATE:</strong> Someone pointed me to this <a href="https://help.github.com/articles/what-happens-to-forks-when-a-repository-is-deleted-or-changes-visibility/">help article</a>, so it actually is documented, just not on that page. <a href="https://help.github.com/articles/about-forks/">Another page</a> also mentions this:</p>
<blockquote>
<p>Private forks inherit the permissions structure of the upstream or parent repository</p>
</blockquote>
<p>While there is a stated reasoning behind this decision (“This helps owners of private repositories maintain control over their code.”),
I think it is a strawman argument: as long as collaborators can clone a repo, its owner has no real control over the distribution of the code, so the current behaviour is just user-hostile.</p>
<p>Discuss on <a href="https://news.ycombinator.com/item?id=16600219">Hacker News</a></p>
Fri, 16 Mar 2018 14:15:00 +0100
https://www.niels-ole.com/ownership/2018/03/16/github-forks.html
https://www.niels-ole.com/ownership/2018/03/16/github-forks.htmlownershipMachine Intelligence I - Learning Notes<p>This semester’s learning notes are about supervised learning. The previous semester was about <a href="/machine/learning/intelligence/unsupervised/2017/09/20/unsupervised-methods.html">unsupervised learning</a>.</p>
<h1 id="performance-measurement"><a href="/machine/learning/intelligence/supervised/2018/02/17/performance-measures.html">Performance Measurement</a></h1>
<h1 id="neural-networks"><a href="/machine/learning/intelligence/supervised/2018/02/17/neural-networks.html">Neural Networks</a></h1>
<h1 id="support-vector-machines"><a href="/machine/learning/intelligence/supervised/2018/02/17/support-vector-machines.html">Support Vector Machines</a></h1>
<h1 id="bayesian-networks"><a href="/machine/learning/intelligence/supervised/2018/01/11/bayesian-networks.html">Bayesian Networks</a></h1>
<h1 id="reinforcement-learning"><a href="/machine/learning/intelligence/supervised/2018/02/26/reinforcement-learning.html">Reinforcement Learning</a></h1>
<p><em><a href="/machine/learning/intelligence/supervised/2018/01/11/statistical-learning-theory.html">Statistical Learning Theory</a></em>(Excluded from the exam and therefore neglected here)</p>
Mon, 26 Feb 2018 15:31:10 +0100
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/26/mi1-index.html
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/26/mi1-index.htmlmachinelearningintelligencesupervisedReinforcement Learning<p>These are exam preparation notes, subpar in quality and certainly not of divine quality.</p>
<p>See the <a href="/machine/learning/intelligence/supervised/2018/02/26/mi1-index.html">index with all articles in this series</a></p>
<hr />
<p>In reinforcement learning an actor is in a world where she can perform different actions and perceive the environment.
Sometimes there may be rewards. Reinforcement learning is about choosing a policy from which to derive actions that maximize the reward.
Just like in the real world, many rewards require a lot of foresight.</p>
<p>The data is provided in triplets of</p>
<script type="math/tex; mode=display">(\underline x , \underline a, \underline r)</script>
<h2 id="markov-decision-process-mdp">Markov Decision Process (MDP)</h2>
<p>In an MDP the actor is in a discrete state and has a discrete set of possible actions.
There is also a transition model (just because you picked an action does not mean you deterministically land in a specific state)
as well as a reward function (usually based on an action in a specific state; it can be nondeterministic, too).</p>
<h2 id="policy">Policy</h2>
<p>The policy determines how the agent behaves. It is often given as the probability of an action given the state:</p>
<script type="math/tex; mode=display">\pi(\underline a_k | \underline x_i)</script>
<h2 id="markov-chain">Markov chain</h2>
<p>A <strong>markov chain</strong> is a sequence of states and actions, where the next state is drawn from the transition model (see above).</p>
<p>In a markov chain all relevant information is expressed in the current state and there is no need to look back into the past.</p>
<p>A markov chain can also be expressed as a bipartite tree.</p>
<p><strong>Ergodicity</strong> is when a markov chain may return from any state to that state aperiodically.</p>
<h2 id="policy-evaluation">Policy Evaluation</h2>
<p>A <strong>value function</strong> is an estimate of the expected value of the policy in the <em>initial</em> state. You could determine it at
any other state, but the initial one makes most sense for policy evaluation.</p>
<p>It is an average over the markov chain, usually with a discount factor for future rewards. Unfortunately this is inefficient.</p>
<h2 id="bellman-equation">Bellman equation</h2>
<p>See <a href="https://www.tu-chemnitz.de/informatik/KI/scripts/ws0910/ml09_3.pdf">slide 29</a>.
As the value function contains the expected future discounted rewards, it can be written recursively:</p>
<script type="math/tex; mode=display">V^\pi(s) = E_\pi\{ R_t | s_t = s\}\\
= E_\pi\{r_{t+1}+\gamma V^\pi(s_{t+1})| s_t = s\}</script>
<h2 id="model-based-vs-model-free-approaches">Model-based vs model-free approaches</h2>
<p>In model based learning the model consists of the immediate reward <script type="math/tex">\underline r ^\pi</script> and the probability for the transition to the next state <script type="math/tex">\underline P ^\pi</script>.</p>
<p>These two models need to be established by the actor playing in the environment, visiting each state infinitely often.</p>
<p>This leads to this value iteration function:</p>
<script type="math/tex; mode=display">\tilde{\underline v}^{\pi(t+1)} = \underline r^\pi + \gamma \underline P^\pi \tilde{\underline v}^{\pi(t)}</script>
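<p>A minimal sketch of this iteration in plain Python, assuming the models are already known. The two-state chain, its rewards and the function name <code class="highlighter-rouge">evaluate_policy</code> are made up for illustration and are not from the lecture.</p>

```python
def evaluate_policy(r, P, gamma, sweeps=200):
    """Repeat v <- r + gamma * P v for a fixed policy."""
    n = len(r)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n))
             for i in range(n)]
    return v


# Hypothetical chain: state 0 pays reward 1, state 1 pays nothing and is absorbing.
r = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
v = evaluate_policy(r, P, gamma=0.9)
print(v)
```

<p>For this toy chain the fixed point can be checked by hand: the absorbing state has value 0, and the other state satisfies v = 1 + 0.9 * 0.5 * v, so v = 1 / 0.55.</p>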
<h3 id="temporal-differencetd-learning">Temporal Difference(TD) learning</h3>
<p>In reality these models of <script type="math/tex">\underline r ^\pi</script> and <script type="math/tex">\underline P ^\pi</script> are hard to come by.
Often the estimation of the policy evaluation has to happen online (also to reduce computational storage complexity???).</p>
<p>For this reason temporal difference learning is used, which keeps no model and only uses the immediate reward.</p>
<script type="math/tex; mode=display">\tilde{V}^{\pi}_{t+1}(\underline x ^{(t)}) =
\tilde{V}^{\pi}_{t}(\underline x ^{(t)})
+ \eta (r_t + \gamma
\tilde{V}^{\pi}_{t}(\underline x ^{(t+1)})-\tilde{V}^{\pi}_{t}(\underline x ^{(t)}))</script>
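<p>The update rule above could be sketched like this. The states <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code>, the rewards and the learning rate are hypothetical values chosen only to make the example run.</p>

```python
def td0_update(V, x, r, x_next, eta, gamma):
    """One TD(0) step: move V[x] towards r + gamma * V[x_next]."""
    td_error = r + gamma * V[x_next] - V[x]
    V[x] = V[x] + eta * td_error
    return td_error


V = {"A": 0.0, "B": 0.0}
# Hypothetical experience: A yields reward 1 and leads to B; B yields 0 and stays in B.
for _ in range(1000):
    td0_update(V, "A", 1.0, "B", eta=0.1, gamma=0.9)
    td0_update(V, "B", 0.0, "B", eta=0.1, gamma=0.9)
print(V)
```

<p>No transition or reward model is stored; the estimate for <code class="highlighter-rouge">A</code> simply drifts towards the observed reward plus the discounted estimate of the next state.</p>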
Mon, 26 Feb 2018 15:30:10 +0100
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/26/reinforcement-learning.html
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/26/reinforcement-learning.htmlmachinelearningintelligencesupervisedSupport Vector Machines<p>These are exam preparation notes, subpar in quality and certainly not of divine quality.</p>
<p>See the <a href="/machine/learning/intelligence/supervised/2018/02/26/mi1-index.html">index with all articles in this series</a></p>
<hr />
<p><em>If you are stuck, read <a href="https://en.wikipedia.org/wiki/Support_vector_machine#Hard-margin">Wikipedia</a> in parallel.</em></p>
<p>The goal of SVMs is to divide two groups with a line that separates the data points as clearly as possible.</p>
<p>There are two cases:</p>
<ul>
<li>Data points can be cleanly split into their classes</li>
<li>At least some data points overlap making it impossible to lay a line between datapoints.</li>
</ul>
<h2 id="clean-split">Clean split</h2>
<p>With an SVM you want to make a binary classification of points and predict class membership.</p>
<p><img src="/assets/Svm_separating_hyperplanes.svg" alt="two point clouds separated by multiple candidate lines" /></p>
<p><em>From ZackWeinberg <a href="https://en.wikipedia.org/wiki/Support_vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg">Wikipedia</a> CC BY_SA 3.0</em></p>
<p>The label is assigned by calculating on which side of a hyperplane a point lies.</p>
<script type="math/tex; mode=display">y = sign(\underline w^T \underline x + b)</script>
<p>Where <script type="math/tex">\underline w</script> and <script type="math/tex">b</script> are your set of parameters for a linear connectionist neuron.</p>
<p>Our goal is to find a line that cuts as “cleanly” as possible through the two groups. Finding lines that separate the two groups</p>
<script type="math/tex; mode=display">\underline w^T \underline x + b = 0</script>
<p>is ambiguous as can be seen in the graphic above (H2 and H3). In this graphic H3 is the optimal solution.
How do we maximize the distance to the two point clouds?</p>
<script type="math/tex; mode=display">\underset{\alpha = 1,...,p}{min} | \underline w ^T \underline x^{(\alpha)}+b| \overset{!}{=}1</script>
<p>What we thereby do is maximize the distance of the separating line to the closest points. This is very sensitive to outliers, as the
closest points determine the parameters of the line.</p>
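<p>A small sketch of the decision rule and the resulting margin, assuming hand-picked parameters. The values of <code class="highlighter-rouge">w</code>, <code class="highlighter-rouge">b</code> and the three points are invented for illustration; mapping a score of exactly 0 to +1 is also an assumption.</p>

```python
import math


def decision(w, b, x):
    """Signed score w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class label from the side of the hyperplane (0 counts as +1 here)."""
    return 1 if decision(w, b, x) >= 0 else -1


def margin(w):
    """Distance from the hyperplane to the closest point, given min |w^T x + b| = 1."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters and labelled points.
w, b = [1.0, 1.0], -3.0
points = [([1.0, 1.0], -1), ([2.0, 2.0], 1), ([3.0, 3.0], 1)]

closest = min(abs(decision(w, b, x)) for x, _ in points)
print(closest, margin(w))
```

<p>Here the closest points have a score of exactly 1 in absolute value, so the normalization condition holds and the margin is 1 / ||w||.</p>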
<h3 id="optimization-problems">Optimization problems</h3>
<ul>
<li>
<p>Primal</p>
<script type="math/tex; mode=display">d_w = \frac {1}{||\underline w || } \overset {!}{=} max</script>
</li>
<li>
<p>Dual: Provides lower bound to the solution of the primal problem</p>
<script type="math/tex; mode=display">TODO</script>
</li>
</ul>
<h2 id="unclean-split---c-svm">Unclean split - C-SVM</h2>
<p>When we cannot put a straight line through the data points, we cannot only optimize for the above, but need to add a regularization term. (Kind of adding prior knowledge)</p>
<script type="math/tex; mode=display">\frac{1}{2}||\underline w || ^2 + \frac {C}{p}\sum^p_{\alpha=1}\varphi_\alpha \overset{!}{=}min</script>
<p>with</p>
<ul>
<li><em>C</em> regularization parameter</li>
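<li><script type="math/tex">\varphi_\alpha</script> slack variable: the error by which data point <script type="math/tex">\alpha</script> violates the margin (it enters the sum linearly, not squared)</li>
</ul>
<h2 id="kernels">Kernels</h2>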
<p>Kernels are often used to identify more complex shapes.
Typical kernel functions are:</p>
<ul>
<li>
<p>polynomial of degree d</p>
</li>
<li>
<p>RBF kernel with range <script type="math/tex">\sigma</script></p>
</li>
<li>
<p>neural network (excluded from exam)</p>
</li>
<li>
<p>plummer kernel (excluded from exam)</p>
</li>
</ul>
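<p>The first two kernels from the list could look like this in plain Python. The offset <code class="highlighter-rouge">c</code> in the polynomial kernel and the exact scaling of the RBF exponent are common conventions and assumptions here, not taken from the lecture.</p>

```python
import math


def polynomial_kernel(x, y, d, c=1.0):
    """Polynomial kernel of degree d: (x^T y + c)^d. The offset c is an assumption."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d


def rbf_kernel(x, y, sigma):
    """RBF kernel with range sigma: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], d=2))  # (11 + 1)^2 = 144.0
print(rbf_kernel([0.0, 0.0], [1.0, 0.0], sigma=1.0))
```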
<h2 id="multiclass-classification">Multiclass classification</h2>
<p>As SVMs were designed as binary classifiers, they cannot be directly used for multiclass classification.</p>
<p>What is usually done is combining multiple binary classifiers via a couple of perceptrons.</p>
Sat, 17 Feb 2018 15:30:10 +0100
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/17/support-vector-machines.html
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/17/support-vector-machines.htmlmachinelearningintelligencesupervisedPerformance measures<p>These are exam preparation notes, subpar in quality and certainly not of divine quality.</p>
<p>See the <a href="/machine/learning/intelligence/supervised/2018/02/26/mi1-index.html">index with all articles in this series</a></p>
<h2 id="choice-of-error-function">Choice of error function</h2>
<p>Usually squared error is used.</p>
<h2 id="cross-entropy">Cross-Entropy</h2>
<p><script type="math/tex">c</script> := different classes (classification / symbol representation)</p>
<p>From <a href="https://en.wikipedia.org/wiki/Cross_entropy">Wikipedia</a>:</p>
<blockquote>
<p>In information theory, the cross entropy between two probability distributions <em>p</em> and <em>q</em> over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution <em>q</em>, rather than the “true” distribution <em>p</em>.</p>
</blockquote>
<p>The delta between cross-entropy and true entropy can be measured by the <a href="/machine/learning/intelligence/unsupervised/2017/09/20/unsupervised-methods.html">KL-divergence</a>.</p>
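<p>A small worked example of that relationship, measured in bits. The two distributions <code class="highlighter-rouge">p</code> and <code class="highlighter-rouge">q</code> are made up so that the numbers come out exactly.</p>

```python
import math


def entropy(p):
    """True entropy of p in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Average number of bits when events from p are coded with a scheme optimized for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Extra bits paid for using q instead of p."""
    return cross_entropy(p, q) - entropy(p)


p = [0.5, 0.25, 0.25]   # hypothetical "true" distribution
q = [0.25, 0.25, 0.5]   # hypothetical "unnatural" distribution

print(entropy(p))           # 1.5
print(cross_entropy(p, q))  # 1.75
print(kl_divergence(p, q))  # 0.25
```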
<p>We have the problem that we do not know the true distribution, and also not
<script type="math/tex">P(y_T|\underline x)</script>
. (After all that is what we try to figure out in the first place)</p>
<script type="math/tex; mode=display">E^G \equiv -\sum^c_{k=1}\int{d\underline x}\,\underset{unknown}{P_{(\underline x )}P_{(C_k|\underline x )}} \ln( P_{(C_k|\underline x ; \underline w)})</script>
<p>Our expectation is that the following is true:</p>
<script type="math/tex; mode=display">\hat{E}^G = E^T = \frac{1}{q}\sum^q_{\beta=1}e^{(\beta)}</script>
<p>where <script type="math/tex">e^{(\beta)}</script> is already the squared error.</p>
<h2 id="empirical-risk-minimization-erm">Empirical Risk Minimization (ERM)</h2>
<p>Based on what we learned from Cross-Entropy, to reduce <script type="math/tex">E^G</script>, we need to reduce <script type="math/tex">E^T</script>.</p>
<h2 id="overfitting">Overfitting</h2>
<p>If your model complexity exceeds the complexity of the data, the only thing you start to fit to is the noise in the data.</p>
<p>In general you want to minimize the generalization error <script type="math/tex">E^G</script> which is usually approximated using a test error <script type="math/tex">E^T</script>.
In order to not overfit, you can use a separate validation dataset or do things like cross validation.</p>
<p>Overfitting &harr; Underfitting</p>
<h2 id="cross-validation">Cross Validation</h2>
<p>Cross Validation is usually done in an n-fold way. That means that the data is divided randomly into n similarly sized buckets.
The training is then done on all but one bucket (e.g. 9/10) and the testing on the remaining one (1/10). This is repeated until every bucket has served as the test set once.</p>
<p>Averaging the n errors (10 in this example), it is possible to estimate the generalization error without leaving training data out.
The final model is then often trained on the entire data set.</p>
<p>Aka <em>Resampling</em>.</p>
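<p>The bucket bookkeeping could be sketched as follows. The helper name <code class="highlighter-rouge">k_fold_splits</code> and the fixed seed are assumptions for this example; it only produces index lists and leaves the actual training and testing to the caller.</p>

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k similarly sized buckets.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n_samples=20, k=10))
for train, test in splits:
    print(len(train), len(test))
```

<p>Each sample ends up in exactly one test bucket, so averaging the per-fold errors uses every data point for testing once.</p>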
<h2 id="bias-variance-trade-off">Bias Variance Trade-Off</h2>
<p><a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Good Explanation of concept</a></p>
<p>The data points are the underlying model + noise.</p>
<script type="math/tex; mode=display">y_T = y^*_{(\underline x)} + \eta</script>
<script type="math/tex; mode=display">Err(x)=\text{Bias}^2+\text{Variance}+\text{Irreducible Error}</script>
<h2 id="stochastic-approximation-online-learning--gradient-descent">Stochastic approximation Online learning / Gradient Descent</h2>
<script type="math/tex; mode=display">\Delta w^{v'v}_{ij} = -\eta \frac{\partial E ^T _{[\underline w]}}{\partial w ^{v'v}_{ij}} =
-\eta \frac{1}{p}\sum^p_{\alpha=1}
{\frac{\partial e ^{(\alpha)} _{[\underline w]}}{\partial w ^{v'v}_{ij}}}</script>
<p>Where <script type="math/tex">e^{(\alpha)}</script> is the individual (squared) error for the data point <script type="math/tex">x^{(\alpha)}</script>.</p>
<p>Often gradient descent is done in mini batches.</p>
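<p>A minimal sketch of the update for a model with a single weight, <code class="highlighter-rouge">y = w * x</code>, and the squared error e = 1/2 (y_T - y)^2. The data and the step size are invented; real networks apply the same rule to every weight.</p>

```python
def gradient_step(w, xs, ys, eta):
    """One batch gradient descent step on the mean squared error of y = w * x."""
    p = len(xs)
    # d/dw of 1/2 (y_T - w x)^2 is (w x - y_T) * x, averaged over all p data points.
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / p
    return w - eta * grad


# Hypothetical data generated by y = 2 x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0
for _ in range(200):
    w = gradient_step(w, xs, ys, eta=0.1)
print(w)
```

<p>For mini batches the same function would be called with a small random subset of the data in each step instead of the full set.</p>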
<h2 id="improvements">Improvements</h2>
<p>Adapt step size: Big step size, when error decreases, small step size when error increases.</p>
<h2 id="regularization">Regularization</h2>
<script type="math/tex; mode=display">R_{[\underline w]}=E^T _{[\underline w]} + \lambda E^R_{[\underline w]} \overset{!}{=} min</script>
<p>The <script type="math/tex">E^R</script> brings some prior knowledge into the solution.</p>
<p>To paraphrase this, the <script type="math/tex">E^R</script> pulls the solution into a direction that will likely reduce regularization error (or otherwise reduce the risk of errors)</p>
<p><strong>Forms</strong>:</p>
<ul>
<li>Weight decay
<!--$$
E^R_{[\underline w ]}=\frac{1}{2} \sum_{i,j,v,v'} (w_{ij}^{vv'})^2
$$-->
in every step, subtract a fraction <script type="math/tex">\lambda w</script> from your w</li>
<li>Different Norms (0.1,.5,1,2,4)
General form:
<script type="math/tex">E^R=\sum_{ijv'v} | w_{ij}^{v'v}|^q</script></li>
</ul>
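<p>Both forms could be sketched as below. Folding the learning rate into the decay term, as well as the function names, are assumptions made for this example.</p>

```python
def norm_penalty(w, q):
    """General regularizer E^R = sum |w|^q over all weights."""
    return sum(abs(wi) ** q for wi in w)


def weight_decay_step(w, grad, eta, lam):
    """Gradient step on E^T + lam * E^R with E^R = 1/2 sum w^2, whose gradient is w."""
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad)]


w = [1.0, -2.0]
print(norm_penalty(w, 2))  # 5.0
print(norm_penalty(w, 1))  # 3.0
# With a zero data gradient, only the decay term acts and shrinks every weight.
print(weight_decay_step(w, grad=[0.0, 0.0], eta=0.1, lam=0.5))
```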
<h2 id="parametric-vs-nonparametric-algorithms">Parametric vs. Nonparametric Algorithms</h2>
<p><a href="https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/">Adapted from</a> Parametric Algorithms have a fixed number of parameters (and therefore model complexity). Examples:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Logistic Regression
Linear Discriminant Analysis
Perceptron
Naive Bayes
Simple Neural Networks
</code></pre></div></div>
<p>Nonparametric algorithms make few assumptions about the underlying data. Examples:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>k-Nearest Neighbors
Decision Trees like CART and C4.5
Support Vector Machines
</code></pre></div></div>
Sat, 17 Feb 2018 15:30:10 +0100
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/17/performance-measures.html
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/17/performance-measures.htmlmachinelearningintelligencesupervisedNeural Networks<p>These are exam preparation notes, subpar in quality and certainly not of divine quality.</p>
<p>See the <a href="/machine/learning/intelligence/supervised/2018/02/26/mi1-index.html">index with all articles in this series</a></p>
<h2 id="connectionist-neurons">Connectionist Neurons</h2>
<p>A neural network generally has a number of inputs <script type="math/tex">x_1...x_N</script> which are aggregated into <script type="math/tex">\underline x</script>.
At each node there is a transfer function <script type="math/tex">y_i</script> which turns the inputs according to weights <script type="math/tex">\underline w</script> into its own output.</p>
<p>A typical function would look like this. The part in the brackets is later referred to as <script type="math/tex">h</script></p>
<script type="math/tex; mode=display">y_i = f(\sum_j {w_{ij}x_j}-\theta_i)</script>
<p>Some of the functions being used for <script type="math/tex">f</script>:</p>
<p><strong>Logistic Function</strong>:</p>
<script type="math/tex; mode=display">f{(h)}=\frac{1}{1+exp(-\beta h)}</script>
<p><strong>Hyperbolic tangent</strong>:</p>
<script type="math/tex; mode=display">f{(h)}=tanh(\beta h)</script>
<p><strong>Linear neuron</strong>:</p>
<script type="math/tex; mode=display">f(h)=\beta h</script>
<p><strong>Binary neuron</strong>:</p>
<script type="math/tex; mode=display">f(h)=sign(h)</script>
<p>The <script type="math/tex">\beta</script> is a slope parameter.</p>
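<p>The transfer functions above, written out as a small sketch. The function names are chosen for this example, and treating <code class="highlighter-rouge">sign(0)</code> as +1 is an assumption.</p>

```python
import math


def logistic(h, beta=1.0):
    return 1.0 / (1.0 + math.exp(-beta * h))


def hyperbolic_tangent(h, beta=1.0):
    return math.tanh(beta * h)


def linear(h, beta=1.0):
    return beta * h


def binary(h):
    # sign(h); the value at h == 0 is a convention, +1 is assumed here.
    return 1.0 if h >= 0 else -1.0


def neuron(weights, inputs, theta, f):
    """Output y = f(h) with h = sum_j w_j x_j - theta."""
    h = sum(w * x for w, x in zip(weights, inputs)) - theta
    return f(h)


# Hypothetical neuron: h = 0.5 + 0.5 - 1.0 = 0, so the logistic output is 0.5.
print(neuron([1.0, 1.0], [0.5, 0.5], theta=1.0, f=logistic))
```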
<p><strong>TODO</strong>:Transformation</p>
<script type="math/tex; mode=display">\frac{1}{1+exp(-\beta h)} = \frac{1}{2}(tanh(\frac{\beta h}{2}) + 1)</script>
<p>The first weight or input is a <a href="https://stats.stackexchange.com/questions/185911/why-are-bias-nodes-used-in-neural-networks">bias node</a> which is always 1. It is not always included in equations as <script type="math/tex">w</script> or <script type="math/tex">x</script>. <em>The bias node could be seen as the length of <strong>w</strong>(???)</em></p>
<p>Reasons for nonlinear transfer functions:</p>
<ul>
<li>with linear transfer functions, multiple layers could be expressed as a single one (main reason)</li>
<li>sign function for classification problems (-1, +1)</li>
<li>logistic sigmoidal for probabilities (0..1)</li>
</ul>
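<p>The main reason can be checked numerically: two purely linear layers give the same output as one layer whose weight matrix is the product of the two. The matrices below are arbitrary example values.</p>

```python
def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


W1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, -1.0, 0.5]]                      # 3 hidden units -> 1 output
x = [2.0, 3.0]

two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)
print(two_layers, one_layer)  # [6.5] [6.5]
```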
<p>Important variables:</p>
<p><script type="math/tex">\theta</script> threshold<br />
<script type="math/tex">v \text{ and } v'</script> layer</p>
<h2 id="types-of-neural-networks">Types of Neural Networks</h2>
<ul>
<li>Recurrent Neural Networks<br />
There can be loops in the graph</li>
<li>Feedforward Neural Networks (DAG)<br />
no loops</li>
<li>Radial Basis Function Networks</li>
</ul>
<p><strong>Typical use case</strong>: Prediction of attributes. MLPs are universal approximators.</p>
<p>Always from <script type="math/tex">\mathbb{R}^N</script> to anything, really.</p>
<h2 id="nn-in-regression">NN in Regression</h2>
<p><a href="https://en.wikipedia.org/wiki/Hessian_matrix">Hessian matrix</a> is <strong>second</strong> derivative of <script type="math/tex">R^N\mapsto R</script></p>
<p><a href="/machine/learning/intelligence/unsupervised/2017/09/20/unsupervised-methods.html#jacobian-matrix">Jacobian matrix</a> is <strong>first</strong> derivative of <script type="math/tex">R^{N}\mapsto R^M</script>.</p>
<p>The Hessian is often too expensive to compute, and therefore gradient descent with backpropagation is often used instead of <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton’s Method</a>.</p>
<h2 id="generalization-error">Generalization Error</h2>
<p>Empirical Risk Minimization (ERM)</p>
<h2 id="test-error">Test Error</h2>
<p>The test error is reduced using gradient descent.</p>
<script type="math/tex; mode=display">w_{ij}^{v'v}(t+1)=w_{ij}^{v'v}(t) - \eta \frac{\partial E^T}{\partial w_{ij}^{v'v}}</script>
<p>where</p>
<script type="math/tex; mode=display">\frac{\partial E^T}{\partial w_{ij}^{v'v}} = \frac{1}{p}\sum^p_{\alpha=1}{\frac{\partial e^{(\alpha)}_{[\underline w]}}{\partial w_{ij}^{v'v}}}</script>
<p>The error is usually the quadratic error:</p>
<script type="math/tex; mode=display">e(y_T,\underline x) = \frac{1}{2}(y_T-y(\underline x))^2</script>
<p>The derivative is trivially:</p>
<script type="math/tex; mode=display">\frac{\partial e^{(\alpha)}}{\partial y(\underline x ^{(\alpha)}, \underline w )} = y(\underline x^{(\alpha)}, \underline w )-y_T^{(\alpha)}</script>
<p>and is later used in backpropagation.</p>
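<p>The update rule and the quadratic error can be sketched for the simplest possible case. This is only an illustration, assuming a single linear unit <code>y = X @ w</code> and made-up data; the function names and the learning rate are not from the lecture.</p>

```python
import numpy as np

def quadratic_error(y_true, y_pred):
    # e = 1/2 (y_T - y)^2 for every pattern
    return 0.5 * (y_true - y_pred) ** 2

def error_derivative(y_true, y_pred):
    # de/dy = y - y_T
    return y_pred - y_true

def gradient_descent_step(w, X, y_true, eta):
    y_pred = X @ w
    # Chain rule: de/dw = de/dy * dy/dw with dy/dw = x, averaged over all p patterns.
    grad = (error_derivative(y_true, y_pred)[:, None] * X).mean(axis=0)
    return w - eta * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # p = 50 patterns, 2 inputs (made-up data)
w_true = np.array([1.5, -0.5])        # hypothetical weights that generated the targets
y_true = X @ w_true

w = np.zeros(2)
initial_error = quadratic_error(y_true, X @ w).mean()
for _ in range(200):
    w = gradient_descent_step(w, X, y_true, eta=0.1)
final_error = quadratic_error(y_true, X @ w).mean()

print(initial_error, final_error)
```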
<h2 id="backpropagation">Backpropagation</h2>
<p>In backpropagation the weights of the neural network are adjusted so that the test error is reduced. This is achieved by</p>
<ul>
<li>Calculating the prediction</li>
<li>Calculating the test error</li>
<li>Going back layer by layer and calculating the delta each time</li>
</ul>
<p>It would be possible to compute the gradient by applying the chain rule to every weight separately. But that is a lot more computationally expensive than backpropagation, which reuses the deltas of later layers for the earlier layers.</p>
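<p>A minimal sketch of these steps, assuming a network with one <em>tanh</em> hidden layer, a linear output, no bias and the quadratic error. All names and sizes are chosen for illustration. The result is compared against a finite-difference gradient, which needs two extra forward passes per weight and is only used here as a check.</p>

```python
import numpy as np

def forward(W1, W2, x):
    h1 = W1 @ x          # total input of the hidden layer
    s1 = np.tanh(h1)     # hidden activity
    y = W2 @ s1          # linear output
    return s1, y

def error(W1, W2, x, y_true):
    _, y = forward(W1, W2, x)
    return 0.5 * np.sum((y_true - y) ** 2)

def backprop(W1, W2, x, y_true):
    s1, y = forward(W1, W2, x)                 # 1. calculate the prediction
    delta2 = y - y_true                        # 2. derivative of the quadratic error
    grad_W2 = np.outer(delta2, s1)
    # 3. go back one layer, reusing delta2; tanh'(h) = 1 - tanh(h)^2
    delta1 = (W2.T @ delta2) * (1.0 - s1 ** 2)
    grad_W1 = np.outer(delta1, x)
    return grad_W1, grad_W2

def numerical_gradient(W1, W2, x, y_true, eps=1e-6):
    # Central finite differences for every single weight of W1 (slow, check only).
    grad = np.zeros_like(W1)
    for idx in np.ndindex(W1.shape):
        W_plus = W1.copy()
        W_plus[idx] += eps
        W_minus = W1.copy()
        W_minus[idx] -= eps
        grad[idx] = (error(W_plus, W2, x, y_true) - error(W_minus, W2, x, y_true)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
x = rng.normal(size=2)
y_true = np.array([0.5])

grad_W1, grad_W2 = backprop(W1, W2, x, y_true)
numeric_W1 = numerical_gradient(W1, W2, x, y_true)
print(np.allclose(grad_W1, numeric_W1, atol=1e-5))
```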
<h2 id="regularization-in-deep-learning">Regularization in Deep Learning</h2>
<ul>
<li>Dropout randomly ignores neurons during training</li>
</ul>
<h2 id="architectures">Architectures</h2>
<h3 id="convolutional-layer">Convolutional layer</h3>
<p>A layer whose neurons are only connected to selected neurons of the previous layer. For example, this can be used in image recognition, having neurons only
be connected to some adjacent previous pixels (a tensor).</p>
<h3 id="spatialfeature-pooling">Spatial/Feature pooling</h3>
<p>Pooling tries to detect features in an image even though the image may be rotated, translated, etc.
There are then, e.g., three different detection units for a specific pattern, whose outputs are aggregated by a neuron with a <em>max()</em> function to recognize the correctly oriented feature.</p>
<h3 id="auto-encoders">Auto-Encoders</h3>
<p><em>Unfortunately excluded from the exam, therefore neglected here</em></p>
<p>Basically you take an image of what you want to recognize and push it through your network.
What you get is a “compressed” version of the image (there is a lot less information in the final layers). In the beginning of your training this will be just noise / randomness.
You then have another neural network (the same???) reconstruct the original image.</p>
<p>What is then possible is to compare the reconstruction to the original image and generate error values from it.</p>
<p>You thereby can train two neural networks to meaningfully abstract from images without having to have labelled images.</p>
<h2 id="time-series">Time Series</h2>
<p>In a time series it is often assumed that <em>y</em> depends on a short time window. Therefore there are convolutions, where some neurons can look “back” in time.</p>
<h3 id="recurrent-nn">Recurrent NN</h3>
<p>The neural network is “shifted” through time.
All previous inputs are summarized as a vector with a weight vector <script type="math/tex">W</script> containing the mapping on itself.</p>
<p><script type="math/tex">n_\alpha</script>: number of timesteps of sequence <script type="math/tex">\alpha</script></p>
<p>Cost function:</p>
<script type="math/tex; mode=display">E^T = \frac{1}{p}\sum^p_{\alpha = 1}\left(\frac{1}{n_\alpha}\sum^{n_\alpha}_{t=1}e^{(\alpha,t)}\right)</script>
<p>There is a vector <script type="math/tex">\underline W</script> which contains the weights that measure how much of the previous input should
be considered in the next timestep.</p>
<p>TODO: Are the weights <script type="math/tex">\underline W</script> different for each timestep?</p>
<h4 id="backpropagation-through-time">Backpropagation through time</h4>
<p>Works just like regular backpropagation.</p>
<ul>
<li>Assume all <script type="math/tex">\underline W^{(t)}</script> are independent</li>
<li>compute gradients with backpropagation</li>
<li>All computed gradients are averaged for weight update.</li>
</ul>
<script type="math/tex; mode=display">\Delta \underline W = -\eta \frac{\partial E^T}{\partial \underline W}</script>
<h4 id="exploding--vanishing-gradient">Exploding / Vanishing gradient</h4>
<script type="math/tex; mode=display">\underline W = \underline U \underline \Lambda \underline U ^\intercal</script>
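<p>One problem of RNNs is that activity often either vanishes or explodes over time, when
<script type="math/tex">|\lambda_i|\neq 1</script>
for the eigenvalues <script type="math/tex">\lambda_i</script> on the diagonal of <script type="math/tex">\underline \Lambda</script>: magnitudes below 1 shrink the activity in every timestep, magnitudes above 1 grow it.</p>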
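<p>The effect can be illustrated with a sketch that assumes a purely linear recurrence without input, so it ignores the transfer function. The matrix is built from chosen eigenvalues and a random orthogonal matrix, and the helper name is made up for this example.</p>

```python
import numpy as np

def repeated_activity_norm(eigenvalues, steps):
    """Norm of the activity after applying W = U diag(eigenvalues) U^T `steps` times."""
    rng = np.random.default_rng(0)
    n = len(eigenvalues)
    # Random orthogonal U from a QR decomposition.
    U, _ = np.linalg.qr(rng.normal(size=(n, n)))
    W = U @ np.diag(eigenvalues) @ U.T
    y = np.ones(n)
    for _ in range(steps):
        y = W @ y
    return np.linalg.norm(y)

# All eigenvalue magnitudes below 1: activity vanishes.
vanishing = repeated_activity_norm([0.5, 0.8, 0.9], steps=100)
# One eigenvalue magnitude above 1: activity explodes along that direction.
exploding = repeated_activity_norm([0.5, 0.8, 1.1], steps=100)

print(vanishing, exploding)
```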
<p><strong>Echo State Networks</strong></p>
<p>Echo state networks set W and U so that their <script type="math/tex">|\lambda_i|</script> is almost equal to r. (TODO why is r in range 1.3 to 3?)</p>
<p><strong>Leaky Units</strong>:</p>
<p>There are units that specialize in long-term or short-term memory. This depends on a factor <script type="math/tex">\alpha</script>.</p>
<p><strong>LSTM</strong></p>
<ul>
<li>Delay update of hidden layer</li>
<li>Special transfer function (only retrieve state in certain cases)</li>
</ul>
<h2 id="radial-basis-function-networks">Radial Basis Function Networks</h2>
<p>Also see <a href="https://en.wikipedia.org/wiki/Radial_basis_function_network">Wikipedia</a>.</p>
<p><img src="/assets/Radial_funktion_network.svg" alt="3-layered Radial basis function network" /></p>
<p>A radial basis function is a function that depends only on the distance from a center (usually the Euclidean distance).</p>
<script type="math/tex; mode=display">\phi_i(\underline x) = \tilde{\phi}_i(D[\underline x, \underline t_i])</script>
<p>A Gaussian function is often used:</p>
<script type="math/tex; mode=display">\phi_i(\underline x) = \exp\left(-\frac{\|\underline x -\underline t_i\|^2}{2\sigma^2_i}\right)</script>
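<p>As a small sketch of the Gaussian basis function (the function name and the example points are made up): the value is 1 at the center and identical for all points at the same Euclidean distance from it.</p>

```python
import numpy as np

def gaussian_rbf(x, t, sigma):
    """Gaussian radial basis function with center t and width sigma."""
    squared_distance = np.sum((np.asarray(x) - np.asarray(t)) ** 2)
    return np.exp(-squared_distance / (2.0 * sigma ** 2))

center = np.array([0.0, 0.0])
at_center = gaussian_rbf([0.0, 0.0], center, sigma=1.0)
# Two different points with the same distance 1 to the center.
same_distance_a = gaussian_rbf([1.0, 0.0], center, sigma=1.0)
same_distance_b = gaussian_rbf([0.0, -1.0], center, sigma=1.0)

print(at_center, same_distance_a, same_distance_b)
```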
<h3 id="learning-with-rbfs">Learning with RBFs</h3>
<p>Three different parameters:</p>
<ul>
<li><script type="math/tex">\underline t_i</script> centroid (center of basis function)</li>
<li><script type="math/tex">\sigma_i</script> range of influence</li>
<li><script type="math/tex">w_i</script> weights of the output layer</li>
</ul>
<p>A two-step learning procedure is an alternative to learning all parameters at once.</p>
<ul>
<li>Find centroids and variances <script type="math/tex">\sigma_i</script></li>
<li>Determine output weights <script type="math/tex">\underline w</script></li>
</ul>
<p><strong>Find centroids and variances</strong></p>
<p>Use k-means clustering to find the centroids.</p>
<p>Choose <script type="math/tex">\sigma_i</script> so that it is about double the distance from <script type="math/tex">\underline t_i</script> to its closest neighboring centroid.</p>
<script type="math/tex; mode=display">\sigma_i= \lambda \min_{j\neq i} \|\underline t _i -\underline t_j\|, \quad \lambda \approx 2</script>
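<p>A sketch of the width rule, assuming the centroids have already been found (for example by k-means). The helper name and the example centroids are made up.</p>

```python
import numpy as np

def rbf_widths(centroids, lam=2.0):
    """sigma_i = lam * distance from centroid i to its nearest other centroid."""
    centroids = np.asarray(centroids, dtype=float)
    diffs = centroids[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(distances, np.inf)   # exclude j == i from the minimum
    return lam * distances.min(axis=1)

centroids = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
sigmas = rbf_widths(centroids)
print(sigmas)
```

<p>The isolated centroid gets a wide basis function, the two close centroids get narrow ones.</p>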
<p><strong>Determine output weights</strong></p>
<p>Output weights are found by minimizing the quadratic error, with <em>M</em> := number of RBFs:</p>
<script type="math/tex; mode=display">E^T = \frac{1}{2p}\sum^p_{\alpha=1}\left(y_T^{(\alpha)} -\sum^M_{i=1} w_i\phi_i(\underline x^{(\alpha)})\right)^2</script>
<p>Pseudo-inverse</p>
<script type="math/tex; mode=display">(\underline \Phi^T \underline \Phi ) \underline w = \underline \Phi ^T \underline y _T \implies
\underline w = (\underline \Phi ^T \underline \Phi )^{-1} \underline \Phi ^T \underline y _T</script>
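<p>One way to compute the output weights directly, sketched under the assumption that centroids and widths are already fixed and that <script type="math/tex">\underline \Phi^T \underline \Phi</script> is invertible. The data, the helper names and the "true" weights are made up for illustration. <code>np.linalg.lstsq</code> solves the same least-squares problem as the normal equations without forming the inverse explicitly.</p>

```python
import numpy as np

def design_matrix(X, centroids, sigmas):
    """Phi[alpha, i] = Gaussian RBF i evaluated at pattern alpha."""
    squared = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    return np.exp(-squared / (2.0 * sigmas[None, :] ** 2))

def fit_output_weights(Phi, y_true):
    # Least-squares solution of Phi w = y_T (pseudo-inverse solution).
    w, *_ = np.linalg.lstsq(Phi, y_true, rcond=None)
    return w

X = np.linspace(-3, 3, 40)[:, None]            # 40 one-dimensional patterns
centroids = np.array([[-2.0], [0.0], [2.0]])   # assumed to be given
sigmas = np.array([1.0, 1.0, 1.0])
w_true = np.array([1.0, -2.0, 0.5])            # hypothetical weights generating the targets

Phi = design_matrix(X, centroids, sigmas)
y_true = Phi @ w_true

w = fit_output_weights(Phi, y_true)
# The same result via the normal equations (Phi^T Phi) w = Phi^T y_T.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_true)
print(w, w_normal)
```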
<p>TODO: Do we now use Gradient Descent or invertible matrix?</p>
<h3 id="mlp-vs-rbf">MLP vs RBF</h3>
<p>RBFs converge fast: only a few parameters need to be changed per training point, because each basis function has negligible influence on far-away points.</p>
<p>RBFs suffer from the curse of dimensionality: they need <script type="math/tex">n^d</script> basis functions (<em>n</em> number of data points along one dimension, <em>d</em> number of dimensions).</p>
<p>RBFs are kernel functions that make it possible to map non-linear data into a space where it is linear and then do regression on it.</p>
<p>RBFs are useful for low-dimensional data.</p>
Sat, 17 Feb 2018 15:30:10 +0100
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/02/17/neural-networks.html
Bayesian Networks
<p>These are exam preparation notes, subpar in quality and certainly not of divine quality.</p>
<p>See the <a href="/machine/learning/intelligence/supervised/2018/02/26/mi1-index.html">index with all articles in this series</a></p>
<hr />
<h2 id="bayes-rule">Bayes Rule</h2>
<script type="math/tex; mode=display">P(A|B) = \frac{P(B|A)P(A)}{P(B)}</script>
<h2 id="inference">Inference</h2>
<p>Bayesian inference is about inferring probabilities from prior probabilities.</p>
<p>Given a set of prior events, a Bayesian network estimates the probability for another event.</p>
<p><strong>Justification of using heuristics</strong>:</p>
<p>In real-world scenarios you never know the true probability of events.
To estimate <script type="math/tex">P</script>, it makes sense to look at historic events and extrapolate their observed frequencies into the future.</p>
<p>To get these probabilities, a set of so-called atomic events is used. An atomic event assigns a value to each random variable, and atomic events are mutually exclusive.</p>
<p>The problem with Bayesian networks is that a fully connected network over <em>N</em> binary variables needs on the order of <script type="math/tex">2^N</script> probabilities. This makes it impractical to precalculate all inferred probabilities.</p>
<p>This is why we need further prior knowledge. The additional prior knowledge in this case is the causality between events.
If the causality between events is known or can be ruled out, this affects the interconnections and drastically reduces the number of edges.</p>
<h3 id="from-unconditional-to-conditional">From unconditional to conditional</h3>
<script type="math/tex; mode=display">P(x|\underline e) = \frac{P(x,\underline e)}{P(\underline e)} = \alpha P(x, \underline e) =\alpha \sum_{\underline y} P(x,\underline e, \underline y)</script>
<p><script type="math/tex">\alpha</script> can be found using:</p>
<script type="math/tex; mode=display">\frac{1}{\alpha}=P(\underline e)=\sum_{x,\underline y} P(x,\underline e, \underline y)</script>
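<p>Summing over x and <script type="math/tex">\underline y</script> is possible because their conditional distribution given <script type="math/tex">\underline e</script> sums to one, leaving only <script type="math/tex">P(\underline e)</script>.</p>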
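<p>A sketch of this normalization on a full joint table, assuming three binary variables stored as an array indexed <code>joint[x, e, y]</code>. The table is random and only serves as an example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))   # made-up joint table P(x, e, y)
joint /= joint.sum()            # all atomic events together sum to one

def posterior_x_given_e(joint, e):
    unnormalized = joint[:, e, :].sum(axis=1)   # sum over y gives P(x, e)
    alpha = 1.0 / joint[:, e, :].sum()          # 1 / alpha = P(e), sum over x and y
    return alpha * unnormalized

posterior = posterior_x_given_e(joint, e=1)
print(posterior, posterior.sum())
```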
<h3 id="conditional-independence">Conditional independence</h3>
<p>X and Y are conditionally independent given Z if</p>
<script type="math/tex; mode=display">P(X,Y|Z) = P(X|Z)P(Y|Z)</script>
<p>alternative notation</p>
<script type="math/tex; mode=display">X ⫫ Y | Z</script>
<h3 id="markov-blanket">Markov Blanket</h3>
<p>If the Markov blanket of a node <em>A</em> is given, <em>A</em> is conditionally independent of all other nodes in the network.</p>
<p>Members of the Markov blanket are:</p>
<ul>
<li>parents,</li>
<li>children,</li>
<li>parents of the children</li>
</ul>
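<p>The three groups can be collected from a DAG with a few set operations. This is a sketch that assumes the graph is stored as a dictionary mapping each node to the set of its parents. The example graph is made up.</p>

```python
def markov_blanket(parents, node):
    """parents maps every node of the DAG to the set of its parents."""
    children = {n for n, ps in parents.items() if node in ps}
    parents_of_children = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | parents_of_children) - {node}

parents = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"C"},
    "E": {"C", "F"},
    "F": set(),
    "G": {"D"},
}
blanket_of_c = markov_blanket(parents, "C")
print(sorted(blanket_of_c))
```

<p>Here <em>G</em> is not part of the blanket of <em>C</em>, because it is only a grandchild.</p>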
<p><img src="/assets/Diagram_of_a_Markov_blanket.svg" alt="A bad picture from wikipedia" /></p>
<h3 id="topological-sorting">Topological Sorting</h3>
<p>Sort a graph in an order, so that no node comes prior to a node pointing to it.</p>
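<p>A sketch of one common way to do this, Kahn's algorithm (this particular algorithm is my choice and not necessarily the one from the lecture): repeatedly output a node that has no remaining incoming edges.</p>

```python
from collections import deque

def topological_sort(nodes, edges):
    """Return the nodes in an order where every edge (src, dst) has src before dst."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)

    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, no topological order exists")
    return order

nodes = ["A", "B", "C", "D"]
edges = [("A", "C"), ("B", "C"), ("C", "D")]
order = topological_sort(nodes, edges)
print(order)
```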
<h3 id="message-passing">Message Passing</h3>
<p><a href="https://www.youtube.com/watch?v=6k7o3-UzUM0">in Bipartite Trees</a></p>
<p>In bipartite trees there are three phases of message passing</p>
<ul>
<li>request</li>
<li>collect</li>
<li>distribute</li>
</ul>
<p>Unaffected nodes are marginalized out. If at a junction, multiply the messages from both edges.</p>
<p><em>Note:</em> The bipartite tree is undirected because Bayesian probabilities are reversible. Also, the nodes <script type="math/tex">f_i</script> are already aware of the direction.</p>
<h3 id="junction-trees">Junction Trees</h3>
<p>A junction tree (aka clique tree, join tree) is a decomposed graph that is now a tree with special properties.
This makes solving certain problems, especially inference, easier.</p>
<p>Not all acyclic graphs map to trees. A junction tree exists iff the graph is decomposable.</p>
<p>A <strong>separator</strong> separates two sets of nodes <em>A</em> and <em>B</em> if every path from <em>A</em> to <em>B</em> has to pass through the separator.</p>
<p><em>A,B,C</em> is a <strong>proper decomposition</strong> if <em>C</em> separates <em>A</em> and <em>B</em> and <em>C</em> is complete.</p>
<p>A set of nodes is <strong>complete</strong> if all nodes are fully interconnected.</p>
<p>A <strong>clique</strong> is a maximal complete subgraph.</p>
<blockquote>
<p>[A <strong>moral graph</strong> is the] counterpart of a directed acyclic graph [that] is formed by adding edges between all pairs of nodes that have a common child, and then making all edges in the graph undirected.</p>
</blockquote>
<p><a href="https://en.wikipedia.org/wiki/Moral_graph">Source</a></p>
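<p>The quoted construction can be sketched in a few lines, again assuming the DAG is given as a dictionary from each node to the set of its parents. The representation of undirected edges as <code>frozenset</code> pairs is just a convenient choice for this example.</p>

```python
from itertools import combinations

def moral_graph(parents):
    """Return the undirected edges of the moral graph as a set of frozensets."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))      # keep the edge, drop its direction
        for p, q in combinations(sorted(ps), 2):
            edges.add(frozenset((p, q)))          # connect parents with a common child
    return edges

parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
moral_edges = moral_graph(parents)
print(sorted(sorted(edge) for edge in moral_edges))
```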
<h4 id="composing-the-tree">Composing the tree</h4>
<ul>
<li>construct DAG</li>
<li>convert to <a href="https://en.wikipedia.org/wiki/Moral_graph">moral graph</a></li>
<li>construct a <a href="https://en.wikipedia.org/wiki/Chordal_graph">chordal graph</a></li>
<li>identify cliques</li>
<li>construct bipartite graph</li>
<li>create junction tree</li>
<li>Do <a href="https://en.wikipedia.org/wiki/Belief_propagation">message passing</a></li>
</ul>
Thu, 11 Jan 2018 15:30:10 +0100
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/01/11/bayesian-networks.html
Statistical Learning Theory
<p>These are exam preparation notes, subpar in quality and certainly not of divine quality.</p>
<p>See the <a href="/machine/learning/intelligence/supervised/2018/02/26/mi1-index.html">index with all articles in this series</a></p>
<hr />
<p>In a classification problem the desired goal is to reduce the generalization error <script type="math/tex">E^G</script>.
Unfortunately, during training it is only possible to evaluate the classifier against a limited amount of data: the training data set.
Therefore we can only measure <script type="math/tex">E^T</script>.</p>
<p>The problem we want to solve is to know how good our classifier actually is without additional data.</p>
<p>Statistical learning theory allows you to give an upper bound on the overfitting.</p>
<p>A good introduction into this <a href="https://www.youtube.com/watch?v=8yWG7fhCpTw">upper bound and the upcoming topic</a> can be found on YouTube.</p>
<p>There is the concept of the capacity of a classifier.</p>
<p>…</p>
Thu, 11 Jan 2018 10:16:10 +0100
https://www.niels-ole.com/machine/learning/intelligence/supervised/2018/01/11/statistical-learning-theory.html