<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Tales from the Tail of the Oplog]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://tail.oplog.rs/</link><image><url>https://tail.oplog.rs/favicon.png</url><title>Tales from the Tail of the Oplog</title><link>https://tail.oplog.rs/</link></image><generator>Ghost 5.1</generator><lastBuildDate>Thu, 16 Apr 2026 06:38:03 GMT</lastBuildDate><atom:link href="https://tail.oplog.rs/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Linux 6.894757 kPa]]></title><description><![CDATA[Looking at the load.1/load.5/load.15 charts is a daily chore for all system admins. Trying to make sense of it is a whole different thing.]]></description><link>https://tail.oplog.rs/linux-6895-pa/</link><guid isPermaLink="false">6295216622d546fd1e042610</guid><category><![CDATA[Linux]]></category><dc:creator><![CDATA[Nikola Petrović]]></dc:creator><pubDate>Thu, 20 Oct 2022 08:52:20 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1559590836-9eb74007ab44?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDV8fHByZXNzdXJlfGVufDB8fHx8MTY1Mzk0MzkyMw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1559590836-9eb74007ab44?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDV8fHByZXNzdXJlfGVufDB8fHx8MTY1Mzk0MzkyMw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Linux 6.894757 kPa"><p>Linux <strong>Pressure Stall Information</strong> is a kernel feature identifying and quantifying the pressure on different hardware resources, making it easier for the human looking at the stats to make sense of the unrevealing 
<em>load</em> number.</p><p>It will show separate stats for CPU, Memory and I/O starvation, enabling a granular understanding of the machine&apos;s workload.</p><h3 id="making-sense-of-the-title">Making Sense of the Title</h3><p>6.894757 kPa equals 1 <strong>PSI</strong>, but I am European, and it is well known we can&apos;t comprehend Imperial units. This has absolutely nothing to do with <strong>Pressure Stall Information</strong>, though.</p><h2 id="the-load-average">The Load Average</h2><p>CPU, RAM and I/O are essential for any system&apos;s performance, and if any one of those is lacking for the workload a machine needs to handle, tasks will pile up in queues, waiting for the scarce resource.</p><p>The usual way of assessing that is to look at the average load via <code>uptime</code>, <code>top</code> or something similar. Nothing wrong with that, but it doesn&apos;t tell us the nuances of why the system is under load, or what we should do to prevent it.</p><p>The average load includes all tasks in a <code>TASK_RUNNABLE</code> or <code>TASK_UNINTERRUPTIBLE</code> state, the latter being described as a &quot;task waiting for a resource&quot;. The appropriate action from the system admin is clearly different depending on whether we are lacking cpu, memory or io. The avg load number is thus just a count of tasks, which will be helpful in conjunction with the PSI stats, as we will see later.</p><p>Granularity is also pretty coarse, given that we have load averages over 1, 5 and 15 minute windows. If we have a latency-sensitive task, it could be running into issues well before it moves the needle on a 1 minute average.</p><h2 id="in-comes-the-saviourpsi-stats">In Comes the Saviour - PSI stats</h2><p>The PSI stats will show a resource being waited on in averaging windows of 10, 60 and 300 seconds. 
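</p><p>As a quick illustration, each line in the <code>/proc/pressure</code> files (e.g. <code>/proc/pressure/cpu</code>) looks like <code>some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722</code>. A minimal Python sketch of parsing such a line, and of deriving an average over a custom window from two samples of the <code>total</code> counter (all sample values here are made up for illustration):</p>

```python
def parse_psi_line(line):
    """Parse one line of a /proc/pressure file into a dict.

    Example line: "some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722"
    """
    kind, *fields = line.split()
    values = dict(f.split("=") for f in fields)
    return {
        "kind": kind,  # "some" or "full"
        "avg10": float(values["avg10"]),
        "avg60": float(values["avg60"]),
        "avg300": float(values["avg300"]),
        "total": int(values["total"]),  # absolute stall time, in microseconds
    }


def custom_window_pct(total_start_us, total_end_us, window_s):
    """Stall percentage over a custom window, from two `total` samples."""
    return (total_end_us - total_start_us) / (window_s * 1_000_000) * 100


# Illustrative sample, not real output:
line = "some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722"
stats = parse_psi_line(line)

# 2 seconds of stall time accumulated over a 4 second window -> 50%:
pct = custom_window_pct(stats["total"], stats["total"] + 2_000_000, 4)
```

<p>On a live system, re-reading and parsing <code>/proc/pressure/cpu</code> at a chosen interval gives arbitrarily fine-grained averages.</p><p>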
If that isn&apos;t granular enough, there is also a <code>total</code> field, which is the absolute stall time in microseconds, allowing for custom time averaging.</p><p>The stats are written to three files in the <code>/proc/pressure</code> directory, named <code><strong>cpu</strong></code>, <code><strong>memory</strong></code> and <code><strong>io</strong></code>. Each of the files has two lines, starting with <code>some</code> and <code>full</code>. <code>Some</code> means that one or more tasks were delayed waiting for the resource, while <code>full</code> means that none of the tasks were getting any use of the resource for the percentage of time expressed.</p><h3 id="cpu">CPU</h3><p>The <code><strong>/proc/pressure/cpu</strong></code> file will show both <code>some</code> and <code>full</code> values.</p><p><code>Some</code> means one or more tasks were waiting for a cpu for the given percentage of time.</p><p><code>Full</code> is there just to align the format with the other two files, where the value makes more sense. In the case of the cpu, it would mean we are constantly context switching and not doing anything useful, which shouldn&apos;t be possible.</p><h3 id="memory">Memory</h3><p>Waiting on memory usually means waiting on <strong>swap-ins</strong>, page cache <strong>refaults</strong> or page <strong>reclaims</strong>. Since we usually have a limited amount of memory available compared to the data set, we need to swap pages in and out of it. 
In that process some pages move from the <code>inactive</code> to the <code>active</code> list (a page fault gets the page loaded into memory, but placed on the <code>inactive</code> list, and a <strong>refault</strong> moves it to the <code>active</code> list, so it gets <strong>reclaimed</strong> later rather than sooner).</p><p>Low numbers for the <code>some</code> stat could be acceptable, but a high value for the <code>full</code> stat means the tasks spend their time <strong>thrashing</strong> their own pages in memory (loading and reclaiming in a cycle, before any real work can be done), and we have created too much memory pressure for the system to work properly. The <code>full</code> stat should be considered problematic even at low values.</p><p>The important point is that although this stat is about memory contention, it means cpu cycles are getting wasted on unproductive work.</p><h3 id="io">I/O</h3><p>The i/o file is similar to memory in concept, reporting on all i/o-caused delays.</p><h2 id="relativity-is-important">Relativity is important</h2><p>Let&apos;s say we have a single task running on the single cpu core we have, and no other tasks are waiting in the queue. <strong><code>PSI/cpu</code></strong> would be 0. If we add one more task, making the two take turns on the cpu (without waiting for any other resource, for simplicity), we would have <code>some <strong>PSI/cpu</strong></code> at 100, meaning that one or more tasks are waiting for the cpu 100% of the time. If we have only two tasks, and the scheduler is fair, this means we have perfect utilization of our CPU resource, given that no cycles are wasted.</p><p>The calculation does take into account the number of cpus, so the PSI values are <strong>relative</strong> to the number of tasks actually running. 
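</p><p>The instantaneous accounting can be sketched in Python, following the model described in the comments of <code>kernel/sched/psi.c</code> (a simplification of what the kernel tracks over time, not the actual implementation):</p>

```python
def psi_instant(nr_nonidle, nr_delayed, nr_productive, nr_cpus):
    """Instantaneous SOME/FULL state, per the kernel/sched/psi.c model:

    threads = min(nr_nonidle_tasks, nr_cpus)
    SOME    = min(nr_delayed_tasks / threads, 1)
    FULL    = (threads - min(nr_productive_tasks, threads)) / threads
    """
    threads = min(nr_nonidle, nr_cpus)
    some = min(nr_delayed / threads, 1.0)
    full = (threads - min(nr_productive, threads)) / threads
    return some * 100, full * 100


# One task running, one delayed, on a single core -> some is 100%:
some, full = psi_instant(nr_nonidle=2, nr_delayed=1, nr_productive=1, nr_cpus=1)
```

<p>The kernel averages these instantaneous states over the 10/60/300 second windows to produce the reported percentages.</p><p>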
If we had two tasks running on the two available cpu cores, and one task waiting, <code>some <strong>PSI/cpu</strong></code> would be 50, as 50% of the number of tasks running are being held back in a queue. For the actual formulas used, you can have a look at the <a href="https://github.com/torvalds/linux/blob/v6.0/kernel/sched/psi.c">code in the kernel repo</a>.</p><p>If this <code>some</code>/<code>full</code> logic is confusing, read this passage until it is clear:</p><p><em>To calculate wasted potential (pressure) with multiple processors, we have to base our calculation on the number of non-idle tasks in conjunction with the number of available CPUs, which is the number of potential execution threads. SOME becomes then the proportion of delayed tasks to possible threads, and FULL is the share of possible threads that are unproductive due to delays.</em></p><h2 id="real-world-example">Real-world example</h2><p>We can look at an example of an Elasticsearch cluster under load. Comparing the charts for the average load and the PSI stats, we can see how they reveal different, valuable information.</p><h3 id="load">Load</h3><p>Should the load shown below be considered problematic?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.44.48.png" class="kg-image" alt="Linux 6.894757 kPa" loading="lazy" width="932" height="340" srcset="https://tail.oplog.rs/content/images/size/w600/2022/10/Screenshot-2022-10-08-at-18.44.48.png 600w, https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.44.48.png 932w" sizes="(min-width: 720px) 720px"><figcaption>1 Minute Load average</figcaption></figure><p>Given this chart, we can see that the system is under some load. 
But, to assess how much, and whether it presents a problem, we&apos;d need to know the number of cpu cores on each instance in question, and, for memory or i/o driven load, we&apos;d be in the dark. For reference, the data-bearing instances in question have 16 cores. In case we have different instances working together (e.g. Elasticsearch has master nodes, usually set up on smaller instances), it wouldn&apos;t be easy to look at one chart and deduce the relative load level.</p><h3 id="psi">PSI</h3><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.46.27.png" class="kg-image" alt="Linux 6.894757 kPa" loading="lazy" width="924" height="356" srcset="https://tail.oplog.rs/content/images/size/w600/2022/10/Screenshot-2022-10-08-at-18.46.27.png 600w, https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.46.27.png 924w" sizes="(min-width: 720px) 720px"><figcaption>Pressure Stall Information - some tasks starved on CPU (% of time)</figcaption></figure><p>Here, the CPU stats show some of the instances having tasks waiting on the CPU near 100% of the time. 
This is clearly a potential issue, and it is easy to grasp the relative starvation even with differently sized instances on one chart.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.51.25.png" class="kg-image" alt="Linux 6.894757 kPa" loading="lazy" width="926" height="354" srcset="https://tail.oplog.rs/content/images/size/w600/2022/10/Screenshot-2022-10-08-at-18.51.25.png 600w, https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.51.25.png 926w" sizes="(min-width: 720px) 720px"><figcaption>Pressure Stall Information - some (all - full) tasks waiting on memory (% of time)</figcaption></figure><p>We can rule out memory pressure as a driver in this case, as tasks are waiting on memory under 0.5% of the time.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.51.34.png" class="kg-image" alt="Linux 6.894757 kPa" loading="lazy" width="926" height="354" srcset="https://tail.oplog.rs/content/images/size/w600/2022/10/Screenshot-2022-10-08-at-18.51.34.png 600w, https://tail.oplog.rs/content/images/2022/10/Screenshot-2022-10-08-at-18.51.34.png 926w" sizes="(min-width: 720px) 720px"><figcaption>Pressure Stall Information - some (all - full) tasks waiting on I/O (% of time)</figcaption></figure><p>In this case, we can see the i/o hold-up is not too high, under 20% of the time, except for one spike. It is the main driver of the load between 2:45 and 3:45 though. 
If it were consistently higher, we should consider upgrading the i/o throughput somehow.</p><p>Looking at all three <strong>PSI</strong> charts, we can see that CPU has been the cause of the starvation, and, looking at the <strong>load avg</strong> chart, we had roughly the same number of tasks waiting in the queue as we had running on the 16 cpu cores of the most affected instances.</p><p>If we notice issues elsewhere in the system, we would need to scale up based on cpu cores. We could choose different instance types, or just add more instances in the case of Elasticsearch, which can scale both horizontally and vertically.</p><h2 id="conclusion">Conclusion</h2><p>If we know which resource we are lacking, we can fine-tune the setup, use different hardware, or deduce more precisely which part of our application (creating the queries) might be the load driver. Using the <strong>PSI</strong> stats along with the <strong>load avg</strong> gives us an excellent insight into the state of our system.</p><h2 id="dig-deeper">Dig Deeper</h2><p><a href="https://github.com/torvalds/linux/blob/v6.0/kernel/sched/psi.c">https://github.com/torvalds/linux/blob/v6.0/kernel/sched/psi.c</a></p><p><a href="https://github.com/torvalds/linux/blob/master/mm/workingset.c">https://github.com/torvalds/linux/blob/master/mm/workingset.c</a></p><p><a href="https://en.wikipedia.org/wiki/Thrashing_%28computer_science%29">https://en.wikipedia.org/wiki/Thrashing_%28computer_science%29</a></p><p><a href="https://biriukov.dev/docs/page-cache/0-linux-page-cache-for-sre/">https://biriukov.dev/docs/page-cache/0-linux-page-cache-for-sre/</a></p><p><a href="https://lwn.net/Articles/759658/">https://lwn.net/Articles/759658/</a></p><p><a href="https://docs.kernel.org/accounting/psi.html">https://docs.kernel.org/accounting/psi.html</a></p><p><a 
href="https://facebookmicrosites.github.io/psi/docs/overview">https://facebookmicrosites.github.io/psi/docs/overview</a></p>]]></content:encoded></item><item><title><![CDATA['Hello' to This Blog and MongoDB]]></title><description><![CDATA[The implicit change is that the hello response could be expected to take some time, as "Awaitable hello or legacy hello Server Specification" states ...]]></description><link>https://tail.oplog.rs/hello-blog-mongodb/</link><guid isPermaLink="false">627e965f621c02a7c7ee5430</guid><category><![CDATA[MongoDB]]></category><dc:creator><![CDATA[Nikola Petrović]]></dc:creator><pubDate>Mon, 16 May 2022 06:49:55 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1612324390726-14d51b4a66e4?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDIzfHxtYXN0ZXJ8ZW58MHx8fHwxNjUyNDY0NDQ3&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1612324390726-14d51b4a66e4?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDIzfHxtYXN0ZXJ8ZW58MHx8fHwxNjUyNDY0NDQ3&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="&apos;Hello&apos; to This Blog and MongoDB"><p>This being the first post of this blog, it seems right for the topic to be about <code>hello</code>.</p><p>Previously, MongoDB had a function called <code><a href="https://www.mongodb.com/docs/v4.2/reference/method/db.isMaster/">isMaster</a></code> which got deprecated out of political correctness. Like in other computer systems, the master/slave metaphor had to go due to the outbreak of woke culture. 
Although it does make sense in some cases, this discussion is beyond the topic of this post (but we might come back to it), so let&apos;s focus on the replacement function, <code>hello</code>.</p><h2 id="whats-the-purpose">What&apos;s the Purpose?</h2><p>As the <a href="https://www.mongodb.com/docs/v4.4/reference/command/hello/#hello">documentation</a> states:</p><!--kg-card-begin: markdown--><blockquote>
<p>MongoDB drivers and clients use <code>hello</code> to determine the state of the replica set members and to discover additional members of a replica set.</p>
</blockquote>
<!--kg-card-end: markdown--><h2 id="are-there-any-changes">Are There any Changes?</h2><p>Yes, there are:</p><ol><li>The old command is now <strong>renamed</strong> to <code>hello</code>, while <code>isMaster</code> is referred to as &apos;legacy hello&apos; in some documentation.</li><li>Monitoring of MongoDB 4.4+ servers moved to a <strong>streaming protocol</strong>, to reduce the time it takes for a client to discover server state changes.</li></ol><p>The implicit change is that the <code>hello</code> response could be expected to take some time, as the &quot;Awaitable hello or legacy hello <a href="https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-monitoring.rst#id29">Server Specification</a>&quot; states:</p><!--kg-card-begin: markdown--><blockquote>
<p>As of MongoDB 4.4 the hello or legacy hello command can wait to reply until there is a topology change or a maximum time has elapsed. Clients opt in to this &quot;awaitable hello&quot; feature by passing new parameters &quot;topologyVersion&quot; and &quot;maxAwaitTimeMS&quot; to the hello or legacy hello commands.</p>
</blockquote>
<!--kg-card-end: markdown--><p>Judging by multiple internal MongoDB Core Server tickets, and by this example of a <code>hello</code> command caught in <code>currentOp()</code> on a local <code>mongosh</code>, the usual <code>await</code> time is 10s (note the <code>maxAwaitTimeMS: 10000</code> below):</p><pre><code class="language-json">{
      type: &apos;op&apos;,
      host: &apos;mtrusanj.local:27017&apos;,
      desc: &apos;conn3&apos;,
      connectionId: 3,
      client: &apos;127.0.0.1:49482&apos;,
      appName: &apos;mongosh 1.4.1&apos;,
      clientMetadata: {
        driver: { name: &apos;nodejs|mongosh&apos;, version: &apos;4.6.0&apos; },
        os: {
          type: &apos;Darwin&apos;,
          name: &apos;darwin&apos;,
          architecture: &apos;x64&apos;,
          version: &apos;21.4.0&apos;
        },
        platform: &apos;Node.js v16.15.0, LE (unified)&apos;,
        version: &apos;4.6.0|1.4.1&apos;,
        application: { name: &apos;mongosh 1.4.1&apos; }
      },
      active: true,
      currentOpTime: &apos;2022-05-14T13:45:48.453+02:00&apos;,
      threaded: true,
      opid: 5451,
      secs_running: Long(&quot;1&quot;),
      microsecs_running: Long(&quot;1808246&quot;),
      op: &apos;command&apos;,
      ns: &apos;admin.$cmd&apos;,
      command: {
        hello: true,
        maxAwaitTimeMS: 10000,
        topologyVersion: {
          processId: ObjectId(&quot;627f941da09af704219843b0&quot;),
          counter: Long(&quot;0&quot;)
        },
        &apos;$db&apos;: &apos;admin&apos;
      },
      numYields: 0,
      waitingForLatch: {
        timestamp: ISODate(&quot;2022-05-14T11:45:46.751Z&quot;),
        captureName: &apos;AnonymousLatch&apos;
      },
      locks: {},
      waitingForLock: false,
      lockStats: {},
      waitingForFlowControl: false,
      flowControlStats: {}
}</code></pre><h2 id="why-is-this-important">Why is this important?</h2><p>If you have been monitoring slow operations in <code>currentOp()</code> after upgrading to MongoDB 4.4+, you may have seen <code>hello</code> commands reported as &apos;slow&apos;, or taking more than 1s.</p><p>As discussed, this is now expected behaviour, and should be ignored by any monitoring applications.</p>]]></content:encoded></item></channel></rss>