<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Monitoring | 2i2c</title><link>https://deploy-preview-614--2i2c-org.netlify.app/tag/monitoring/</link><atom:link href="https://deploy-preview-614--2i2c-org.netlify.app/tag/monitoring/index.xml" rel="self" type="application/rss+xml"/><description>Monitoring</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 16 Dec 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-614--2i2c-org.netlify.app/media/sharing.png</url><title>Monitoring</title><link>https://deploy-preview-614--2i2c-org.netlify.app/tag/monitoring/</link></image><item><title>Improving our community hub reliability and stability in Q4 2025</title><link>https://deploy-preview-614--2i2c-org.netlify.app/blog/infrastructure-reliability-q4-2025/</link><pubDate>Tue, 16 Dec 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-614--2i2c-org.netlify.app/blog/infrastructure-reliability-q4-2025/</guid><description>&lt;p>This year we&amp;rsquo;ve prioritized &lt;strong>making the cloud safe to try&lt;/strong> for our member communities. This has driven work in monitoring, alerting, and automating infrastructure so that we resolve small problems before they become big problems. In the last quarter of 2025, we wrapped up this effort by testing the following hypothesis:&lt;/p>
&lt;blockquote>
&lt;p>We can reduce P1 incidents if we shorten the time to act on current alerts and learnings from prior incidents.&lt;/p>
&lt;/blockquote>
&lt;p>Here&amp;rsquo;s what we accomplished and what we learned.&lt;/p>
&lt;h2 id="what-we-accomplished">
What we accomplished
&lt;a class="header-anchor" href="#what-we-accomplished">#&lt;/a>
&lt;/h2>&lt;p>In short: we&amp;rsquo;re now much more confident in the stability of community infrastructure.
Here&amp;rsquo;s a snapshot of our new incident dashboard, which shows high-level trends for the stability of our infrastructure:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Dashboard of pagerduty status page for 2i2c" srcset="
/blog/infrastructure-reliability-q4-2025/featured_hu04df3383ec51b90b248012f6472de1e6_185237_a47d9c707f54757cba94700be6c3c216.webp 400w,
/blog/infrastructure-reliability-q4-2025/featured_hu04df3383ec51b90b248012f6472de1e6_185237_a6c12809ca27d3fc4c1c81f7b28ea33a.webp 760w,
/blog/infrastructure-reliability-q4-2025/featured_hu04df3383ec51b90b248012f6472de1e6_185237_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-614--2i2c-org.netlify.app/blog/infrastructure-reliability-q4-2025/featured_hu04df3383ec51b90b248012f6472de1e6_185237_a47d9c707f54757cba94700be6c3c216.webp"
width="760"
height="394"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>See the real-time status of our community hubs at
&lt;a href="http://status.2i2c.org" target="_blank" rel="noopener" >status.2i2c.org&lt;/a>&lt;/em>&lt;/p>
&lt;h3 id="we-improved-infrastructure-reliability-for-our-communities">
We improved infrastructure reliability for our communities
&lt;a class="header-anchor" href="#we-improved-infrastructure-reliability-for-our-communities">#&lt;/a>
&lt;/h3>&lt;p>We made several technology and team process improvements that led to these benefits for our communities:&lt;/p>
&lt;ol>
&lt;li>We are now more likely to catch outages before a community reports them to us.&lt;/li>
&lt;li>We are now less likely to have an outage happen more than once, or affect more than one community, because we consistently fix the issues that cause outages.&lt;/li>
&lt;/ol>
&lt;p>We saw a consistent drop in critical alerts that required immediate response:&lt;/p>
&lt;ul>
&lt;li>For August and September we had an average of 7 outages/month (6 from alerts, 1 from community)&lt;/li>
&lt;li>In October, November, and December we had an average of about 3 outages/month (9 in October, 0 in November, 1 in December), with only one of these reported by a community&lt;/li>
&lt;/ul>
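&lt;p>As a quick check of the arithmetic behind these averages, here is a small Python sketch that uses only the monthly counts quoted above. The variable names are ours, chosen for illustration.&lt;/p>

```python
# Monthly outage counts quoted in this post for Q4 2025.
q4_outages = {"October": 9, "November": 0, "December": 1}

# Average for August and September as reported above
# (6 outages from alerts + 1 reported by a community).
earlier_average = 6 + 1

q4_average = sum(q4_outages.values()) / len(q4_outages)
reduction = 1 - q4_average / earlier_average

print(f"Q4 average: {q4_average:.1f} outages/month")
print(f"Reduction versus August and September: {reduction:.0%}")
```

&lt;p>This works out to roughly 3.3 outages per month, about half the August and September rate.&lt;/p>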
&lt;h3 id="we-became-more-efficient-responsive-and-focused">
We became more efficient, responsive, and focused
&lt;a class="header-anchor" href="#we-became-more-efficient-responsive-and-focused">#&lt;/a>
&lt;/h3>&lt;p>We also got several team benefits from this work:&lt;/p>
&lt;ol>
&lt;li>We get fewer interruptions and distractions from deeper work.&lt;/li>
&lt;li>We have assignment policies that make it clear who is responsible for acting in response to alerts.&lt;/li>
&lt;li>We avoid the invisible work of falling down rabbit holes when responding to outages.&lt;/li>
&lt;li>We decreased the stress and pressure of doing upgrades, making them easier to split into sprint items and more likely to get done consistently.&lt;/li>
&lt;/ol>
&lt;h2 id="the-improvements-we-made">
The improvements we made
&lt;a class="header-anchor" href="#the-improvements-we-made">#&lt;/a>
&lt;/h2>
&lt;h3 id="infrastructure-improvements">
Infrastructure improvements
&lt;a class="header-anchor" href="#infrastructure-improvements">#&lt;/a>
&lt;/h3>&lt;ul>
&lt;li>Created a
&lt;a href="http://status.2i2c.org" target="_blank" rel="noopener" >status page for all 2i2c community hubs&lt;/a>, giving our team and communities visibility into the status of our infrastructure.&lt;/li>
&lt;li>Created an alert that triggers when two servers fail to start consecutively in a 30-minute time window.&lt;/li>
&lt;li>Improved deployment infrastructure so that we can roll out sub-chart upgrades to individual clusters, allowing us to roll out major changes in batches.&lt;/li>
&lt;li>Removed our &amp;ldquo;configurator&amp;rdquo; application from community hubs, because it was causing more confusion than it was resolving.&lt;/li>
&lt;li>Allowed servers to start even when users hit their storage quotas.&lt;/li>
&lt;li>Provided a number of upgrades to Kubernetes and the support services that we run alongside each community hub.&lt;/li>
&lt;/ul>
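&lt;p>To illustrate the server-start alert in the list above, here is a minimal Python sketch of the condition. It is not our production alert rule. It assumes that &amp;ldquo;consecutively&amp;rdquo; means two back-to-back failed start attempts no more than 30 minutes apart, and the function name and input format are hypothetical.&lt;/p>

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def should_alert(attempts):
    """Return True if two consecutive server starts failed within WINDOW.

    `attempts` is a list of (timestamp, succeeded) tuples in any order.
    """
    ordered = sorted(attempts, key=lambda attempt: attempt[0])
    # Compare each attempt with the one that immediately follows it.
    for (first_time, first_ok), (second_time, second_ok) in zip(ordered, ordered[1:]):
        both_failed = not first_ok and not second_ok
        if both_failed and WINDOW >= second_time - first_time:
            return True
    return False


start = datetime(2025, 12, 1, 9, 0)
attempts = [
    (start, False),
    (start + timedelta(minutes=10), False),
]
print(should_alert(attempts))  # True
```

&lt;p>A successful start between two failures, or failures more than 30 minutes apart, would not trigger the alert in this sketch.&lt;/p>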
&lt;h3 id="process-improvements">
Process improvements
&lt;a class="header-anchor" href="#process-improvements">#&lt;/a>
&lt;/h3>&lt;ul>
&lt;li>Made a team commitment to prioritize issues from
&lt;a href="https://2i2c.org/incident-reports" target="_blank" rel="noopener" >incident reports&lt;/a> and other stability-related problems.&lt;/li>
&lt;li>Defined incident
&lt;a href="https://infrastructure.2i2c.org/topic/monitoring-alerting/escalation-policies/" target="_blank" rel="noopener" >escalation policies&lt;/a> using the
&lt;a href="http://status.2i2c.org" target="_blank" rel="noopener" >status page&lt;/a> to calibrate the urgency of our response to the severity of incidents.&lt;/li>
&lt;li>Defined &amp;ldquo;on-call&amp;rdquo; procedures so our team knows when and how to be more responsive to outages.&lt;/li>
&lt;li>Time-boxed our alert response process to avoid accidentally falling down rabbit holes for non-urgent problems.&lt;/li>
&lt;li>Created a more reliable process for
&lt;a href="https://infrastructure.2i2c.org/topic/monitoring-alerting/escalation-policies/" target="_blank" rel="noopener" >responding to incidents&lt;/a> and writing
&lt;a href="https://2i2c.org/incident-reports" target="_blank" rel="noopener" >incident reports&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h2 id="looking-forward">
Looking forward
&lt;a class="header-anchor" href="#looking-forward">#&lt;/a>
&lt;/h2>&lt;p>After this push around infrastructure reliability, we&amp;rsquo;re significantly more confident in the stability and transparency of our community hub infrastructure. This will deliver better service for our member communities and free up more of our time to engage with them instead of fighting infrastructure fires.&lt;/p>
&lt;p>We will continue to improve our infrastructure, and have a better foundation to do so incrementally in the coming quarters. Here are a few things we&amp;rsquo;d still like to improve:&lt;/p>
&lt;ol>
&lt;li>We still need to improve how reliably we complete follow-up actions from incidents (e.g., writing incident reports). When a process doesn&amp;rsquo;t fit into planning &amp;amp; scoping ceremonies, we struggle to follow it consistently.&lt;/li>
&lt;li>We&amp;rsquo;d like to improve our testing framework for major upgrades across all hubs (e.g., Kubernetes version upgrades) to catch bugs before communities do.&lt;/li>
&lt;/ol>
&lt;h2 id="learn-more">
Learn more
&lt;a class="header-anchor" href="#learn-more">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>
&lt;a href="http://status.2i2c.org/" target="_blank" rel="noopener" >2i2c Status Page&lt;/a>&lt;/li>
&lt;li>
&lt;a href="https://infrastructure.2i2c.org/hub-deployment-guide/runbooks/on-call/" target="_blank" rel="noopener" >On-call procedures documentation&lt;/a>&lt;/li>
&lt;li>
&lt;a href="https://github.com/2i2c-org/infrastructure" target="_blank" rel="noopener" >Infrastructure repository&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Faster reporting of user home directory sizes</title><link>https://deploy-preview-614--2i2c-org.netlify.app/blog/faster-home-directory-reporting/</link><pubDate>Tue, 09 Dec 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-614--2i2c-org.netlify.app/blog/faster-home-directory-reporting/</guid><description>&lt;p>Storage quotas help users avoid running out of space unexpectedly and give administrators visibility into capacity planning. However, storage usage can change rapidly, so administrators need timely information to know whether they are close to hitting limits.&lt;/p>
&lt;p>We&amp;rsquo;ve improved how quickly hub administrators can see user home directory sizes across our JupyterHubs. This makes monitoring more responsive and adds quota limit visibility that wasn&amp;rsquo;t possible before.&lt;/p>
&lt;h2 id="using-jupyterhub-home-nfs-for-near-instant-disk-usage-metrics">
Using &lt;code>jupyterhub-home-nfs&lt;/code> for near-instant disk usage metrics
&lt;a class="header-anchor" href="#using-jupyterhub-home-nfs-for-near-instant-disk-usage-metrics">#&lt;/a>
&lt;/h2>&lt;p>Our existing storage monitoring tool,
&lt;a href="https://github.com/2i2c-org/prometheus-dirsize-exporter" target="_blank" rel="noopener" >&lt;code>prometheus-dirsize-exporter&lt;/code>&lt;/a>, deliberately runs slowly to avoid excessive disk I/O. This meant home directory metrics could be &lt;strong>hours out of date&lt;/strong> on systems with many users or large directories. There was also no way to report user quota limits at all.&lt;/p>
&lt;p>Our home directory storage is managed by
&lt;a href="https://github.com/2i2c-org/jupyterhub-home-nfs/" target="_blank" rel="noopener" >&lt;code>jupyterhub-home-nfs&lt;/code>&lt;/a>, which enforces per-user quotas. It could also expose usage and limit information as Prometheus metrics using data from the underlying filesystem quota system. Because this information is already tracked by the filesystem, it&amp;rsquo;s available immediately without scanning individual files.&lt;/p>
&lt;p>We made two key improvements:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Make disk usage reporting almost instantaneous&lt;/strong>. We made &lt;code>jupyterhub-home-nfs&lt;/code> export &lt;code>total_size_bytes&lt;/code> and &lt;code>hard_limit_bytes&lt;/code> metrics to Prometheus for near-instant reporting. We used the same metric names and namespace as &lt;code>prometheus-dirsize-exporter&lt;/code> for compatibility. See
&lt;a href="https://github.com/2i2c-org/jupyterhub-home-nfs/pull/76" target="_blank" rel="noopener" >&lt;i class='fa-brands fa-github'>&lt;/i> 2i2c-org/jupyterhub-home-nfs#76&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Allow this to be used upstream in JupyterHub Grafana Dashboards&lt;/strong> so that it can support both types of disk usage reporting. This means users of the upstream
&lt;a href="https://github.com/jupyterhub/grafana-dashboards" target="_blank" rel="noopener" >JupyterHub Grafana dashboards&lt;/a> get the same useful view of home directory usage, regardless of whether the metric comes from &lt;code>prometheus-dirsize-exporter&lt;/code> or &lt;code>jupyterhub-home-nfs&lt;/code>. See
&lt;a href="https://github.com/2i2c-org/prometheus-dirsize-exporter/pull/29" target="_blank" rel="noopener" >&lt;i class='fa-brands fa-github'>&lt;/i> 2i2c-org/prometheus-dirsize-exporter#29&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
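&lt;p>For readers unfamiliar with how such metrics reach Prometheus, here is a minimal Python sketch that renders usage and quota values in the Prometheus text exposition format. It is not the code from &lt;code>jupyterhub-home-nfs&lt;/code>. The &lt;code>dirsize&lt;/code> namespace prefix and the &lt;code>directory&lt;/code> label are assumptions made for illustration, and label values are not escaped.&lt;/p>

```python
def render_metrics(usage_by_directory, namespace="dirsize"):
    """Render usage and quota values in the Prometheus text exposition format.

    `usage_by_directory` maps a directory name to a
    (used_bytes, hard_limit_bytes) tuple.
    """
    lines = []
    for metric, index in (("total_size_bytes", 0), ("hard_limit_bytes", 1)):
        name = f"{namespace}_{metric}"
        lines.append(f"# TYPE {name} gauge")
        for directory, values in sorted(usage_by_directory.items()):
            # Doubled braces produce the literal braces around the label set.
            lines.append(f'{name}{{directory="{directory}"}} {values[index]}')
    return "\n".join(lines) + "\n"


print(render_metrics({"alice": (1024, 4096)}))
```

&lt;p>Because the values come from the filesystem quota system rather than a directory scan, producing output like this is cheap regardless of how many files a home directory contains.&lt;/p>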
&lt;p>These changes were
&lt;a href="https://github.com/2i2c-org/infrastructure/pull/7261" target="_blank" rel="noopener" >deployed across all our communities&lt;/a>, so administrators can now access current home directory information &lt;strong>within minutes&lt;/strong> regardless of directory size.&lt;/p>
&lt;figure id="figure-home-directory-usage-dashboard-showing-total-size-metrics-from-jupyterhub-home-nfs-and-other-data-from-prometheus-dirsize-exporter">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Home Directory Usage dashboard showing total size metrics from jupyterhub-home-nfs and other data from prometheus-dirsize-exporter" srcset="
/blog/faster-home-directory-reporting/featured_hu5e6047328de0a056370b6f6f7ca4f2f4_42503_ededa5ff37780d5501ea74e6e73f6926.webp 400w,
/blog/faster-home-directory-reporting/featured_hu5e6047328de0a056370b6f6f7ca4f2f4_42503_a995b186c4e39c1fd078545f235e8394.webp 760w,
/blog/faster-home-directory-reporting/featured_hu5e6047328de0a056370b6f6f7ca4f2f4_42503_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-614--2i2c-org.netlify.app/blog/faster-home-directory-reporting/featured_hu5e6047328de0a056370b6f6f7ca4f2f4_42503_ededa5ff37780d5501ea74e6e73f6926.webp"
width="760"
height="152"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Home Directory Usage dashboard showing total size metrics from jupyterhub-home-nfs and other data from prometheus-dirsize-exporter
&lt;/figcaption>&lt;/figure>
&lt;h2 id="try-it-out">
Try it out
&lt;a class="header-anchor" href="#try-it-out">#&lt;/a>
&lt;/h2>&lt;p>2i2c member organizations can try this out now. If you have access to your hub&amp;rsquo;s Grafana instance, you can see these new metrics in the &lt;em>Home Directory Usage&lt;/em> dashboard:&lt;/p>
&lt;ol>
&lt;li>Open your hub&amp;rsquo;s
&lt;a href="https://docs.2i2c.org/admin/monitoring/grafana-dashboards/" target="_blank" rel="noopener" >Grafana dashboard&lt;/a>.&lt;/li>
&lt;li>Go to &lt;code>Dashboards&lt;/code> -&amp;gt; &lt;code>JupyterHub Default Dashboards&lt;/code> -&amp;gt; &lt;code>Home Directory Usage&lt;/code>.&lt;/li>
&lt;li>Check the table for up-to-date &lt;em>total size&lt;/em> and &lt;em>quota limit&lt;/em> values.&lt;/li>
&lt;/ol>
&lt;p>For more details, see our
&lt;a href="https://docs.2i2c.org/admin/monitoring/disk-usage/" target="_blank" rel="noopener" >docs on filesystem and disk dashboards&lt;/a>.&lt;/p>
&lt;h2 id="coming-next">
Coming next
&lt;a class="header-anchor" href="#coming-next">#&lt;/a>
&lt;/h2>&lt;p>We&amp;rsquo;d like to build on this work to enable &lt;strong>alerting when individual users near their disk quotas&lt;/strong>. This will make it easier to track user disk usage reliably across a community. Progress is tracked in this issue:
&lt;a href="https://github.com/2i2c-org/infrastructure/issues/7166" target="_blank" rel="noopener" >&lt;i class='fa-brands fa-github'>&lt;/i> 2i2c-org/infrastructure#7166&lt;/a>&lt;/p>
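&lt;p>As a rough illustration of what such an alert could check, here is a Python sketch that flags users whose usage is at or above a fraction of their hard limit. This is not an implemented feature. The 90% threshold, the function name, and the input format are all assumptions.&lt;/p>

```python
def users_near_quota(usage, threshold=0.9):
    """Return sorted (user, fraction_used) pairs at or above `threshold`.

    `usage` maps a username to (total_size_bytes, hard_limit_bytes).
    Users without a positive hard limit are skipped.
    """
    flagged = []
    for user, (used, limit) in usage.items():
        if limit > 0 and used / limit >= threshold:
            flagged.append((user, round(used / limit, 2)))
    return sorted(flagged)


usage = {"alice": (95, 100), "bob": (10, 100), "carol": (50, 0)}
print(users_near_quota(usage))  # [('alice', 0.95)]
```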
&lt;h2 id="acknowledgements">
Acknowledgements
&lt;a class="header-anchor" href="#acknowledgements">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>This was a directed contribution supported by
&lt;a href="https://deploy-preview-614--2i2c-org.netlify.app/collaborators/nasa-veda/" >NASA VEDA&lt;/a> to enable more proactive monitoring and alerting for hub administrators.&lt;/li>
&lt;/ul></description></item><item><title>Adding User Group Insights to Cloud Cost Dashboards with Grafana</title><link>https://deploy-preview-614--2i2c-org.netlify.app/blog/cloud-cost-groups/</link><pubDate>Mon, 24 Nov 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-614--2i2c-org.netlify.app/blog/cloud-cost-groups/</guid><description>&lt;p>We are excited to announce that we have extended our cloud cost dashboards in Grafana to display costs filtered by user group! This new feature allows administrators to monitor and manage cloud expenses based on user group memberships in JupyterHub.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Group cloud cost dashboard showing cost breakdowns by user groups" srcset="
/blog/cloud-cost-groups/featured_hu34fd6e3a049030056ef3072c1a0427ac_131153_c2b7e8d83fe14bfbc24fc804e952e390.webp 400w,
/blog/cloud-cost-groups/featured_hu34fd6e3a049030056ef3072c1a0427ac_131153_694d12ebbde0c6b897972885357ca71d.webp 760w,
/blog/cloud-cost-groups/featured_hu34fd6e3a049030056ef3072c1a0427ac_131153_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-614--2i2c-org.netlify.app/blog/cloud-cost-groups/featured_hu34fd6e3a049030056ef3072c1a0427ac_131153_c2b7e8d83fe14bfbc24fc804e952e390.webp"
width="760"
height="388"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;div class="alert alert-">
&lt;div>
Available for dedicated AWS clusters only (and excluding CloudBank managed accounts). Other deployments on GCP will be supported in the future.
&lt;/div>
&lt;/div>
&lt;h2 id="learn-more">
Learn more
&lt;a class="header-anchor" href="#learn-more">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Take a look at the
&lt;a href="https://docs.2i2c.org/admin/monitoring/cost-users-groups/#group-cloud-costs" target="_blank" rel="noopener" >Community Hub Guide&lt;/a> to see what&amp;rsquo;s new&lt;/li>
&lt;li>Check out the documentation of the
&lt;a href="https://jupyterhub-cost-monitoring.readthedocs.io/en/latest/" target="_blank" rel="noopener" >2i2c-org/jupyterhub-cost-monitoring&lt;/a> project to see how it all works&lt;/li>
&lt;li>
&lt;a href="https://deploy-preview-614--2i2c-org.netlify.app/author/jenny-wong/" >Jenny&lt;/a> presented her work on the cost monitoring system at
&lt;a href="https://events.linuxfoundation.org/jupytercon/" target="_blank" rel="noopener" >JupyterCon 2025&lt;/a> earlier this month. Watch a
&lt;a href="https://youtu.be/M5x3bTgRzVs?si=P2c3Ngb8v7f4ks0I" target="_blank" rel="noopener" >video&lt;/a> or look at the
&lt;a href="https://docs.google.com/presentation/d/1N8V7dna1atpRmcbpgZ0-VL5cbOQfwYfXTstudT2ierY/edit?usp=sharing" target="_blank" rel="noopener" >slides&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h2 id="acknowledgements">
Acknowledgements
&lt;a class="header-anchor" href="#acknowledgements">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>
&lt;a href="https://github.com/sunu" target="_blank" rel="noopener" >Tarashish&lt;/a> @
&lt;a href="https://deploy-preview-614--2i2c-org.netlify.app/collaborators/devseed/" >Development Seed&lt;/a> for collaborating on this project with us.&lt;/li>
&lt;li>
&lt;a href="https://deploy-preview-614--2i2c-org.netlify.app/collaborators/nasa-veda/" >NASA VEDA&lt;/a> and the DSE Team at NASA MSFC ODSI for funding much of this work.&lt;/li>
&lt;li>
&lt;a href="https://github.com/kyle-lesinger" target="_blank" rel="noopener" >Kyle Lesinger&lt;/a> from the NASA MSFC Office of Data Science and Informatics for providing valuable feedback and bug reports during development.&lt;/li>
&lt;/ul></description></item><item><title>Announcing `jupyterhub-groups-exporter`: monitor usage based on JupyterHub group membership with Prometheus and Grafana</title><link>https://deploy-preview-614--2i2c-org.netlify.app/blog/jupyterhub-groups-exporter/</link><pubDate>Wed, 11 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-614--2i2c-org.netlify.app/blog/jupyterhub-groups-exporter/</guid><description>&lt;p>Managing user groups in JupyterHub can be a challenging task, especially in environments with dynamic user bases and complex group structures. This post describes how we can leverage the latest group management features in JupyterHub, along with Prometheus and Grafana, to monitor group-level resource usage effectively.&lt;/p>
&lt;blockquote>
&lt;p>⭐ &lt;strong>Members of 2i2c&amp;rsquo;s community network&lt;/strong> can use this feature in their hubs by
&lt;a href="https://docs.2i2c.org/admin/monitoring/cost-users" target="_blank" rel="noopener" >following our cost attribution documentation&lt;/a>.&lt;/p>
&lt;/blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./featured.png" alt="Grafana User Group Diagnostics Dashboard showing memory usage over time, with each line aggregating usage over a different user group." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="motivation">
Motivation
&lt;a class="header-anchor" href="#motivation">#&lt;/a>
&lt;/h2>&lt;p>Hub admins have a strong incentive to monitor usage and costs by user group
because it allows them to advocate for better funding and cost recovery models based on data-driven insights. Group-level resource monitoring can help them answer questions like:&lt;/p>
&lt;ul>
&lt;li>How many people participated in our workshop group?&lt;/li>
&lt;li>How much GPU compute is our power user group using?&lt;/li>
&lt;li>Is our resource usage cost-effective for X group persona or Y group persona?&lt;/li>
&lt;/ul>
&lt;p>Current methods and workarounds include:&lt;/p>
&lt;ul>
&lt;li>ring-fencing resources for specific user group personas, e.g. creating a separate hub for a workshop group or a separate Dask cluster for a power user group, which increases the admin burden of managing multiple hub instances&lt;/li>
&lt;li>writing custom scripts to aggregate the per-user metrics that are already available into groups, which can be time-consuming and error-prone&lt;/li>
&lt;/ul>
&lt;h2 id="jupyterhub-and-user-groups">
JupyterHub and user groups
&lt;a class="header-anchor" href="#jupyterhub-and-user-groups">#&lt;/a>
&lt;/h2>&lt;p>Recent key developments upstream in JupyterHub for group management, such as
&lt;a href="https://jupyterhub.readthedocs.io/en/latest/reference/authenticators.html#authenticator-managed-group-membership" target="_blank" rel="noopener" >Authenticator-managed group membership&lt;/a>, make this a timely opportunity to tackle the problem. For more technical details of these upstream contributions, see GitHub PRs
&lt;a href="https://github.com/jupyterhub/oauthenticator/pull/735" target="_blank" rel="noopener" >&lt;i class='fa-brands fa-github'>&lt;/i> jupyterhub/oauthenticator#735&lt;/a> and
&lt;a href="https://github.com/jupyterhub/oauthenticator/pull/498" target="_blank" rel="noopener" >&lt;i class='fa-brands fa-github'>&lt;/i> jupyterhub/oauthenticator#498&lt;/a>.&lt;/p>
&lt;p>Users can access JupyterHub using a variety of authentication methods. Authentication providers like GitHub have built-in user management features that allow admins to create and manage user groups. These groups can then be configured in JupyterHub to authorize access to the hub, as well as control access to certain hardware profiles.&lt;/p>
&lt;p>Following the key upstream contributions above, we can leverage
&lt;a href="https://jupyterhub.readthedocs.io/en/stable/reference/authenticators.html#authenticator-managed-group-membership" target="_blank" rel="noopener" >Authenticator-managed group membership&lt;/a> to automatically pass user group memberships from the authentication layer to JupyterHub itself. This allows us to capitalize on JupyterHub&amp;rsquo;s REST API to retrieve user group memberships from other
&lt;a href="https://jupyterhub.readthedocs.io/en/latest/reference/services.html" target="_blank" rel="noopener" >services&lt;/a>, such as exporting them as Prometheus metrics.&lt;/p>
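&lt;p>As a minimal sketch of that retrieval step, the Python below reads group models from the JupyterHub REST API and flattens them into group and username pairs. It is not the exporter&amp;rsquo;s actual code. The function names are hypothetical, &lt;code>hub_api_url&lt;/code> stands for your hub&amp;rsquo;s API base URL, and the token is assumed to have permission to read groups.&lt;/p>

```python
import json
import urllib.request


def fetch_groups(hub_api_url, token):
    """Fetch group models from the JupyterHub REST API (GET /groups)."""
    request = urllib.request.Request(
        f"{hub_api_url.rstrip('/')}/groups",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def memberships(groups):
    """Flatten group models into sorted (group name, username) pairs."""
    return sorted(
        (group["name"], username)
        for group in groups
        for username in group.get("users", [])
    )


# Example with group models shaped like the API response (no network call).
example_groups = [
    {"name": "workshop", "users": ["bob", "alice"]},
    {"name": "empty", "users": []},
]
print(memberships(example_groups))
```

&lt;p>Each resulting pair corresponds to one time series that a service could expose to Prometheus.&lt;/p>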
&lt;h2 id="exporting-user-group-memberships-to-prometheus">
Exporting user group memberships to Prometheus
&lt;a class="header-anchor" href="#exporting-user-group-memberships-to-prometheus">#&lt;/a>
&lt;/h2>&lt;p>The
&lt;a href="https://github.com/2i2c-org/jupyterhub-groups-exporter" target="_blank" rel="noopener" >&lt;code>jupyterhub-groups-exporter&lt;/code>&lt;/a> project provides a
&lt;a href="https://jupyterhub.readthedocs.io/en/latest/reference/services.html" target="_blank" rel="noopener" >service&lt;/a> that integrates with JupyterHub to export user group memberships as Prometheus metrics. This component is readily deployable as part of any JupyterHub instance, such as a standalone deployment or a Zero to JupyterHub deployment on Kubernetes.&lt;/p>
&lt;p>The exporter provides a
&lt;a href="https://prometheus.io/docs/concepts/metric_types/" target="_blank" rel="noopener" >Gauge metric&lt;/a> called &lt;code>jupyterhub_user_group_info&lt;/code>, which has the following labels:&lt;/p>
&lt;ul>
&lt;li>&lt;code>namespace&lt;/code> – the Kubernetes namespace where the JupyterHub is deployed&lt;/li>
&lt;li>&lt;code>usergroup&lt;/code> – the name of the user group&lt;/li>
&lt;li>&lt;code>username&lt;/code> – the unescaped username of the user&lt;/li>
&lt;li>&lt;code>username_escape&lt;/code> – the escaped username&lt;/li>
&lt;/ul>
&lt;p>Escaped usernames are useful because Kubernetes restricts the characters allowed in pod label values (this restriction does not apply to pod annotations). Storing both forms lets us join escaped usernames with their more human-readable unescaped counterparts.&lt;/p>
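&lt;p>To make the idea of escaping concrete, here is a minimal Python sketch of one common scheme: keep lowercase letters and digits, and replace every other character with an escape character followed by the hex value of each of its UTF-8 bytes. This is an illustration only. The exact scheme your hub uses may differ by version and configuration, and the function name is hypothetical.&lt;/p>

```python
import string

SAFE_CHARS = set(string.ascii_lowercase + string.digits)


def escape_username(username, escape_char="-"):
    """Replace every character outside SAFE_CHARS with escape_char + hex bytes."""
    escaped = []
    for char in username:
        if char in SAFE_CHARS:
            escaped.append(char)
        else:
            # Non-ASCII characters expand to one escape sequence per UTF-8 byte.
            escaped.extend(f"{escape_char}{byte:02x}" for byte in char.encode("utf-8"))
    return "".join(escaped)


print(escape_username("user@example.com"))  # user-40example-2ecom
```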
&lt;p>Exposing this metric as an endpoint for Prometheus to scrape allows us to query and join groups data with a range of usage metrics to gain powerful group-level insights. Here is an example PromQL query that retrieves the memory usage by user group:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-promql" data-lang="promql">&lt;span class="line">&lt;span class="cl">&lt;span class="k">sum&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nv">container_memory_working_set_bytes&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="nl">name&lt;/span>&lt;span class="o">!=&lt;/span>&lt;span class="p">&amp;#34;&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nl">pod&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">jupyter-.*&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nl">namespace&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">$hub_name&lt;/span>&lt;span class="p">&amp;#34;}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">on&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">namespace&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">pod&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">group_left&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">annotation_hub_jupyter_org_username&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">usergroup&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">group&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="nv">kube_pod_annotations&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="nl">namespace&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">$hub_name&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nl">annotation_hub_jupyter_org_username&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">.*&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nl">pod&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">jupyter-.*&lt;/span>&lt;span class="p">&amp;#34;}&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">by&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">pod&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">namespace&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">annotation_hub_jupyter_org_username&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="o">*&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">on&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">namespace&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">annotation_hub_jupyter_org_username&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">group_left&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">usergroup&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="k">group&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="kr">label_replace&lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">jupyterhub_user_group_info&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="nl">namespace&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">$hub_name&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nl">username&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">.*&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nl">usergroup&lt;/span>&lt;span class="o">=~&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">$user_group&lt;/span>&lt;span class="p">&amp;#34;},&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">annotation_hub_jupyter_org_username&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">$1&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">username&lt;/span>&lt;span class="p">&amp;#34;,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="s">(.+)&lt;/span>&lt;span class="p">&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w"> &lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">by&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">annotation_hub_jupyter_org_username&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">usergroup&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">namespace&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="w">&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="k">by&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="o">(&lt;/span>&lt;span class="nv">usergroup&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="w"> &lt;/span>&lt;span class="nv">namespace&lt;/span>&lt;span class="o">)&lt;/span>&lt;span class="w">
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>
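&lt;p>To make the join in the query above more concrete, here is a minimal Python sketch of the same idea. It assumes that the left-hand side of the query is a per-user memory metric and that the outer aggregation is a &lt;code>sum&lt;/code>. The sample data, the variable names and the &lt;code>memory_by_group&lt;/code> helper are hypothetical illustrations and are not part of JupyterHub or the dashboards.&lt;/p>

```python
from collections import defaultdict

# Hypothetical per-user memory samples in bytes, keyed by (namespace, username).
memory_by_user = {
    ("prod", "alice"): 2_000_000_000,
    ("prod", "bob"): 1_000_000_000,
    ("prod", "carol"): 500_000_000,
    ("prod", "dave"): 250_000_000,
}

# Hypothetical group memberships as (namespace, username, usergroup) tuples.
# The duplicate row stands in for repeated info series for the same membership.
memberships = [
    ("prod", "alice", "research"),
    ("prod", "bob", "research"),
    ("prod", "bob", "research"),
    ("prod", "carol", "teaching"),
]


def memory_by_group(memory_by_user, memberships):
    """Sum per-user memory into (usergroup, namespace) totals."""
    totals = defaultdict(int)
    # set() plays the role of group(...) by (...): duplicates collapse to one
    # row. In PromQL that row has the value 1, so multiplying by it keeps the
    # memory value and only attaches the usergroup label.
    for namespace, username, usergroup in set(memberships):
        usage = memory_by_user.get((namespace, username))
        if usage is None:
            continue  # no usage sample for this user, so nothing to add
        totals[(usergroup, namespace)] += usage
    return dict(totals)


result = memory_by_group(memory_by_user, memberships)
# research/prod totals 3_000_000_000 bytes, teaching/prod totals 500_000_000
print(result)
```

&lt;p>In this sketch, dave does not appear in any total because he has no group membership, which mirrors how PromQL vector matching drops series that have no match on the other side.&lt;/p>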
&lt;h2 id="visualizing-user-group-resource-usage-with-grafana">
Visualizing user group resource usage with Grafana
&lt;a class="header-anchor" href="#visualizing-user-group-resource-usage-with-grafana">#&lt;/a>
&lt;/h2>&lt;p>The PromQL query above is rather long and complex to construct! However, you can benefit from an
&lt;a href="https://github.com/jupyterhub/grafana-dashboards/pull/149" target="_blank" rel="noopener" >upstream contribution&lt;/a> to the
&lt;a href="https://github.com/jupyterhub/grafana-dashboards" target="_blank" rel="noopener" >jupyterhub/grafana-dashboards&lt;/a> project, where we have encapsulated the PromQL queries as Jsonnet code and represented them as Grafana Dashboard visualizations (using the
&lt;a href="https://grafana.github.io/grafonnet/index.html" target="_blank" rel="noopener" >Grafonnet&lt;/a> library). If you have a Kubernetes cluster running JupyterHub, try deploying these Grafana Dashboards and let us know what you think!&lt;/p>
&lt;p>Our particular PromQL query above is visualized in the Grafana Dashboard &lt;strong>User Groups Diagnostics&lt;/strong> under the &lt;strong>Memory Usage&lt;/strong> panel (see also the corresponding screenshot at the top of this post). This dashboard mirrors its counterpart, the &lt;strong>User Diagnostics&lt;/strong> dashboard, but with resource usage visualized at a &lt;em>per-group&lt;/em> level rather than a per-user level &amp;#x1f389;&lt;/p>
&lt;h2 id="future-work">
Future work
&lt;a class="header-anchor" href="#future-work">#&lt;/a>
&lt;/h2>&lt;p>We have laid the foundation for joining user group data to usage metrics with Prometheus by extracting memberships from JupyterHub&amp;rsquo;s database. This opens the way to extending observability systems with group-level reporting and monitoring.&lt;/p>
&lt;p>Future directions for this work include:&lt;/p>
&lt;ul>
&lt;li>visualising cloud cost by user group in Grafana&lt;/li>
&lt;li>developing more group-level reporting and monitoring dashboards&lt;/li>
&lt;li>introducing group-level resource quotas.&lt;/li>
&lt;/ul>
&lt;p>What do you think? How would you like to see JupyterHub&amp;rsquo;s group management features evolve? Have you tried deploying this yourself?
&lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSff-u-sWFuwO1-VTgk2Ir7f1nfUUlLevQk_Vkk_jnmcI1nJnw/viewform?usp=header" target="_blank" rel="noopener" >We welcome your feedback&lt;/a>, and you are welcome to open GitHub issues or contribute to the repositories mentioned in this post.&lt;/p>
&lt;h2 id="acknowledgements">
Acknowledgements
&lt;a class="header-anchor" href="#acknowledgements">#&lt;/a>
&lt;/h2>&lt;p>Thanks to the
&lt;a href="https://deploy-preview-614--2i2c-org.netlify.app/collaborators/jupyterhub/" >JupyterHub project&lt;/a> for their collaboration and review of this work.&lt;/p></description></item><item><title>Track and manage cloud costs using Grafana</title><link>https://deploy-preview-614--2i2c-org.netlify.app/blog/aws-cost-attribution/</link><pubDate>Fri, 15 Nov 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-614--2i2c-org.netlify.app/blog/aws-cost-attribution/</guid><description>&lt;p>
&lt;figure id="figure-grafana-dashboard-showing-cloud-costs-broken-down-by-compute-storage-and-other-components-for-the-openscapeshttpsopenscapesorg-hub">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Screenshot of a graph showing total daily costs per component." srcset="
/blog/aws-cost-attribution/featured_hu0a1ce7d8654f8efa8d798b6fefc5ebab_212463_55733394a3e42b9cab8734939a78d9bd.webp 400w,
/blog/aws-cost-attribution/featured_hu0a1ce7d8654f8efa8d798b6fefc5ebab_212463_025709b2a5b75f5862165f203ded6cd4.webp 760w,
/blog/aws-cost-attribution/featured_hu0a1ce7d8654f8efa8d798b6fefc5ebab_212463_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-614--2i2c-org.netlify.app/blog/aws-cost-attribution/featured_hu0a1ce7d8654f8efa8d798b6fefc5ebab_212463_55733394a3e42b9cab8734939a78d9bd.webp"
width="760"
height="485"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Grafana dashboard showing cloud costs broken down by compute, storage and other components for the
&lt;a href="https://openscapes.org/" target="_blank" rel="noopener" >Openscapes&lt;/a> hub.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>We are pleased to unveil a new feature to track cloud costs within our Grafana dashboards! Community Champions can now monitor the cost and usage of their 2i2c-managed hubs through a dashboard that displays up-to-date aggregated costs as well as detailed breakdowns for more granular reports.&lt;/p>
&lt;div class="alert alert-note">
&lt;div>
Note that this feature is currently available to AWS-hosted hubs only and will be rolled out to other cloud providers in the future.
&lt;/div>
&lt;/div>
&lt;h2 id="accessing-the-cloud-cost-dashboard">
Accessing the cloud cost dashboard
&lt;a class="header-anchor" href="#accessing-the-cloud-cost-dashboard">#&lt;/a>
&lt;/h2>&lt;p>Community Champions can view the Cloud Cost dashboard from their Grafana instance (please see the
&lt;a href="https://docs.2i2c.org/admin/monitoring/grafana-dashboards#getting-a-grafana-account" target="_blank" rel="noopener" >Service Guide&lt;/a> for how to gain access).&lt;/p>
&lt;p>From the main menu of Grafana, navigate to &lt;em>Dashboards &amp;gt; Cloud cost dashboards &amp;gt; Cloud cost attribution&lt;/em> to view the dashboard.&lt;/p>
&lt;h2 id="understanding-the-cloud-cost-dashboard">
Understanding the cloud cost dashboard
&lt;a class="header-anchor" href="#understanding-the-cloud-cost-dashboard">#&lt;/a>
&lt;/h2>&lt;p>A typical 2i2c-managed deployment comprises a staging hub and a production hub, although some communities may have extra hubs such as a workshop hub. By default, costs are not broken down on a per-hub basis unless the community has opted in to this feature.&lt;/p>
&lt;p>The dashboard is made up of several panels:&lt;/p>
&lt;ul>
&lt;li>Daily costs&lt;/li>
&lt;li>Daily costs per hub (opt-in only)&lt;/li>
&lt;li>Total daily costs per component&lt;/li>
&lt;li>Daily costs per component per hub (opt-in only).&lt;/li>
&lt;/ul>
&lt;video muted autoplay loop >
&lt;source src="https://deploy-preview-614--2i2c-org.netlify.app/blog/aws-cost-attribution/demo.mp4" type="video/mp4">
&lt;/video>
&lt;p>For more detailed information on the data that each panel displays, please consult our
&lt;a href="https://docs.2i2c.org/admin/monitoring/cost-users#understanding-the-cloud-cost-dashboard" target="_blank" rel="noopener" >Service Guide&lt;/a>.&lt;/p>
&lt;h2 id="sharing-cost-reports">
Sharing cost reports
&lt;a class="header-anchor" href="#sharing-cost-reports">#&lt;/a>
&lt;/h2>&lt;p>The dashboard can be shared with other community members and stakeholders so they can understand usage and cost patterns. Community Champions can export data to a CSV file, or they can generate a snapshot of the Grafana dashboard and share a public link.&lt;/p>
&lt;p>For instructions on how to export data from the dashboard, please see our
&lt;a href="https://docs.2i2c.org/admin/monitoring/cost-users#sharing-cost-reports" target="_blank" rel="noopener" >Service Guide&lt;/a>.&lt;/p>
&lt;h2 id="next-steps">
Next steps
&lt;a class="header-anchor" href="#next-steps">#&lt;/a>
&lt;/h2>&lt;p>We would love to know whether this feature is useful and how it can be improved. We will be contacting individual communities to ask for feedback, so please share your thoughts with us!&lt;/p>
&lt;p>We will work on rolling out this service to GCP-hosted clusters in the future. Stay tuned to find out when this feature is available to your community.&lt;/p>
&lt;h2 id="acknowledgements">
Acknowledgements
&lt;a class="header-anchor" href="#acknowledgements">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Thank you to Erik for spearheading the rollout effort and to the rest of the 2i2c team for their support.&lt;/li>
&lt;li>Thanks to the
&lt;a href="https://deploy-preview-614--2i2c-org.netlify.app/collaborators/openscapes/" >Openscapes&lt;/a> and
&lt;a href="https://deploy-preview-614--2i2c-org.netlify.app/collaborators/cryocloud/" >Cryocloud&lt;/a> communities for providing valuable insights during the prototyping and testing phase, and for funding part of this work.&lt;/li>
&lt;/ul></description></item></channel></rss>