Sarah Gibson | 2i2c

Enforcing per-user storage quotas now available on GCP

Tue, 25 Feb 2025 14:18:04 +0000

Building upon our previous work developing per-user storage quotas for our AWS infrastructure, we are pleased to announce that this feature is now available for GCP-hosted hubs!

To provide this feature on GCP, we have updated our infrastructure provisioning system to create persistent disks and to enable automatic backups of those disks for disaster recovery purposes. However, the systems we had already developed for AWS, such as jupyterhub-home-nfs and our alerting system through Prometheus Alertmanager, are vendor agnostic and work right out of the box with the new architecture!

If you would like to try this feature on your 2i2c-managed JupyterHub, please get in touch.

Acknowledgements

This project was developed and deployed in collaboration with Tarashish Mishra from Development Seed, funded through the NASA VEDA project.

Announcing backups for GCP-hosted hubs!

Fri, 07 Feb 2025 13:08:22 +0000

2i2c are pleased to announce the development and deployment of automated backups of home directories on GCP-hosted hubs!

We have developed the gcp-filestore-backups project that regularly creates backups of JupyterHub home directories for disaster recovery purposes. The project is a Python wrapper around the gcloud tool to regularly request backups be made of the Filestore hosting JupyterHub’s user home directories, by default on a daily basis. The script also manages retention of these backups by checking how recently the last backup was made, and the age of existing backups, by default deleting any backup older than 5 days.
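The retention behaviour described above can be illustrated with a minimal sketch. This is not the gcp-filestore-backups source code: the function names are hypothetical, and the sketch only models the two decisions (is a new backup due, and which existing backups have expired), assuming the daily interval and 5-day default retention mentioned above. It makes no calls to gcloud.

```python
from datetime import datetime, timedelta, timezone


def backup_is_due(backup_times, now, interval=timedelta(days=1)):
    """Return True if no backup exists or the newest one is at least `interval` old."""
    if not backup_times:
        return True
    return now - max(backup_times) >= interval


def expired_backups(backup_times, now, retention=timedelta(days=5)):
    """Return the backup timestamps older than the retention period, oldest first."""
    return sorted(t for t in backup_times if now - t > retention)


# Illustrative data only: backups taken 0.5, 1.5, 4, 6 and 9 days ago.
now = datetime(2025, 2, 7, 12, 0, tzinfo=timezone.utc)
backups = [now - timedelta(days=d) for d in (0.5, 1.5, 4, 6, 9)]

print(backup_is_due(backups, now))    # the newest backup is only 12 hours old
print(expired_backups(backups, now))  # the 9-day-old and 6-day-old backups
```

Keeping the two decisions as pure functions of timestamps makes the policy easy to test separately from whatever tool actually creates and deletes the backups.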

Having these backups enabled means that, in the unlikely and unfortunate case of data loss or corruption, we can reinstate the home directories of the hub to a relatively recent state that is at a maximum of 1 day prior to the incident.

We have deployed gcp-filestore-backups to all our GCP hubs presently running, with a retention period of 2 days. If you would like to discuss this further with us, please get in touch!

As ever, this project has been developed openly in line with our Right to Replicate so you can deploy it against your own infrastructure!

Enforcing per-user storage quotas with `jupyterhub-home-nfs`

Tue, 28 Jan 2025 09:57:28 +0000

When sharing a storage disk between users, as is usually the case in a JupyterHub deployment, it is important to put guardrails in place so that one user cannot consume the whole storage capacity at the expense of the other users. To this end, 2i2c, in close collaboration with Development Seed, have developed the jupyterhub-home-nfs project, a Helm chart that enforces per-user quotas on storage space.

Note that this feature is currently available to AWS-hosted hubs only and will be rolled out to other cloud providers in the future.

Under the hood, the Helm chart runs NFS Ganesha as an in-cluster NFS server, backed by XFS as the underlying filesystem. Storage quota is enforced through XFS’s native quota management utility xfs_quota.
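As a rough illustration of what enforcing a limit with xfs_quota looks like, the sketch below builds (but does not run) an xfs_quota command line. It assumes that quotas are applied as XFS project quotas and that the project has already been set up. The project name, limit, and mount point are hypothetical, and the actual invocation used inside jupyterhub-home-nfs may differ.

```python
import shlex


def xfs_quota_limit_command(project, hard_limit, mount_point):
    """Build an xfs_quota invocation that sets a hard block limit for one project.

    `-x` enables expert (administrative) commands and `-c` passes the
    subcommand to run. `limit -p bhard=...` sets a hard block limit on a
    project quota.
    """
    subcommand = f"limit -p bhard={hard_limit} {project}"
    return ["xfs_quota", "-x", "-c", subcommand, mount_point]


# Hypothetical values: a 10 GiB limit for one user's project on /export.
cmd = xfs_quota_limit_command("alice", "10g", "/export")
print(shlex.join(cmd))
```

Returning the command as a list of arguments, instead of a single shell string, means it can be passed to `subprocess.run` without shell quoting problems.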

Since this feature moves our infrastructure away from managed filesystems (such as AWS’s Elastic File System) that cannot support per-user storage quotas, we have also developed monitoring and alerting mechanisms that let us know when the disks are getting full, as well as automated backups for disaster recovery.

If you would like to try this on your 2i2c-managed hub, please get in touch.

This project can also be used with any Kubernetes-based JupyterHub, as per our Right to Replicate policy, so please try it out on your own deployment and let us know what you think!

Acknowledgements

This project was developed and deployed in collaboration with Tarashish Mishra from Development Seed, funded through the NASA VEDA project.

Tech update: Multiple JupyterHubs, multiple clusters, one repository.

Tue, 19 Apr 2022 00:00:00 +0000

2i2c manages the configuration and deployment of multiple Kubernetes clusters and JupyterHubs from a single open infrastructure repository. This is a challenging problem, as it requires us to centralize information about a number of independent cloud services, and deploy them in an efficient and reliable manner. Our initial attempt at this had a number of inefficiencies, and we recently completed an overhaul of its configuration and deployment infrastructure.

This post is a short description of what we did and the benefit that it had. It covers the technical details and provides links to more information about our deployment setup. We hope that it helps other organizations make similar improvements to their own infrastructure.

Our problem

2i2c’s problem is similar to that of many large organizations that have independent sub-communities within them. We must centralize the operation and configuration of JupyterHubs in order to boost our efficiency in developing and operating them, but must also treat these hubs independently because their user communities are not necessarily related, and because we want communities to be able to replicate their infrastructure on their own.

A year ago, we built the first version of our deployment infrastructure at github.com/2i2c-org/infrastructure. Over the last year of operation, we identified a number of major shortcomings:

- Within a Kubernetes cluster, we deployed hubs sequentially, not in parallel. This grew out of a common practice of Canary deployments that allowed us to test changes on a staging hub before rolling them out to a production hub.
- We used a single configuration file for all hubs within a cluster, which led to confusion and difficulty in identifying a hub-specific configuration.
- Moreover, any change to a hub within a cluster caused a re-deploy of all hubs on that cluster. This is because we did not know whether a given change touched cluster-wide configuration or hub-specific configuration.

Our goal

So, we spent several weeks discussing a plan to resolve these major problems - here were our goals:

- We should be able to upgrade a specific hub alone, by inspecting which configuration files have been added or modified.
- Production hubs should be upgraded in parallel when they are effectively run independently.
- We should use staging hubs as “canary” deployments and not continue upgrading production hubs if the staging hub fails.

An overview of our changes

To accomplish this, we needed to identify which hub required an upgrade based on file additions/modifications. This took a lot of discussion and iteration on design, and so we share it below in the hopes that it is helpful to others!

Improvements to our code and structure

We made a few major changes to the infrastructure repository to facilitate the deployment logic described above. Here are the major changes we implemented:

- We separated each hub’s configuration into its own file, or set of files. For example, here is 2i2c’s staging hub configuration.
- We created a separate cluster.yaml file that holds the canonical list of hubs deployed to that cluster and the configuration file(s) associated with each one. For example, here is 2i2c’s GKE cluster configuration, which contains a reference to the previously mentioned staging hub.
- We updated our deployer module to do the following things:
  - Inspect the list of files modified in a Pull Request.
  - From this list, calculate the name of a hub that required an upgrade, and the name of its respective cluster.
  - Trigger a GitHub Actions workflow that deploys changes in parallel for each cluster/hub pair.
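To make the mapping from modified files to cluster/hub pairs more concrete, here is a minimal sketch. It is not the deployer module's code: the `config/clusters/<cluster>/` directory layout, the `CLUSTERS` dictionary (standing in for the parsed contents of each cluster.yaml file), and the function name are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical stand-in for the parsed cluster.yaml files:
# cluster name -> hub name -> configuration files used by that hub.
CLUSTERS = {
    "cluster-a": {
        "staging": ["common.values.yaml", "staging.values.yaml"],
        "prod": ["common.values.yaml", "prod.values.yaml"],
    },
    "cluster-b": {"prod": ["prod.values.yaml"]},
}


def hubs_to_upgrade(changed_files, clusters=CLUSTERS, root="config/clusters"):
    """Return the sorted (cluster, hub) pairs touched by the changed files."""
    pairs = set()
    for changed in changed_files:
        try:
            cluster_name, *rest = PurePosixPath(changed).relative_to(root).parts
        except ValueError:
            continue  # the file does not live under the cluster config directory
        hubs = clusters.get(cluster_name, {})
        if rest == ["cluster.yaml"]:
            # A change to cluster-wide configuration affects every hub on it.
            pairs.update((cluster_name, hub) for hub in hubs)
            continue
        for hub, values_files in hubs.items():
            if "/".join(rest) in values_files:
                pairs.add((cluster_name, hub))
    return sorted(pairs)


changed = [
    "config/clusters/cluster-a/prod.values.yaml",
    "docs/index.md",
    "config/clusters/cluster-b/cluster.yaml",
]
print(hubs_to_upgrade(changed))
```

In this sketch a change to a file shared by several hubs marks all of those hubs for upgrade, while files outside the configuration directory trigger nothing.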

In addition to these structural and code changes, we also developed new GitHub Actions workflows that control the entire process.

A GitHub Actions workflow for upgrading our JupyterHubs

We defined a new GitHub Actions workflow that carries out the logic described above. It is defined in this deploy-hubs.yaml configuration file. Here are the major jobs in this workflow, and what each does:

generate-jobs: Generate a list of clusters/hubs that must be upgraded, given the files that are changed in a Pull Request.
- Evaluate an input list of added/modified files in a PR
- Decide if the added/modified files warrant an upgrade of a hub
- Generate a list of hubs and clusters that require upgrades, and some extra details:
  - Does the support chart that is deployed to the cluster also need an upgrade?
  - Does a staging hub on this cluster require an upgrade?
This produces two outputs to be used in subsequent steps:
- A human-readable table including information on why a given deployment requires an upgrade (using the excellent Rich library).
- JSON outputs that can be interpreted by GitHub Actions as sets of matrix jobs to run.
Our staging and support hub job matrix tells GitHub Actions to deploy staging and support upgrades that act as canaries and stop production deploys if they fail.
upgrade-support-and-staging: Update the support and staging Helm charts on each cluster. These are “shared infrastructure” Helm charts that control services that are shared across all hubs.
- Accepts the JSON list described above to determine what to do next
- Parallelises over clusters
- Upgrades the support chart of each if required
- Upgrades a staging hub for the cluster if required (for canary deployments, this is always required if at least one production hub is to be upgraded on the cluster)
filter-generate-jobs: Allows us to treat the support / staging hubs as canary deployments for all the production hubs on a cluster.
- If a staging/support hub deploy fails, removes any jobs for the corresponding cluster.
- Allows production deploys to continue on other clusters.
Our production hub job matrix tells GitHub Actions which hubs to update with new changes. These are triggered if a cluster’s staging/support job does not fail.
upgrade-prod-hubs: Deploy updates to each production hub.
- Accepts the JSON list described above to determine what to do next
- Parallelises over each production hub that requires an upgrade
- Deploys the relevant changes to that hub
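The canary filtering step and the JSON handed to GitHub Actions can be sketched as follows. This is an illustration only, not the code of the filter-generate-jobs job: the field names (`cluster_name`, `hub_name`), the result strings, and the function name are assumptions. The `include` key is the standard way to list explicit job combinations in a GitHub Actions matrix, which a workflow can read from a job output with `fromJSON`.

```python
import json


def filter_prod_jobs(prod_jobs, canary_results):
    """Keep production jobs only for clusters whose canary deploy succeeded.

    Clusters with no recorded canary result are treated as not having
    succeeded, so their production jobs are dropped too.
    """
    return [
        job for job in prod_jobs
        if canary_results.get(job["cluster_name"]) == "success"
    ]


# Hypothetical production jobs generated earlier in the workflow.
prod_jobs = [
    {"cluster_name": "cluster-a", "hub_name": "prod-1"},
    {"cluster_name": "cluster-a", "hub_name": "prod-2"},
    {"cluster_name": "cluster-b", "hub_name": "prod"},
]
# Hypothetical outcome of the support/staging (canary) deploys per cluster.
canary_results = {"cluster-a": "failure", "cluster-b": "success"}

remaining = filter_prod_jobs(prod_jobs, canary_results)
matrix_json = json.dumps({"include": remaining})
print(matrix_json)
```

Here the failed canary on `cluster-a` removes both of its production jobs, while the production deploy on `cluster-b` still goes ahead.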

Concluding Remarks

We think that this is a nice balance of infrastructure complexity and flexibility. It allows us to separate the configuration of each hub and cluster, which makes each more maintainable by us, and is more aligned with a community’s Right to Replicate their infrastructure. It allows us to remove the interdependence of deploy jobs that do not need to be dependent, which makes our deploys more efficient. Finally, it allows us to make targeted deploys more effectively, which reduces the amount of toil and unnecessary waiting associated with each change. (It also reduces our carbon footprint by reducing unnecessary GitHub Actions time.)

We hope that this is a useful resource for others to follow if they also maintain JupyterHubs for multiple communities. If you have any ideas of how we could further improve this infrastructure, please reach out on GitHub! If you know of a community that would like 2i2c to manage a hub for them, please send us an email.

Acknowledgements: The infrastructure described in this post was developed by the 2i2c engineering team, and this post was edited by Chris Holdgraf.

Pangeo Cloud goes live on 2i2c!

Tue, 16 Nov 2021 00:00:00 +0000

Pangeo Cloud is an experimental service providing public cloud-based data-science environments for data-intensive geoscience research. We have recently finished re-creating the Pangeo community JupyterHub hosted on GCP in the 2i2c-org/infrastructure repository. This is a huge milestone in our partnership with Pangeo to provide expertise and operations of cloud-based, vendor-agnostic Jupyter infrastructure and workflows.

For users of Pangeo Cloud, the switch should have been a smooth one! The new hub should behave nearly identically to the old one, and will be managed by 2i2c engineers moving forward, in partnership with the Pangeo community. It will be available at the same URL (us-central1-b.gcp.pangeo.io), and there’s no need to worry about your home directories: they were synced to the new hub only a few days before the migration took place. Development and operations on this hub will all be done in the open and we invite participation and feedback from others in our infrastructure work. Please see this Discourse thread as an initial place to provide feedback.

On 22nd November 2021, the old Pangeo GCP JupyterHub will be shut down, and the project will move forward on the new 2i2c Pangeo Hub. From then on, we plan to collaborate in order to find new pathways for development in the Jupyter ecosystem - we will share more ideas of things we will work on soon!

History of Pangeo Cloud Hubs

Pangeo has pioneered a new model in using open source and cloud-agnostic infrastructure to support scientific research in the cloud.

The first Pangeo cloud JupyterHub (pangeo.pydata.org; now defunct) was deployed for the 2017 American Meteorological Society Meeting; since then, the Pangeo community has iterated through several different versions of prototype cloud-based hubs. This allowed for many new workflows that enabled a more open and collaborative pathway to doing world class research, and included access to datasets and computational resources that were previously unattainable. Pangeo achieved this by working in partnership with open source communities and building technology that leveraged modular open source components for their platform.

In the last several years, Pangeo have built a thriving community of practice around this infrastructure. However, as the community has grown, so has the need for more reliable and dedicated operational and developmental support, since parts of the Pangeo stack require dedicated expertise and attention to manage. Modern scalable cloud infrastructure is one example of this. Maintaining a complex JupyterHub with many users is a difficult task, and has required significant resources from the Pangeo Project up to this point.

The Pangeo-2i2c Partnership

2i2c is a non-profit team that develops and operates cloud infrastructure for interactive computing workflows. We have extensive experience in Jupyter workflows in the cloud and a long history of contributions to projects in this ecosystem. We have built a cloud deployment management system that allows us to centralise and configure the deployment of many independent JupyterHubs, empowering communities to leverage the same infrastructure (and team!) for JupyterHubs running in the cloud.

Similarly to Pangeo, all of 2i2c’s core infrastructure is cloud- and vendor-agnostic, and follows a model of building open source tools and giving back to those communities. Our partnership with Pangeo began through 2i2c’s core competency in these areas and the similarity between the two projects’ technical stacks.

We’ve begun a partnership whereby 2i2c will manage Pangeo’s cloud infrastructure and lead efforts to develop new features, in partnership with open source communities. We sketched out a few ideas to focus on in this kick-off thread on Discourse. This approach allows each community to focus on its core strengths: Pangeo will continue to grow an open community and scientific software ecosystem around geospatial analytics, and 2i2c will oversee the development and operations of the core cloud infrastructure stack that powers Pangeo’s workflows. In some areas we are still experimenting with different collaboration models to ensure that the needs of the Pangeo community are met in a way that is also sustainable for 2i2c. Over the coming weeks, you may see some conversations (and threads for feedback!) about different support and operations models that work best for the community. We are excited to use this as an opportunity to learn more about how to serve more complex and diverse communities like Pangeo.

Acknowledgements

We are extremely grateful to the Pangeo project for giving us the opportunity to serve their community, as well as the Moore Foundation for funding this work. We look forward to a long partnership ahead! 🚀

Sarah Gibson

Sarah Gibson was an Open Source Infrastructure Engineer at 2i2c. She is an open source contributor and advocate. She has more than two years of experience as a Research Engineer at a national institute for data science and artificial intelligence, and holds a core contributor role in the open source projects Binder, JupyterHub, and The Turing Way. Sarah is passionate about working with domain experts to leverage cloud computing in order to accelerate cutting-edge, data-intensive research and disseminating the results in an open, reproducible and reusable manner.

Sarah holds a Fellowship with the Software Sustainability Institute and advocates for best software practices in research. She is a member of the mybinder.org operating team and maintains infrastructure supporting over 150k launches of reproducible computational environments per week. She has also mentored projects through two cohorts of the Open Life Science programme, sharing her experience of participating in and leading open science projects.