Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments

Overview

CC-Snapshot is a tool on the Chameleon testbed that enables users to “capture” their customized environments and package them as complex images or appliances. By allowing researchers to share these environments easily, CC-Snapshot offers a powerful mechanism for reproducibility, ensuring that experiments can be replicated and extended by others.

In this project, you will review existing CC-Snapshot workflows, research the latest snapshotting technologies, and develop enhancements that improve the tool’s usability and reliability. This includes ensuring snapshots are created consistently (even when the OS is actively running), preserving the integrity of user systems, and exploring advanced features such as out-of-band snapshotting and API-based triggers.

Key Outcomes

  • Improved Snapshot Consistency: New methods to capture the full state of a disk without risking corruption or data inconsistency.
  • Enhanced Reproducibility: A refined workflow that allows researchers to reliably share custom environments, facilitating collaborative and repeatable experiments.
  • User-Friendly Tooling: Streamlined processes that reduce disruption to running systems, so that installing dependencies or rebooting into special environments is less burdensome.
  • Exploratory Features (Stretch Goals): Advanced mechanisms to stream disk data in real time during snapshotting and to initiate snapshots via an API call (for parity with VM snapshots).

Topics: Cloud Computing, Systems & Infrastructure, Reproducibility, Operating System Internals

Skills

  • Linux / OS Concepts: Familiarity with disk partitioning, filesystems, logical volume management (e.g., LVM), ramdisks, kexec, etc.
  • Cloud Tools: Experience with OpenStack or Ironic can be beneficial.
  • Systems Programming / Scripting: Proficiency in Python or Bash for automation and tooling.
  • DevOps / CI: Understanding of in-band vs. out-of-band workflows, as well as best practices for building reproducible environments.

Difficulty: Moderate to Hard

Size: Medium to Large

Mentors

Mark Powers (markpowers@uchicago.edu)

Mark Powers is a research software engineer at the University of Chicago and the DevOps lead for the Chameleon Cloud testbed. His research interests focus on cloud and edge computing, system design, and reproducibility. Since 2021, he has assisted in mentoring several student interns each summer.

Mike Sherman (shermanm@uchicago.edu)

Michael Sherman is the Infrastructure Lead for Chameleon Cloud. Mike’s research interests focus on the reliability of large systems at all levels – computing, networking, and human interaction. His recent work centers around tools to enable the reconfigurability of networks, bare-metal servers, and edge devices. Prior to this, Mike helped develop the ORBIT and COSMOS testbeds, working on reproducible experiments in wireless and edge computing. Each summer since 2015, he has co-mentored interns ranging from high-school to graduate students, roughly 40 students in total. Michael will mentor students working on the CHI@Edge project.

Potential Tasks & Deliverables

  • Ensure Snapshot Consistency
      • Reboot into a ramdisk and copy the offline disk.
      • Use kexec to switch to/from a ramdisk environment without a full reboot.
      • Change images to use a snapshot-capable storage layer (e.g., LVM) for safer live snapshots.
      • Investigate additional methods (e.g., blog.benjojo.co.uk) for safely imaging live disks.
  • Prevent System Modifications During Snapshot
      • Currently, CC-Snapshot installs dependencies (e.g., qemu-img) on the running system, affecting its state.
      • In-Band Fix: Download statically linked tools and run them from a temporary directory, avoiding system-level changes.
      • Out-of-Band Approach: Snapshots done via ramdisk or kexec do not require altering the running system.
  • API-Triggered Snapshots
      • Extend or integrate with the Nova “snapshot instance” API to support the same workflow for bare metal.
      • Leverage Ironic’s new “service steps” feature for an automated snapshot pipeline.
  • (Stretch Goal) Streaming Snapshots
      • Modify the workflow to stream data directly to storage, rather than making a full local copy first.
      • Explore incremental or differential snapshot techniques to reduce bandwidth usage and storage overhead.
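
To make the streaming and differential ideas in the stretch goal more concrete, here is a minimal Python sketch. It is not how CC-Snapshot works today. It assumes the source is already quiescent (for example, an offline disk read from a ramdisk, or an LVM snapshot volume), and that "storage" is any writable binary stream. The names stream_snapshot, iter_chunks, and CHUNK_SIZE, the SHA-256 per-chunk manifest, and the simple index/length framing are all hypothetical choices made for illustration.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterator, List, Optional, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical default: 4 MiB per chunk


def iter_chunks(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (chunk_index, data) pairs until the source is exhausted."""
    index = 0
    while True:
        data = source.read(chunk_size)
        if not data:
            break
        yield index, data
        index += 1


def stream_snapshot(
    source: BinaryIO,
    sink: BinaryIO,
    previous: Optional[Dict[int, str]] = None,
    chunk_size: int = CHUNK_SIZE,
) -> Tuple[Dict[int, str], List[int]]:
    """Stream chunks that differ from a previous manifest directly to the sink.

    Only one chunk is held in memory at a time, so no full local copy is made.
    Returns (manifest, sent): the manifest maps every chunk index to its
    SHA-256 hex digest, and sent lists the indices actually written.
    """
    previous = previous or {}
    manifest: Dict[int, str] = {}
    sent: List[int] = []
    for index, data in iter_chunks(source, chunk_size):
        digest = hashlib.sha256(data).hexdigest()
        manifest[index] = digest
        if previous.get(index) != digest:
            # Illustrative framing: 8-byte index, 4-byte length, then the data.
            sink.write(index.to_bytes(8, "big"))
            sink.write(len(data).to_bytes(4, "big"))
            sink.write(data)
            sent.append(index)
    return manifest, sent


if __name__ == "__main__":
    # In-memory stand-ins for a disk and a remote object; tiny chunks for demo.
    base = io.BytesIO(b"A" * 8 + b"B" * 8 + b"C" * 4)
    manifest, sent = stream_snapshot(base, io.BytesIO(), chunk_size=8)
    print("full snapshot sent chunks:", sent)

    changed = io.BytesIO(b"A" * 8 + b"X" * 8 + b"C" * 4)
    _, sent = stream_snapshot(changed, io.BytesIO(), previous=manifest, chunk_size=8)
    print("differential snapshot sent chunks:", sent)
```

In a real workflow the source would likely be a block device opened in binary mode and the sink an upload stream to image storage. A per-chunk manifest still requires reading the whole disk, so it reduces bandwidth and storage rather than read time; change tracking at the block layer would be needed to avoid the full read.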