Clipper Changelog


Release 2024.05 - May 17, 2024

We are happy to announce the release of Clipper 2024.05, a major enhancement and rebuild of the Clipper HPC environment.

Beginning with this release, we will refer to Clipper releases with date-based versioning, updated twice yearly (e.g. 2024.05 and 2024.12). The .05 releases are targeted for large upgrades and changes, while the .12 releases will be for minor patching and updates. This schedule is subject to change based on current needs.

BREAKING CHANGES

  • When logging into clipper.gvsu.edu, your SSH client may notify you about a changed host key. This is expected as the login node has changed.
  • Home folders have moved from /home to /mnt/home. This should not affect references to your home directory using ~/. If your scripts use direct /home/<username> references, you must update those paths.
  • Slurm CPU core, memory, and GPU allocations are now enforced. You must explicitly request the resources you wish to use in your job (see the example batch script after this list). Each core requested is allocated 4 GB of memory by default. Each GPU requested is allocated 4 CPU cores by default.
  • Many modules have new names. Existing module load references may need to be updated.
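
For reference, the batch script below is a minimal sketch of explicit resource requests under the new enforcement. The job name, module name, directory, and command are placeholders, not values specific to Clipper; check the new module names with module spider before loading.

    #!/bin/bash
    #SBATCH --job-name=example           # placeholder job name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4            # request CPU cores explicitly
    #SBATCH --mem=16G                    # request memory explicitly (the default is 4 GB per requested core)
    #SBATCH --gres=gpu:1                 # request one GPU (each GPU also brings 4 CPU cores by default)
    #SBATCH --time=01:00:00

    # Module names changed in this release; verify the new name with "module spider" first.
    module load python                   # placeholder module name

    # Use ~/ or $HOME rather than hard-coded /home/<username> paths, which must be updated.
    cd "$HOME/myproject"                 # placeholder project directory
    srun python my_script.py             # placeholder command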

NEW

  • More than 95 percent of the cluster software is now built with Spack. Many new packages have been installed and are available for use.
  • Container support is provided through Apptainer (formerly Singularity) and NVIDIA enroot.
  • Lmod has replaced Environment Modules as the cluster’s module loading system. Lmod provides a compiler/MPI hierarchy for loading modules (see the examples after this list).
  • The Slurm tmpfs plugin has been enabled. Writes to /tmp in a Slurm job now go to the node’s local /mnt/local SSD filesystem and are cleaned up automatically at job completion.
  • Slurm GPU sharding has been enabled. You can request a portion (shard) of a GPU’s resources rather than the entire GPU.
  • Node features have been added to Slurm and can be used for requesting specific types of resources (see the job script sketch after this list).
  • An additional NVIDIA Tesla V100S GPU has been added to G003, matching the configuration of G001, G002 and G004. There are now eight NVIDIA Tesla V100S GPUs available for use.
  • There are now eight CPU nodes in total. All CPU nodes have been upgraded to 768 GB RAM.
  • All GPU nodes have been upgraded to 384 GB RAM.
  • A 150 TB, all-SSD storage appliance has been installed. The storage is available as /mnt/scratch.
  • Four additional GPU nodes, G005-G008, have been added to the cluster. Each node has two NVIDIA Quadro RTX 8000 GPUs. (Please note: due to an issue with a RAM module, G006 currently has slightly less RAM than the other seven systems.)
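
As a quick illustration of the Lmod hierarchy and Apptainer mentioned above, the commands below sketch a typical interactive workflow. The compiler, MPI, and container image names are placeholders, not the exact packages or versions installed on Clipper.

    # Lmod compiler/MPI hierarchy: load a compiler first, then an MPI, then dependent modules.
    module spider openmpi                # show which compiler/MPI combinations provide openmpi
    module load gcc                      # placeholder compiler module
    module load openmpi                  # placeholder MPI module, visible once the compiler is loaded

    # Apptainer (formerly Singularity): pull a container image and run a command inside it.
    apptainer pull ubuntu.sif docker://ubuntu:22.04   # placeholder image
    apptainer exec ubuntu.sif cat /etc/os-release

The GPU sharding, node feature, and tmpfs changes can all appear in a single job script. The sketch below shows one possible combination; the shard count, feature name, and file names are assumptions for illustration, and the feature names actually defined on Clipper can be listed with the sinfo or scontrol commands.

    #!/bin/bash
    #SBATCH --gres=shard:1               # request a shard (portion) of a GPU rather than a whole GPU
    #SBATCH --constraint=v100s           # request nodes by feature; feature name is a placeholder
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=8G
    #SBATCH --time=00:30:00

    # /tmp inside the job maps to the node's local /mnt/local SSD and is removed when the job ends.
    workdir=$(mktemp -d /tmp/job.XXXXXX)
    cp "$HOME/input.dat" "$workdir/"     # placeholder input file
    srun ./my_app "$workdir/input.dat"   # placeholder application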

UPDATES

  • Operating systems have been updated from Red Hat Enterprise Linux 9.2 to 9.4.
  • OS kernel has been updated from 5.14.0-162.6.1.el9_1.x86_64 to 5.14.0-362.8.1.el9_3.x86_64.
  • NVIDIA/Mellanox OFED driver has been updated from 5.8-2.0.3 to 24.01-0.3.3.
  • NVIDIA driver has been updated from 535.129.03 to 545.23.08.
  • The default CUDA version has been updated from 12.1 to 12.3 and has been consolidated into a single module. CUDA 11.8 is also available through the module system.
  • Slurm updated from 21.08.8 to 23.11.6.
  • The ml-python module has been re-installed with the latest versions of all software.
  • All system firmware has been updated to the latest Dell releases.
  • The NVIDIA/Mellanox QM8700 InfiniBand switch has been updated to the latest firmware.

DEPRECATIONS

  • The ml-python module name has been deprecated in favor of py-venv-ml and will be removed in a future release. Users will be notified of the deprecation upon loading ml-python. We recommend updating all Slurm scripts to use the py-venv-ml module name.
  • The /active file system has been deprecated and relocated to /mnt/projects. A symbolic link has been set up but will be removed in a future release. We recommend updating all references from /active to /mnt/projects.
  • The /archive file system has been deprecated and relocated to /mnt/archive. A symbolic link has been set up but will be removed in a future release. We recommend updating all references from /archive to /mnt/archive (see the example after this list).
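
For example, existing scripts can be checked and updated along the following lines; the scripts directory and project path are placeholders.

    # Find scripts that still reference the deprecated module name or paths.
    grep -rl -e 'ml-python' -e '/active' -e '/archive' "$HOME/scripts"

    # In Slurm scripts, switch to the new module name ...
    module load py-venv-ml               # instead of: module load ml-python
    # ... and to the new file system locations.
    cd /mnt/projects/mygroup             # instead of: cd /active/mygroup (placeholder project path)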

REMOVALS

  • Bright Cluster Manager has been replaced by in-house Ansible playbooks. This change allows the cluster to be expanded without additional licensing cost and enables additional support personnel from Enterprise Architecture to participate in system operations.
  • Many standalone Python virtual environment modules (numpy, scipy, etc.) have been removed in favor of Spack packages (py-numpy, py-scipy, etc.). Spack Python packages provide better interoperability and compatibility than the previous standalone environments.
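
Jobs that previously loaded the standalone environments can load the Spack-built equivalents instead. A brief sketch, using the py-numpy and py-scipy names mentioned above (the exact module names and versions on Clipper may differ; check with module spider):

    module spider py-numpy                      # confirm the available versions
    module load python py-numpy py-scipy        # placeholder module names for the Spack-built packages
    python -c "import numpy, scipy; print(numpy.__version__, scipy.__version__)"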

KNOWN ISSUES

  • The MPICH MPI library is not compiled with NVIDIA/Mellanox OFED InfiniBand support and will instead use the cluster's Ethernet network for MPI communications. As a workaround, GVSU ARC recommends using OpenMPI or Intel MPI instead of MPICH; see the sketch below.
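
As a workaround sketch, a job can load one of the recommended MPI stacks in place of MPICH. The module names below are placeholders and depend on the compiler loaded in the Lmod hierarchy.

    module load gcc openmpi              # placeholder compiler/MPI modules (or load an Intel MPI module)
    srun ./my_mpi_app                    # placeholder MPI application; uses the InfiniBand fabric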