Release 2024.12 - Dec. 20, 2024
We are happy to announce the release of Clipper 2024.12.
KEY UPDATES
- Red Hat kernel updated from 5.14.0-362.8.1 to 5.14.0-427.42.1.
- NVIDIA driver updated from 545 to 565.
- Mellanox OFED Infiniband drivers have been deprecated by NVIDIA and replaced by NVIDIA DOCA Infiniband drivers.
- Slurm job scheduler updated from 23.11.4 to 24.05.4.
- All system firmware has been updated to the latest Dell releases.
- The NVIDIA/Mellanox QM8700 Infiniband switch has been updated to the latest firmware.
FIXES
- Fixed an issue where GPU statistics would no longer be reported after 65 jobs were submitted to a single GPU node.
- Fixed an issue with MPICH not utilizing the Infiniband network for MPI communication.
BREAKING CHANGES
- Jobs submitted to the gpu queue must now always request a GPU resource (using --gres=gpu:1 or similar). Jobs that do not request a GPU resource will be rejected by the scheduler. See Using GPUs on Clipper for more information on requesting GPU resources; a sample batch script is shown after this list.
- The /archive filesystem mount point, which had previously been deprecated, has been removed. Please update any remaining references to point to /mnt/archive.
- The /active filesystem mount point, which had previously been deprecated, has been removed. Please update any remaining references to point to /mnt/projects.
- The py-venv-ml/nightly module has been removed. We recommend installing a personal virtual environment instead; a sketch is shown after this list.
- The ml-python module, which had been previously deprecated, has been removed. Use py-venv-ml instead.
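For reference, here is a minimal batch script that satisfies the new GPU request requirement. The job name, walltime, and command are illustrative placeholders; only the partition and --gres lines reflect the policy described above.

    #!/bin/bash
    #SBATCH --job-name=gpu-example    # placeholder job name
    #SBATCH --partition=gpu           # the gpu queue now requires an explicit GPU request
    #SBATCH --gres=gpu:1              # request one GPU; jobs without this are rejected
    #SBATCH --time=01:00:00           # placeholder walltime

    nvidia-smi                        # confirm the allocated GPU is visible to the job

If you previously relied on py-venv-ml/nightly, a personal virtual environment can be set up along these lines. The module name, environment path, and package list are examples only; adjust them to your workflow.

    module load python3.12                 # example: use a newer module-provided Python
    python3 -m venv ~/venvs/my-ml-env      # create a personal environment (path is an example)
    source ~/venvs/my-ml-env/bin/activate  # activate it (also do this in your job scripts)
    pip install --upgrade pip              # keep pip current
    pip install torch scikit-learn         # install the packages your workflow needs (examples)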
OTHER CHANGES
- A single user can now allocate only 75 percent of any partition’s nodes at once per the resource allocation policy.
- The system default compiler has been changed to gcc version 13. All provided modules have been re-compiled against this version of gcc.
- Module files have been updated to better align with the programs available through the system’s default PATH. For instance, Python 3.9 is a system-provided package and acts as the default Python interpreter when you log in; it is now marked as a sticky module and is “loaded” by default on login to reflect this. To use a different Python version (e.g., a newer release), simply load the corresponding module, which will replace Python 3.9 in your PATH (see the example after this list).
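As an example of this behavior, switching the default interpreter is a single module load. The module name shown is illustrative; run module avail python to see the versions actually provided.

    which python3            # resolves to the system Python 3.9 by default
    module load python3.12   # example module name; check `module avail python` for exact names
    which python3            # now resolves to the module-provided interpreter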
KNOWN ISSUES
- Due to compilation issues, fftw, hdf5, intel-oneapi-mkl, petsc, and slepc are not currently available as system modules for the MPICH MPI library. GVSU ARC recommends selecting an alternative MPI family.
NEW PACKAGES
- gcc 14.2.0
- julia 1.11.2
- openjdk 17.0.11
- openjdk 21.0.3
- petsc 3.22.1
- python 3.10.14
- slepc 3.22.1
- virtualgl 3.1.1
PACKAGE UPDATES
- apptainer updated from 1.3.0 to 1.3.5
- awscli-v2 updated from 2.13.22 to 2.15.53
- boost updated from 1.85 to 1.86
- Module-provided cmake updated from 3.27.9 to 3.30.5 (System-provided cmake remains at 3.26.5)
- cuda 12 updated from 12.3.0 to 12.6.0
- Module-provided curl updated from 8.6.0 to 8.10.1 (System-provided curl remains at 7.76.1)
- enroot updated from 3.4.1 to 3.5.0
- System-provided gcc 11 updated from 11.4.1 to 11.5.0
- gcc 12 is now provided by Red Hat’s AppStream repository instead of being built from source. The version is now 12.2.1.
- gcc 13 is now provided by Red Hat’s AppStream repository instead of being built from source. The version is now 13.3.1.
- gdb updated from 14.1 to 15.2
- gh updated from 2.43.1 to 2.58.0
- go updated from 1.22.2 to 1.23.2
- hdf5 updated from 1.14.3 to 1.14.5
- hwloc updated from 2.10.0 to 2.11.2
- intel-oneapi-compilers updated from 2024.1.0 to 2025.0.0
- intel-oneapi-mkl updated from 2024.0.0 to 2024.2.2
- intel-oneapi-mpi updated from 2021.12.1 to 2021.14.0
- julia updated from 1.10.2 to 1.10.7
- miniconda3 updated from 24.1.2 to 24.9.2
- mpich updated from 4.2.1 to 4.2.3
- nvhpc updated from 24.3 to 24.9
- openblas updated from 0.3.25 to 0.3.28
- openjdk11 updated from 11.0.20.1 to 11.0.23
- openmpi updated from 5.0.3 to 5.0.5
- osu-micro-benchmarks updated from 7.4 to 7.5
- perl updated from 5.38.0 to 5.40.0 (System-provided perl remains at 5.32.1)
- pmix updated from 5.0.2 to 5.0.4
- py-cython updated from 3.0.8 to 3.0.11
- py-mpi4py updated from 3.1.5 to 4.0.1
- py-pandas updated from 2.1.4 to 2.2.3
- py-scikit-learn updated from 1.4.2 to 1.5.2
- py-scipy updated from 1.11.3 to 1.13.1
- py-torchstack updated from 2024.05 to 2024.12 (contains many updated pytorch packages)
- py-venv-ml updated from 2024.05 to 2024.12 (contains many updated ai/ml packages)
- System-provided python3.9 updated from 3.9.18 to 3.9.19
- python3.11 updated from 3.11.7 to 3.11.9
- python3.12 updated from 3.12.1 to 3.12.5
- r updated from 4.4.0 to 4.4.1 (all provided libraries have been updated and compiled against this new version)
- rust updated from 1.78 to 1.81
- ucx updated from 1.16 to 1.18
Sept. 16, 2024
- Podman is now available in certain use cases to execute container images that may not work with Apptainer. Please contact GVSU ARC, as additional setup is needed to enable Podman for each user.
Release 2024.05 - May 17, 2024
We are happy to announce the release of Clipper 2024.05. This is a major enhancement/rebuild of the Clipper HPC environment.
Beginning with this release, we will refer to Clipper’s state with date-based versioning, updated twice yearly (e.g. 2024.05 and 2024.12). The .05 releases are targeted for large upgrades/changes, while the .12 releases will be for minor patching/updates. This is subject to change based on current needs.
BREAKING CHANGES
- When logging into clipper.gvsu.edu, your SSH client may notify you about a changed host key. This is expected as the login node has changed.
- Home folders have moved from /home to /mnt/home. This should not affect references to home using ~/. If your scripts use direct /home/<username> references, you must update these paths.
- Slurm CPU core, memory, and GPU allocations are now enforced. You must explicitly request the resources you wish to use in your job. Each core requested is allocated 4 GB of memory by default, and each GPU requested is allocated 4 CPU cores by default. A sample batch script is shown after this list.
- Many modules have new names. Existing module load references may need to be updated.
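A sketch of a batch script with explicit resource requests under the new enforcement is shown below. The task layout, memory values, and walltime are placeholders; the 4 GB-per-core figure is the default described above.

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4       # request CPU cores explicitly
    #SBATCH --mem-per-cpu=4G        # shown for clarity; this matches the per-core default
    #SBATCH --time=02:00:00         # placeholder walltime

    srun ./my_program               # placeholder executable

GPU jobs additionally request devices with --gres=gpu:1 (or similar); each requested GPU is allocated 4 CPU cores unless you ask for more.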
NEW
- More than 95 percent of the cluster software is now built with Spack. Many new packages have been installed and are available for use.
- Container support is provided through Apptainer (formerly Singularity) and NVIDIA enroot.
- Lmod has replaced Environment Modules as the cluster’s module loading system. Lmod provides a compiler/MPI hierarchy for loading modules.
- The Slurm tmpfs plugin has been enabled. Writing to /tmp in a Slurm job will now write to the node’s local /mnt/local SSD filesystem, which is cleaned up automatically at job completion.
- Slurm GPU sharding has been enabled. You can request a portion (shard) of a GPU’s resources rather than the entire GPU; see the sketches after this list.
- Node features have been added to Slurm and can be used for requesting specific types of resources.
- An additional NVIDIA Tesla v100s GPU has been added to G003, matching the configuration of G001, G002 and G004. There are now eight NVIDIA Tesla v100s GPUs available for use.
- There are now eight CPU nodes in total. All CPU nodes have been upgraded to 768 GB RAM.
- All GPU nodes have been upgraded to 384 GB RAM.
- A 150 TB, all-SSD storage appliance has been installed. The storage is available as /mnt/scratch.
- Four additional GPU nodes, G005-G008, have been added to the cluster. Each node has two NVIDIA Quadro RTX8000 GPUs.
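Two quick sketches of the new scheduling features follow. The shard count and feature name are placeholders: how many shards each GPU exposes and which feature names are defined on Clipper are site-specific, so check the cluster documentation or scontrol show node output.

    # Request a slice (shard) of a GPU instead of a whole device; shard count is illustrative
    sbatch --gres=shard:1 job.sh

    # Request nodes with a particular feature via a constraint; the feature name is a placeholder
    sbatch --constraint=somefeature job.sh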
UPDATES
- Operating systems have been updated from Red Hat Enterprise Linux 9.2 to 9.4.
- OS kernel has been updated from 5.14.0-162.6.1.el9_1.x86_64 to 5.14.0-362.8.1.el9_3.x86_64.
- NVIDIA/Mellanox OFED driver has been updated from 5.8-2.0.3 to 24.01-0.3.3.
- NVIDIA driver has been updated from 535.129.03 to 545.23.08.
- Default CUDA version has been updated from 12.1 to 12.3. It has been consolidated into a single module. CUDA 11.8 is also available through the modules system.
- Slurm updated from 21.08.8 to 23.11.6.
- The ml-python module has been re-installed with the latest versions of all software.
- All system firmware has been updated to the latest Dell releases.
- The NVIDIA/Mellanox QM8700 Infiniband switch has been updated to the latest firmware.
DEPRECATIONS
- The ml-python module name has been deprecated in favor of py-venv-ml and will be removed in a future release. Users will be notified of the deprecation upon loading ml-python. We recommend updating all Slurm scripts to use the py-venv-ml module name.
- The /active file system has been deprecated and relocated to /mnt/projects. A symbolic link has been set up but will be removed in a future release. We recommend updating all references to /active to /mnt/projects.
- The /archive file system has been deprecated and relocated to /mnt/archive. A symbolic link has been set up but will be removed in a future release. We recommend updating all references to /archive to /mnt/archive.
REMOVALS
- Bright Cluster Manager has been replaced by in-house Ansible playbooks. This change allows the cluster to be expanded without additional licensing cost and allows additional support personnel from Enterprise Architecture to participate in system operations.
- Many standalone Python virtual environment modules (numpy, scipy, etc.) have been removed in favor of Spack packages (py-numpy, py-scipy, etc.). Spack Python packages provide better interoperability and compatibility than the previous standalone environments; see the example after this list.
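For example, where a standalone module was loaded previously, the corresponding Spack-provided package is loaded instead. Module names beyond py-numpy and py-scipy are examples; run module avail py- to see what is installed.

    # Before (standalone environments, now removed):
    #   module load numpy scipy
    # After (Spack-provided Python packages):
    module load py-numpy py-scipy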
KNOWN ISSUES
- The MPICH MPI library is not compiled with NVIDIA/Mellanox OFED Infiniband support and will instead use the cluster's Ethernet network for MPI communications. GVSU ARC recommends using OpenMPI or Intel MPI instead of MPICH as a workaround.