Upgrade to Fedora 38 with cuda-12.4

Upgrading to Fedora 38 from Fedora 36 with Nvidia cuda-12.4

NB: Here we're emphasising being careful about cuda versions - in particular we want cuda-12-4, which is better for llama.cpp, because Fedora's gcc versions are newer than what earlier cuda releases expect.
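
A quick way to see the mismatch (a sketch - the /usr/local/cuda symlink is the usual layout for Nvidia's rpm packages, so adjust the path if yours differs):

# Fedora's system compiler:
gcc --version | head -n1
# The toolkit release (once cuda is installed) that has to accept that gcc:
/usr/local/cuda/bin/nvcc --version | grep release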

Standard Fedora 36 → 38 upgrade steps

dnf install dnf-plugin-system-upgrade --best

# Large download (to latest Fedora 36) :
dnf upgrade --refresh
# Takes several minutes, depending on whether you update regularly
shutdown -r now

dnf system-upgrade download --releasever=38
# Takes 30mins?
dnf system-upgrade reboot
# Takes 30mins?
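
After the second reboot, it's worth confirming that the upgrade actually landed:

cat /etc/fedora-release
# Expect: Fedora release 38 (Thirty Eight)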


# Collect useful stats:
uname -r;
rpm -q --queryformat '%{name}\t: %{version}\n' xorg-x11-drv-nvidia;
rpm -q --queryformat '%{name}\t\t\t: %{version}\n' cuda;
rpm -q --queryformat 'cudnn\t\t\t: %{version}\n' libcudnn8;
rpm -q --queryformat '%{name}\t: %{version}\n' google-chrome-stable;

nvidia-smi
# If this gives nice output : Then we are done
# NB: It might say '12.2' as the cuda version at the top
#     But the installed rpm (likely 11.8) is what *seems* to match reality
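
To see where each number comes from: the nvidia-smi header reports the maximum cuda version the driver supports, while the rpm (and nvcc, assuming the toolkit is at the usual /usr/local/cuda location) reports what's actually installed:

nvidia-smi | head -n 4
# 'CUDA Version' here = the driver's maximum supported version
/usr/local/cuda/bin/nvcc --version | grep release
# = the toolkit that's actually installed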

If there's no cuda library installed...

# NB: This is for a later version of Fedora (39) but is confirmed to work
REPO_BASE=https://developer.download.nvidia.com/compute/cuda/repos
dnf config-manager \
  --add-repo ${REPO_BASE}/fedora39/x86_64/cuda-fedora39.repo
dnf install cuda # We get 12.4
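
If the unversioned meta-package ever pulls in something newer than wanted, the repo also carries versioned meta-packages (assuming Nvidia's naming scheme holds), so pinning explicitly should work:

dnf install cuda-12-4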

# Clean up previous cuda versions according to taste:
dnf remove cuda-11-7
dnf remove cuda-11-8
dnf remove cuda-12-0
dnf remove cuda-12-1
dnf remove cuda-12-3
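
Listing the versioned cuda meta-packages that are installed makes it easy to see what's there to remove (and to confirm what's left afterwards):

rpm -qa | grep -E '^cuda-[0-9]+-[0-9]+' | sort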

# The cuda version should now show up in the stats lines:
rpm -q --queryformat '%{name}\t: %{version}\n' xorg-x11-drv-nvidia;
rpm -q --queryformat '%{name}\t\t\t: %{version}\n' cuda;
rpm -q --queryformat 'cudnn\t\t\t: %{version}\n' libcudnn8;

nvidia-smi
# If this gives nice output : Then we are done - look at PyTorch install next

Installing & Testing PyTorch

Do this once cuda is installed and working according to nvidia-smi. If the previous part didn't work, skip this section and continue below : Come back here later!

We also want PyTorch installed, which means a build compiled against cu124 :

python -m venv env311
. env311/bin/activate
pip install -U pip
pip install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/cu124

# And now test it:
python
>>> import torch
>>> torch.cuda.is_available()
True
>>> #... no error displayed, and the GPU is visible
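
For a slightly stronger smoke test than a bare import, this one-liner (a sketch) pushes a small tensor onto the GPU and does some arithmetic there:

python -c "import torch; x = torch.ones(3, device='cuda'); print(x * 2, torch.cuda.get_device_name(0))"
# Expect tensor([2., 2., 2.], device='cuda:0') followed by the GPU's name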

Supposing PyTorch now works : All done!

(continue here if the nvidia-smi looks bad above)

# If nvidia-smi is failing :
lsmod | grep nv
# Probably empty (except for i2c_nvidia_gpu)
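
For reference, on a working setup the driver shows up as a whole family of modules:

lsmod | grep '^nvidia'
# A healthy system lists something like:
#   nvidia_drm, nvidia_modeset, nvidia_uvm and nvidia itself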

journalctl -b | grep nvidia
# Check the journal kernel lines mention stuff like : modprobe.blacklist=nouveau
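
The kernel command line can also be checked directly - the RPM Fusion driver packages normally blacklist nouveau both for the initramfs and via modprobe:

cat /proc/cmdline | tr ' ' '\n' | grep -iE 'nouveau|nvidia'
# Expect entries like rd.driver.blacklist=nouveau and modprobe.blacklist=nouveau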

Apparently, this may be something that doesn't work quite right in what akmods produces:

depmod -a
shutdown -r now
nvidia-smi
# If this gives nice output : Then we are done - look at PyTorch install above

Rebuild the kernel module...

Apparently, normal updating doesn't give akmods enough time to complete, so let's do it manually here.

NV_KMOD=$(rpm -qa | grep kmod | grep $(uname -r))
echo $NV_KMOD
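# NV_KMOD should name the kmod package built for the *running* kernel;
# an empty result means akmods never produced one for this kernel:
[ -z "$NV_KMOD" ] && echo "no kmod package found for $(uname -r)"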

dnf remove $NV_KMOD
akmods --force --kernels $(uname -r)
# Takes a couple of minutes
shutdown -r now
nvidia-smi
# If this gives nice output : Then we are done - look at PyTorch install above

If the journal still mentions starting 'nvidia-fallback.service' ...

Perhaps the service is falling back to nouveau before the nvidia module loads properly:

systemctl disable nvidia-fallback.service
systemctl mask nvidia-fallback.service
shutdown -r now

nvidia-smi
# If this gives nice output : Then we are done - look at PyTorch install above

All done!