Nvidia (cuda 11.3) for TensorFlow & PyTorch while upgrading to Fedora 34

Upgrading to Fedora 34 from Fedora 32 (or Fedora 33)

(while being careful about `cuda` versions)

Suppose you already have a working Fedora 32 (or 33) machine and want to upgrade: as always, the Nvidia drivers are going to be painful!

Once again, the negativo repo keeps Fedora too up-to-date (its latest cuda is v11.4) for current builds of TF and PyTorch.

Fortunately, cuda v11.3 works (confirmed) for TF and PyTorch, so we can use the negativo repo for everything, if we are careful how it gets installed.

Install the negativo repo

Get the negativo Nvidia repo if you haven't already :

dnf config-manager --add-repo=http://negativo17.org/repos/fedora-nvidia.repo

And (only if you need it for display ahead of the upgrade) install the nvidia driver :

dnf install nvidia-driver

Remove `cuda` ahead of the upgrade

NB: Assuming we're still on Fedora 32 (or 33).

If you're currently below Fedora 32, you'll need to upgrade to Fedora 32 before going further: a jump of more than 2 releases seems to be too big a gap for the upgrade to Fedora 34.
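The 'no more than 2 releases' rule of thumb above can be sketched as a small check. This is only an illustration: the function names `fedora_version` and `can_upgrade_directly` are made up for this example, and the sample text merely mimics the `VERSION_ID` line that `/etc/os-release` carries.

```python
def fedora_version(os_release_text):
    """Return the integer VERSION_ID found in os-release style text."""
    for line in os_release_text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "VERSION_ID":
            return int(value.strip().strip('"'))
    raise ValueError("no VERSION_ID line found")


def can_upgrade_directly(current, target=34, max_gap=2):
    """True if `target` is at most `max_gap` releases ahead of `current`."""
    return 0 < target - current <= max_gap


# Hypothetical sample; on a real machine, read /etc/os-release instead
sample = 'NAME="Fedora Linux"\nVERSION_ID=32\n'
current = fedora_version(sample)
print(current, "-> 34 in one go:", can_upgrade_directly(current))
```

If the check comes back False for a release below 32, upgrade to Fedora 32 first and then continue with the steps here.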

Remove the current cuda (since the upgrade process will try to upgrade it to the latest version, which is 'too far'):

dnf remove nvidia-driver-cuda nvidia-driver-cuda-libs cuda cuda-cudnn

Standard Fedora 34 upgrade steps

dnf install dnf-plugin-system-upgrade --best

# Large download (to latest Fedora 32) :
dnf --refresh upgrade

shutdown -r now
# Reboots back into latest Fedora 32

# Very Large download (to latest Fedora 34)
dnf system-upgrade download --refresh --releasever=34

# Hold-your-breath reboot (will install 0%-100% hands-free ~45mins) :
dnf system-upgrade reboot

# Reboots via Fedora 32 into 'upgrade counter'
# Then reboots into new Fedora 34

Add back the correct `cuda` after the upgrade

See the rpm package lists for where the version number information came from.

Install the correct cuda packages:

dnf install nvidia-driver-cuda  # includes nvidia-smi
dnf install cuda-11.3.0 cuda-devel-11.3.0
dnf install cuda-cudnn-devel-8.2.0.53-1.fc34.x86_64

Now you should be able to check :

rpm -qa | grep cuda
nvidia-smi

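Surprisingly, even though the rpms should all be 11.3, nvidia-smi mentions 11.4 at the top. This is normal: nvidia-smi reports the highest cuda version that the installed driver supports, not the version of the installed toolkit.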
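To make the 'rpms should all be 11.3' check less of an eyeball exercise, here is a minimal sketch that scans package names of the kind `rpm -qa | grep cuda` prints. The helper names are invented for this example, and treating every package whose name starts with `cuda` (apart from `cuda-cudnn*`, which has its own version numbering) as part of the toolkit is an assumption. The sample release tags are illustrative, apart from the cudnn one used above.

```python
def split_nvr(package):
    """Split 'name-version-release.arch' into (name, version)."""
    name, version, _release_arch = package.rsplit("-", 2)
    return name, version


def off_version_cuda(packages, wanted="11.3"):
    """Return toolkit packages whose version does not start with `wanted`."""
    suspects = []
    for package in packages:
        name, version = split_nvr(package)
        # Assumption: cuda* packages are the toolkit, except cuda-cudnn*
        is_toolkit = name.startswith("cuda") and not name.startswith("cuda-cudnn")
        if is_toolkit and not version.startswith(wanted + "."):
            suspects.append(package)
    return suspects


# Illustrative sample of `rpm -qa | grep cuda` output
sample = [
    "cuda-11.3.0-1.fc34.x86_64",
    "cuda-devel-11.3.0-1.fc34.x86_64",
    "cuda-cudnn-devel-8.2.0.53-1.fc34.x86_64",
]
print(off_version_cuda(sample))  # an empty list means nothing strayed from 11.3
```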

Stop `cuda` getting upgraded by mistake

Add the following to the [main] section of /etc/dnf/dnf.conf:

exclude=cuda*
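A quick way to confirm that the exclusion really is in place is to parse the file. The sketch below relies on dnf.conf being INI-style and uses Python's standard `configparser`. The function name `excludes_cuda` is hypothetical, and checking `excludepkgs` as well as `exclude` is an assumption about how the option may be spelled on your system.

```python
import configparser


def excludes_cuda(dnf_conf_text, pattern="cuda*"):
    """True if [main] lists `pattern` under exclude (or excludepkgs)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(dnf_conf_text)
    if not parser.has_section("main"):
        return False
    for option in ("exclude", "excludepkgs"):
        value = parser.get("main", option, fallback="")
        # Entries may be separated by spaces or commas
        if pattern in value.replace(",", " ").split():
            return True
    return False


# Illustrative file contents; on a real machine, read /etc/dnf/dnf.conf
sample_conf = "[main]\ngpgcheck=1\nexclude=cuda*\n"
print(excludes_cuda(sample_conf))
```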

Python installation using virtual environments

The basic virtual environment creation (NB: the old environment won't work any more, so move it away to ~/DELETE-ME_env38, for instance) :

python3.9 -m venv env39
. env39/bin/activate
pip install --upgrade pip

Add TensorFlow (simple, for once!)

Still in env39 :

pip install tensorflow

And test out the installation using a Python CLI instance:

import tensorflow as tf
tf.test.gpu_device_name() # Deprecated, sadly
#  NVIDIA GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2

tf.config.list_physical_devices('GPU')
#  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], shape=[2, 3], name='a')
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], shape=[3, 2], name='b')
c = tf.matmul(a, b)

print(c.device)  # Hope for : /job:localhost/replica:0/task:0/device:GPU:0

Add PyTorch (more intricate)

For PyTorch, we have to install wheels built against a different cuda version (11.1). This should work, since Nvidia talks about backward compatibility post 11.1 (and this installation method does seem to work):

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
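The `+cu111` suffix in those version pins encodes the cuda version the wheel was built against. As a rough sketch (not an official rule), the 'same major version, wheel no newer than the installed toolkit' reasoning could be written down like this. Both function names are invented for this example, and reading the last digit of the tag as the minor version is an assumption based on tags like `cu111` and `cu102`.

```python
import re


def wheel_cuda_version(requirement):
    """Extract (major, minor) from a '+cuXYZ' local version tag, else None."""
    match = re.search(r"\+cu(\d+)", requirement)
    if match is None:
        return None
    digits = match.group(1)
    # Assumed convention: last digit is the minor version (cu111 -> 11.1)
    return int(digits[:-1]), int(digits[-1])


def fits_toolkit(requirement, installed=(11, 3)):
    """True if the wheel's cuda has the same major and a minor <= installed."""
    built = wheel_cuda_version(requirement)
    if built is None:
        return False
    return built[0] == installed[0] and built[1] <= installed[1]


for pin in ("torch==1.9.0+cu111", "torchvision==0.10.0+cu111"):
    print(pin, wheel_cuda_version(pin), fits_toolkit(pin))
```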

And test out the installation using a Python CLI instance:

import torch
torch.cuda.is_available()
#  True

#dtype = torch.FloatTensor  # Use this to run on CPU
dtype = torch.cuda.FloatTensor # Use this to run on GPU

a = torch.Tensor( [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]).type(dtype)
b = torch.Tensor( [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]).type(dtype)
c = a.mm(b)

print(c)  # matrix-multiply (should state : device='cuda:0')
print(c.device)  # Hope for : 'cuda:0'

And add a finicky extra, if needed:

pip install pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py39_cu111_pyt190/download.html

Useful extra ML `virtualenv` installs

# Essentials
pip install jupyter matplotlib

# Good for removing cell outputs prior to git commits
pip install nbstripout

# Could be handy
pip install opencv-python ffmpeg-python

NB: Crypto changes during upgrade

Fedora strengthened its cryptographic key standards in the move to Fedora 33 (and beyond), which essentially means that the default key type previously generated (ssh-rsa) isn't acceptable any more.

See this explanation for details.

This means that logging in to any upstream server where you are authorized via an ssh-rsa key won't work any more unless you:

Either downgrade the key exchange process (the crypto policy) on your local (upgraded) machine;
Or:
- Create a new (better) key pair on your local machine; and
- Upload its public key to the server

The simplest way to do this is to first create a better key pair locally (once):

ssh-keygen -t ed25519 -a 64

Then temporarily 'backtrack' the local crypto policy so that the old key still works, and copy the new public key up to each server that needs it (useful when you don't want to, or cannot, type in the user password):

# as root :
update-crypto-policies --set LEGACY

# as user :
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@example.com
#... more as required ...


# as root :
update-crypto-policies --set DEFAULT

This leaves the local system with 'modern' defaults again, while giving the server the option of a stronger key.

Later on, I guess, we should update the server(s) which accept older keys to exclude ssh-rsa keys too.
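As a starting point for that clean-up, here is a minimal sketch that lists the ssh-rsa entries in an `authorized_keys` style text. The function name `rsa_entries` is made up for this example, and the parsing is deliberately simplified: it assumes there are no quoted options containing spaces, and that the comment field itself never contains the word `ssh-rsa`. The sample key material is truncated and purely illustrative.

```python
def rsa_entries(authorized_keys_text):
    """Return the comment field of each entry whose key type is ssh-rsa."""
    found = []
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if "ssh-rsa" in fields:
            # Layout assumed: [options] keytype base64-key [comment...]
            index = fields.index("ssh-rsa")
            comment = " ".join(fields[index + 2:])
            found.append(comment or "(no comment)")
    return found


# Illustrative contents; on a server, read ~/.ssh/authorized_keys
sample_keys = (
    "# keys for user\n"
    "ssh-rsa AAAAB3NzaC1yc2E old@laptop\n"
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5 new@laptop\n"
)
print(rsa_entries(sample_keys))
```

Any entry this reports is a candidate for removal once the matching ed25519 key has been uploaded and tested.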

End

All done!