Welcome to the latest installment of Arrikto’s Kubeflow tips and tricks blog! In a simple Q&A format, we aim to provide tips and tricks for the intermediate to advanced Kubeflow user. Ok, let’s dive in.
Is there a way to auto-stop notebooks that are idle for a long time, such as overnight? We are looking to reduce resource usage.
Yes, in fact it is a setting in Kubeflow, but it is not enabled by default.
It’s called “notebook culling,” and Benjamin Tan wrote a great article about it. You can set it up so that it is enabled by default on a fresh install, but assuming you already have an instance that you want to apply it to, here are the instructions.
Kubeflow 1.5 will update how idleness is calculated, so if you upgrade from 1.4 to 1.5, expect a slight change in behavior.
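If you want a rough idea of what flipping that switch looks like, here is a minimal sketch of setting the culling environment variables on the notebook controller. The deployment and variable names below are what the 1.4-era manifests used and may differ in your install, so treat Benjamin’s article and your own manifests as the source of truth.
#!/bin/bash
## Sketch only: enable notebook culling on an existing install.
## ENABLE_CULLING turns the culler on, CULL_IDLE_TIME is the number of idle
## minutes before a notebook is stopped, and IDLENESS_CHECK_PERIOD is how often
## (in minutes) idleness is checked. Verify these names against your version.
kubectl -n kubeflow set env deployment/notebook-controller-deployment \
    ENABLE_CULLING=true \
    CULL_IDLE_TIME=1440 \
    IDLENESS_CHECK_PERIOD=5
Once the controller pod rolls over, notebooks that have been idle longer than CULL_IDLE_TIME should be stopped automatically.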
Credit: Question: Keith Adler (Community Slack); Answer: Alexandre Brown and Benjamin Tan
What’s the best way to get Metadata from an experiment?
This is the sort of question that would get blocked as ‘subjective’ on Stack Overflow and probably cost you some karma points. But this is a blog and I am OP and Moderator.
There was a lot of chat on this. The community was sort of drifting toward “MLFlow’s metadata tracker is best,” and then my friend and co-author Boris Lubinsky chimed in:
“MLFlow works great for demos, but I am afraid, it’s not scalable enough for large scale implementations. In general, in my opinion, metadata management was always a weakest link in Kubeflow.”
I trust Boris a lot and would agree (especially since I’ve not done much with metadata tracking personally). So MLFlow, but don’t count on it at scale. Also, a user named Timos chimed in later that KServe is worth watching, as they may be developing better tracking soon.
Credit: Question: Андрей Ильичёв; Answer: Compilation from the community and Boris
How do I enable GPUs on a local MiniKF?
Without reservation, the answer is “You shouldn’t be using Vagrant.” Period. Full stop. You should run MiniKF on GCP or AWS. Personally, I prefer GCP.
…
But I have to use it locally because <insert exceptionally complex reason that probably boils down to “no, you can’t use it on the cloud”>. OK, you have the one-in-ten-million corner case. Also, this keeps coming up, so I want to post it here in case Steven stops logging in to the community Slack one day and no one can ever figure this out again.
The issue isn’t MiniKF, it’s the way Vagrant is set up by default: it can’t access your GPUs. The solution is to swap the VirtualBox provider for libvirt, which CAN see your GPUs. I’m just going to copy-paste Steven’s answer from the community Slack since it is very thorough.
Steven says:
Here’s a more detailed explanation:
First of all, these are instructions for Linux. My current setup is a headless machine with 2 GPUs. With this setup, when MiniKF is running, the GPUs are “detached” from the main OS and only MiniKF can access them. This may not fit your current use case.
I followed some of the instructions I found here to enable IOMMU https://github.com/bryansteiner/gpu-passthrough-tutorial#—-tutorial
1. IOMMU Setup
- Enable IOMMU and CPU virtualization in the machine BIOS
- Enable IOMMU in the boot kernel parameters. (In my case the system uses grub2; I edited /etc/default/grub to add amd_iommu=on iommu=pt, then regenerated the config with grub2-mkconfig -o /boot/grub2/grub.cfg. See the sketch after the IOMMU group listing below.)
- Reboot
- Verify that IOMMU is correctly enabled: dmesg | grep -i -e DMAR -e IOMMU
- Find the IOMMU groups of your GPUs and their hardware info. (The script below outputs all your devices; search for VGA or NVIDIA to quickly find the values we’re looking for.)
#!/bin/bash
## Print every PCI device together with the IOMMU group it belongs to
for d in /sys/kernel/iommu_groups/*/devices/*; do
    ## Extract the IOMMU group number from the sysfs path
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    ## Show the device description for this PCI address
    lspci -nns "${d##*/}"
done
- In my case both cards are already in isolated IOMMU groups, so no patching is needed. We will refer to these values multiple times throughout the process.
IOMMU Group 16 08:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
IOMMU Group 16 08:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
IOMMU Group 47 43:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
IOMMU Group 47 43:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
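For reference, here is a sketch of what the kernel-parameter edit above ends up looking like. Keep whatever flags your distro already sets, and note that on Intel CPUs the parameter is intel_iommu=on rather than amd_iommu=on.
## /etc/default/grub (sketch; the existing flags and even the variable name may
## differ on your distro, e.g. GRUB_CMDLINE_LINUX instead of ..._DEFAULT)
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"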
2. libvirt setup (https://github.com/bryansteiner/gpu-passthrough-tutorial#—-part-2-vm-logistics)
- Enable hook support for qemu: wget 'https://raw.githubusercontent.com/PassthroughPOST/VFIO-Tools/master/libvirt_hooks/qemu' -O /etc/libvirt/hooks/qemu, then chmod +x /etc/libvirt/hooks/qemu
- Create a config with the device addresses at /etc/libvirt/hooks/kvm.conf. The addresses are the ones displayed when executing the IOMMU group script, e.g. 08:00.1 -> pci_0000_08_00_1
## Virsh devices
VIRSH_VIDEO_1=pci_0000_08_00_0
VIRSH_AUDIO_1=pci_0000_08_00_1
VIRSH_VIDEO_2=pci_0000_43_00_0
VIRSH_AUDIO_2=pci_0000_43_00_1
- Create bind hook /etc/libvirt/hooks/qemu.d/minikf_default/prepare/begin/bind_vfio.sh
#!/bin/bash
## Load the config file
source "/etc/libvirt/hooks/kvm.conf"
## Load vfio
modprobe vfio
modprobe vfio_iommu_type1
modprobe vfio_pci
## Unbind gpu from nvidia and bind to vfio
virsh nodedev-detach $VIRSH_VIDEO_1
virsh nodedev-detach $VIRSH_AUDIO_1
virsh nodedev-detach $VIRSH_VIDEO_2
virsh nodedev-detach $VIRSH_AUDIO_2
- Create unbind hook /etc/libvirt/hooks/qemu.d/minikf_default/release/end/unbind_vfio.sh
#!/bin/bash
## Load the config file
source "/etc/libvirt/hooks/kvm.conf"
## Load vfio
modprobe vfio
modprobe vfio_iommu_type1
modprobe vfio_pci
## Unbind gpu from vfio and reattach it to the host (nvidia driver)
virsh nodedev-reattach $VIRSH_VIDEO_1
virsh nodedev-reattach $VIRSH_AUDIO_1
virsh nodedev-reattach $VIRSH_VIDEO_2
virsh nodedev-reattach $VIRSH_AUDIO_2
modprobe -r vfio_pci
modprobe -r vfio_iommu_type1
modprobe -r vfio
- Execute chmod +x /etc/libvirt/hooks/qemu.d/minikf_default/prepare/begin/bind_vfio.sh /etc/libvirt/hooks/qemu.d/minikf_default/release/end/unbind_vfio.sh to make the hooks executable.
- Edit the vfio config /etc/modprobe.d/vfio.conf. The IDs correspond to the NVIDIA devices from the IOMMU Groups step. (I’m not sure if this step is required.)
options vfio-pci ids=10de:2204,10de:1aef
- Restart libvirt
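Before moving on, a quick sanity check that libvirt came back up and can see the GPUs as node devices. The service name can vary by distro, and the pci_0000_08/pci_0000_43 prefixes below are the example addresses from the kvm.conf above, so substitute your own.
#!/bin/bash
## Restart libvirt so it picks up the new hooks (the unit is usually libvirtd)
sudo systemctl restart libvirtd
## The GPUs should be listed as node devices matching the kvm.conf entries
virsh nodedev-list | grep -e pci_0000_08 -e pci_0000_43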
3. Vagrant setup
- Install vagrant-mutate: vagrant plugin install vagrant-mutate
- Mutate the minikf box to libvirt: vagrant mutate arrikto/minikf libvirt
- vagrant mutate does not copy all of the box configuration, so we need to copy the include folder from the virtualbox version ~/.vagrant.d/boxes/arrikto-VAGRANTSLASH-minikf/20210428.0.1/virtualbox/ into the libvirt version ~/.vagrant.d/boxes/arrikto-VAGRANTSLASH-minikf/20210428.0.1/libvirt/ (see the copy sketch after this list)
- Edit the local Vagrantfile to change the cpu/memory. In my case I allocate 30 CPUs and 40GB of memory. Add this section to the file:
config.vm.provider :libvirt do |libvirt|
  libvirt.cpus = 30
  libvirt.memory = 40960
end
- Run the box: vagrant up --provider=libvirt
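For the copy step above, here is a rough sketch of the command; the box version (20210428.0.1) comes from the example paths above, so substitute whatever version you actually have under ~/.vagrant.d/boxes.
#!/bin/bash
## Copy the box's include folder from the virtualbox provider dir to the libvirt one
cp -r ~/.vagrant.d/boxes/arrikto-VAGRANTSLASH-minikf/20210428.0.1/virtualbox/include \
    ~/.vagrant.d/boxes/arrikto-VAGRANTSLASH-minikf/20210428.0.1/libvirt/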
4. Additional box config
Now that the VM is created, we need to attach the GPUs. (Note: These steps could probably be added into the Vagrantfile but I’m not familiar with it)
- Stop the VM: vagrant halt
- Enable kvm hidden state: sudo virsh edit minikf_default. Add these lines in the <features> section.
<kvm>
<hidden state='on'/>
</kvm>
- Create config files for the GPUs. The address info is the same as what the IOMMU group script displayed, e.g. 08:00.1 -> bus 0x08, slot 0x00, function 0x1
device_gpu1.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</source>
</hostdev>
device_gpu1_audio.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0000' bus='0x08' slot='0x00' function='0x1'/>
</source>
</hostdev>
Repeat for device_gpu2.xml and device_gpu2_audio.xml if necessary.
- Attach the devices to the vm
sudo virsh attach-device minikf_default --config device_gpu1.xml
sudo virsh attach-device minikf_default --config device_gpu1_audio.xml
sudo virsh attach-device minikf_default --config device_gpu2.xml
sudo virsh attach-device minikf_default --config device_gpu2_audio.xml
- Restart the vagrant box: vagrant up --provider=libvirt
The GPUs should now be available inside MiniKF! You can verify by going into the MiniKF box with sudo vagrant ssh and executing nvidia-smi to list the devices.
~Steven Payre
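One extra check I like to run, beyond Steven’s nvidia-smi suggestion: confirm that Kubernetes inside the box also sees the GPUs as allocatable. This sketch assumes kubectl is available inside the MiniKF box and that the NVIDIA device plugin is installed there; if either is missing, nvidia-smi remains the definitive test.
#!/bin/bash
## Run nvidia-smi inside the box in one shot from the host
sudo vagrant ssh -c "nvidia-smi"
## The Kubernetes node inside the box should also advertise nvidia.com/gpu
sudo vagrant ssh -c "kubectl describe nodes | grep -i nvidia.com/gpu"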
Credit: Question: Ben Pashley; Answer: Steven Payre
How does one get involved in the Kubeflow Community?
Great question and glad you asked.
Every community is different, and the Kubeflow community has some quirks of its own. If you go to the “Community” section of the Kubeflow website, you’ll see information on how to join Slack and the various mailing lists.
In the Kubeflow community, the community Slack does appear to be the primary way users get support (which is also why so many questions in this series are pulled from there). But you’ll also see lots of mailing lists/Google Groups to join.
So joining the Slack (and participating) is a good first (and second) step. But what comes after that? My advice would be to give talks at meetups, or even volunteer to be a co-organizer for a meetup.
Speaking at meetups is a low-stress, low-stakes way to start making a name for yourself in the community. People starting out think they need to give conference-level talks at meetups. Not true. I think of meetups like “open mic nights” for stand-up comics: they’re a great place to try out new material and polish the things you think work, while a conference is more akin to an HBO special. Pauly Shore doesn’t go straight to HBO; he workshops smaller venues, honing his material.
The corollary to that is that you also don’t have to present “finished” projects. You can present something you’re currently working on and get ideas from other people interested in the topic.
But public speaking isn’t for everyone; the other option is co-organizing meetups. A co-organizer is in charge of finding speakers. In COVID times they will also sometimes be in charge of setting up the Zoom, but after the COVID times, they will be in charge of finding a venue for the meetup and coordinating with sponsors who provide refreshments. Co-organizing a meetup is like planning a party once a month. You also end up developing a local community of enthusiasts.
If you’re interested in either speaking OR being a local co-organizer for a Kubeflow and MLOps Meetup, please email me at trevor.grant@arrikto.com. As a note, we call out the Meetups in the next section; if your metro isn’t listed but you’re interested, that’s OK, we’d love to help you set up a new chapter.
Credit: Me, Kubeflow community and ASF member
February ‘22 Kubeflow Community Highlights
Slack
As of the time of writing, the Kubeflow Slack has just shy of 7,000 members. How cool is that?!
GitHub
Here are the current GitHub stats for a few of the various Kubeflow projects:
- Kubeflow: 11,216 stars. 37 new issues
- Pipelines: 34 new issues
- Katib: 7 new issues
- KServe: 5 new issues
- Manifests: 6 new issues
Meetups
It’s still too early for us to get together in person, but that doesn’t mean we can’t get together virtually to learn and discuss topics related to Kubeflow and MLOps! Arrikto has taken the initiative to set up a network of a dozen vendor-neutral “Data Science, Machine Learning, MLOps and Kubeflow” Meetups in several major metros. More than 2,400 community members have already signed up!
Join a Meetup near you and check out the calendar of upcoming talks.
Are you interested in speaking at a future Meetup?
Is your company interested in sponsoring a Meetup?
Would you like to be a co-organizer of a local Meetup? Send one of the organizers/hosts a message on Meetup.com!
Book a FREE Kubeflow and MLOps workshop
This FREE virtual workshop is designed with data scientists, machine learning developers, DevOps engineers and infrastructure operators in mind. The workshop covers basic and advanced topics related to Kubeflow, MiniKF, Rok, Katib and KFServing. In the workshop you’ll gain a solid understanding of how these components can work together to help you bring machine learning models to production faster. Click to schedule a workshop for your team.
About Arrikto
At Arrikto, we are active members of the Kubeflow community, having made significant contributions to the latest 1.4 release. Our projects/products include:
- Kubeflow as a Service is the easiest way to get started with Kubeflow in minutes! It comes with a Free 7-day trial (no credit card required).
- Enterprise Kubeflow (EKF) is a complete machine learning operations platform that simplifies, accelerates, and secures the machine learning model development life cycle with Kubeflow.
- Rok is a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
- Kale, a workflow tool for Kubeflow, which orchestrates all of Kubeflow’s components seamlessly.