Part 1 (of 3): The Exposition
Welcome to part one of a three-part series that will show how to use Kubernetes, Kubeflow, Apache Spark, and Apache Mahout to denoise CT scans. We'll do this with a real-world use case: analyzing COVID-19 CT scans. Although we're publishing this blog post series now, the research that inspired it began in the spring of 2020 and was published in the fall of 2020. It's a solid example nonetheless of the data problem involved in denoising low-dose CT scans, and of the clever use of open source software to arrive at a quick and easily reproducible method for doing just that. As Fry said in Futurama, "Time makes fools of us all." As such, some bits are broken now, and I will do my best to call them out with potential workarounds.
So we begin with a bit of exposition to introduce our open source characters and our problem.
The characters on this adventure will be Kubernetes, Kubeflow, Apache Spark, and Apache Mahout: a motley crew indeed.
They were born in different generations to solve similar but different problems. Can they overcome their differences to create something helpful for denoising low-dose CT scans, helping radiologists make early diagnoses of COVID-19? Since this is the first time you're hearing about it (now, about two years into the pandemic), you can probably guess: no. BUT! There is a good story here, with lots of important takeaways, so pour yourself a glass of tea (or whiskey) and cozy up by the fire as we present Denoising CT Scans With Open Source Friends, a story in three parts.
An overview of the three parts (WARNING: SPOILERS)
- Exposition (getting to know our main characters – you are here)
- The problem (why not just use rapid tests?)
- How to get our open source characters to work together in an easily reproducible method for denoising low-dose CT scans
Kubernetes
What is Kubernetes?
"Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available." (src: Kubernetes website)
Why do we need it in our use case?
This is my favorite "works on my machine" meme, but we're not just going to be using one machine (e.g. Docker, like the meme), we're going to be using multiple "machines" (really containers), and Kubernetes keeps track of them for us.
We're using Kubernetes to make this easily reproducible for others. Design once, run anywhere. Kubernetes is the de facto container orchestration layer (note I said de facto, not the only one). It works on-prem or on any serious cloud: GCP, AWS, Azure, IBM Cloud, DigitalOcean, etc.
Note: I am grossly understating the power and usefulness of Kubernetes here. Please, if you're not familiar with Kubernetes, go check it out, read blogs, see talks, learn about it, and get familiar with it; it's an amazing technology.
And secondly, we're using Kubernetes because we want to use…
Kubeflow
What is Kubeflow?
"The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow." (src: Kubeflow website)
To learn more about Kubeflow, let me also direct you to the free courses at Arrikto's Kubeflow Academy.
Why do we need it in our use case?
More succinctly, Kubeflow lets us do all of this work, then post our pipeline, and then anyone else can easily reproduce our work by:
- Fetching the data we used
- Cloning our notebooks/pipeline
- Uploading and running on their deployment
OR if they want to extend our work (for instance, doing some deep learning at the end), they do those three steps and then just add a final step for whatever magic they want to tack on.
OR if they want to use our system to denoise other CT scans, they just need to put said scans in a directory mimicking the structure of our input data.
OR something else entirely.
The point is that Kubeflow makes it so much easier to reproduce research, and there are lots of sources about why that is important.
Grumpy Cat doesn't like to reproduce research and neither do graduate students, but that is how sciences (and graduate programs) work.
I'm just tangentially touching on this; there's an entire other blog post here about:
- Why reproducible research is good
- Why Kubeflow is great for making findings reproducible and extendable
- Why people should not only post their code with papers, but also full pipelines (because preprocessing steps are another reason papers sometimes can't be reproduced)
OK, that ends this soapbox, but there will be more (soapboxes), I promise.
Apache Spark
What is Apache Spark?
"Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters." (src: Apache Spark website)
While still exciting, Spark is a somewhat more mature technology than the others here, and because you are "lean-in" technologists, I'm going to assume a bit of familiarity. If you're not yet familiar, spending a bit of time learning what Apache Spark is, what map-reduce is, and how Spark's in-memory implementation of map-reduce changed the proverbial game would be time well spent.
Why do we need it in our use case?
When we start working with CT scans, we find they are big. When we tried to do a matrix inversion (in the next post) on a single node in NumPy, we estimated it would need ~500GB of RAM! Some computers have that now (in 2022), but that's a harsh requirement.
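Where does a number like that come from? A hedged back-of-envelope: assuming the matrix is dense, square (n × n), and stored as 64-bit floats (the real dimensions depend on the scan data, so treat the n here as purely illustrative), the storage alone works out to:

$$
8\,\text{bytes} \times n^{2} = 5 \times 10^{11}\,\text{bytes}\ (500\,\text{GB}) \quad\Rightarrow\quad n = \sqrt{6.25 \times 10^{10}} \approx 250{,}000
$$

And inverting a matrix generally needs working copies beyond the input itself, so the practical footprint is a multiple of that.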
Spark lets multiple smaller machines work together on a single job. Lots of smaller machines are easier to marshal (and less costly) than a single large machine. This removes a barrier to access, because lots of smaller machines can be cheaply and easily rented in the cloud.
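For a flavor of what "multiple smaller machines on one job" looks like in code, here is a minimal sketch (not from our pipeline; the local[*] master and the toy rows are illustrative stand-ins for a real cluster and real scan data):

```scala
import org.apache.spark.sql.SparkSession

object RowNormsSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs everything in one JVM for demo purposes; on a real
    // cluster you'd submit this with spark-submit and drop the master here.
    val spark = SparkSession.builder()
      .appName("row-norms-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Pretend each array is one row of a very tall matrix. In practice
    // the rows would be read from distributed storage, not hard-coded.
    val rows = sc.parallelize(Seq(
      Array(1.0, 2.0, 3.0),
      Array(4.0, 5.0, 6.0),
      Array(7.0, 8.0, 9.0)
    ), numSlices = 3)

    // Each partition computes its share of the row norms in parallel;
    // Spark schedules the partitions across whatever executors exist.
    val norms = rows.map(r => math.sqrt(r.map(x => x * x).sum)).collect()
    println(norms.mkString(", "))

    spark.stop()
  }
}
```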
But note, we're only using Spark for the layer that lets multiple nodes work together. For the actual matrix math, we'll be using a library called…
Apache Mahout
What is Apache Mahout?
"Apache Mahout is a distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists implement their own algorithms." (src: Apache Mahout website)
You might be thinking, "wait, I've heard of Mahout, but I was thinking of something else. Am I confused?"
Well, I happen to be the PMC Chair of the project (one of many leaders, and the one stuck making sure the quarterly reports get filed). As such, let me take a moment to clarify:
Mahout got its start as an ML subproject of the Apache Lucene project (a project focused on search, which gave rise to, and is still used by, Apache Solr, Elasticsearch, and others). It was spun off as its own separate entity in the late 00s. I mention this just so I have an excuse to show the original logo, which is one of my favorites in all of open source.
The spun-off Mahout was the original distributed machine learning library: "machine learning on Hadoop" and Mahout are synonymous. It was in the first half of the 10s that Mahout rose to fame; this is the Mahout you're probably thinking of.
But then Hadoop started losing popularity to Spark, and way led on to way, and in 2014/2015 a "new" Mahout was made. This one used Spark as the default back end and introduced a Scala domain-specific language that made it look like R (that is, you could write R-like expressions in your Scala code, which are much easier to read mathematically).
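For a taste of that DSL, here is a minimal in-core sketch (the matrices are toy stand-ins; per the project's docs, the same R-like operators also apply to Mahout's distributed matrix types):

```scala
// In-core Mahout Samsara matrices with R-like operators.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

val a = dense((1.0, 2.0), (3.0, 4.0))  // build a 2x2 dense matrix from rows
val b = dense((5.0, 6.0), (7.0, 8.0))

val c = a %*% b        // matrix multiply, written just like R's %*%
val d = a.t            // transpose
val e = (a + b) * 2.0  // elementwise add, then scale by a scalar

println(c)
```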
But now it has been another half a decade, Scala's popularity is starting to wane, and several of the project's leaders are working on blockchain-related problems, either for fun or as part of their day jobs (I'm doing it for fun; my day job is Kubeflow).
When Mahout started, it was the first and only distributed machine learning library. Now we're just one of a nameless rabble, so we've recently decided to extend the library for doing analytics on blockchain ledgers.
For more information (or to get involved), join the Apache Mahout mailing list here.
Why do we need it in our use case?
While Spark and SparkML have some built-in functionality for dealing with matrices, they have issues with matrices that won't fit on a single node.
This isn't a problem for Spark, because not many people want to do raw matrix algebra anymore; most prefer higher-level machine learning algorithms.
But Mahout has much better support for distributed matrices. Specifically, we're going to need to do a matrix inversion (I'm purposefully hand-waving here, since advanced linear algebra wasn't a prerequisite for this post, and even if it were, I don't yet know how to make it interesting). The takeaway is:
Mahout has a method called Distributed Stochastic Singular Value Decomposition, which we'll shorten to DS-SVD for the rest of the series, and for our project we needed DS-SVD.
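And for the curious, here is roughly what a DS-SVD call looks like in Mahout's Scala DSL on Spark. This is a minimal sketch only: the tiny matrix, the local[*] master, and the parameter values are all illustrative stand-ins, not our actual pipeline (that comes in part three):

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.decompositions._
import org.apache.mahout.sparkbindings._

// A Mahout context wrapping a Spark context (local[*] is for demo only).
implicit val ctx = mahoutSparkContext(masterUrl = "local[*]", appName = "dssvd-sketch")

// Distribute an in-core matrix as a DRM (distributed row matrix).
val drmA = drmParallelize(dense(
  (1.0, 2.0, 3.0),
  (4.0, 5.0, 6.0),
  (7.0, 8.0, 9.0)
), numPartitions = 2)

// k = rank to keep, p = oversampling, q = power iterations.
val (drmU, drmV, s) = dssvd(drmA, k = 2, p = 1, q = 1)

// drmU and drmV are still distributed; s is an in-core vector of singular values.
println(s)
```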
End of Part 1
The problem our band of heroes will hope to solve is that of COVID screening.
But how?! You might have noticed NONE of these projects have anything to do with COVID-19 or CT scans, and while Kubernetes and Kubeflow are CNCF ecosystem buddies, and Mahout and Spark are Apache Big Data ecosystem buddies, CNCF and Apache Big Data products are not renowned for their tendency to play nicely together.
How will our heroes learn to overcome their differences and work together to solve a problem that would be way out of scope for each of them individually? Read part two to find out!