Denoising CT Scans with Kubeflow, Apache Spark, and Apache Mahout – Part 1

Part 1 (of 3): The Exposition

Welcome to part one of a three-part series that will show how to use Kubernetes, Kubeflow, Apache Spark, and Apache Mahout to denoise CT scans. We’ll do this with a real-world use case: analyzing COVID-19 CT scans. Although we’re publishing this blog post series now, the research that inspired it began in the spring of 2020 and was published in the fall of 2020. It’s a solid example nonetheless of the data problem involved in denoising low-dose CT scans, and of the clever use of open source software to get to a quick and easily reproducible method for doing just that. As Fry said in Futurama, “Time makes fools of us all.” As such, some bits are broken now, and I will do my best to call them out with potential workarounds.

So we begin with a bit of exposition to introduce our open source characters and our problem.

The characters on this adventure will be Kubernetes, Kubeflow, Apache Spark, and Apache Mahout—a motley crew indeed. 

They were born in different generations to solve similar but different problems. Can they overcome their differences to create something helpful for denoising low-dose CT scans to help radiologists do early diagnoses of COVID-19? Since this is the first time you’re hearing about it (now, about two years into the pandemic), you can probably guess, no. BUT! There is a good story here, with lots of important takeaways, so pour yourself a glass of tea (or whiskey) and cozy up by the fire as we present Denoising CT Scans With Open Source Friends, a story in three parts. 

An overview of the three parts (WARNING: SPOILERS)

  1. Exposition (getting to know our main characters – you are here)
  2. The problem (why not just use rapid tests?)
  3. How to get our open source characters to work together in an easily reproducible method for denoising low-dose CT scans

 

Kubernetes

Src: Wikipedia

What is Kubernetes?

“Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.” (src: Kubernetes website)

 

Why do we need it in our use case?

Src: Reddit Programmer Humor

This is my favorite “works on my machine” meme, but we’re not just going to be using one machine (e.g. Docker, like the meme), we’re going to be using multiple “machines” (really containers), and Kubernetes keeps track of them for us.

We’re using Kubernetes to make this easily reproducible for others. Design once, run anywhere. Kubernetes is the de facto container orchestration layer (note I said de facto, not the only one). It works on-prem or on any serious cloud: GCP, AWS, Azure, IBM Cloud, DigitalOcean, etc.

Note: I am grossly understating the power and usefulness of Kubernetes here. Please, if you’re not familiar with Kubernetes, go check it out, read blogs, see talks, learn about it, and get familiar with it. It’s an amazing technology.

And secondly, we’re using Kubernetes because we want to use…

 

Kubeflow

Src: Wikipedia

What is Kubeflow?

“The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.” (src: Kubeflow website)

To learn more about Kubeflow, let me also direct you to the free courses at Arrikto’s Kubeflow Academy.

 

Why do we need it in our use case?

Put succinctly, Kubeflow lets us do all of this work, then post our pipeline, and then anyone else can easily reproduce our work by:

  1. Fetching the data we used
  2. Cloning our notebooks/pipeline
  3. Uploading and running on their deployment. 

OR if they want to extend our work (for instance doing some deep learning at the end) they do those three steps and then just add a step at the end for whatever magic they want to tack on. 

OR if they want to use our system to denoise other CT scans, they just need to put said scans in a directory mimicking the structure of our input data. 

OR other. 

The point is that Kubeflow makes it so much easier to reproduce research, and there are lots of sources about why that is important.

 

Grumpy Cat doesn’t like to reproduce research and neither do graduate students, but that is how sciences (and graduate programs) work. 

I’m just tangentially touching on this; there’s an entire other blog post here about:

  1. Why reproducible research is good 
  2. Why Kubeflow is great for making findings reproducible and extendable 
  3. Why people should not only post their code with papers, but also full pipelines (because preprocessing steps are another reason sometimes papers can’t be reproduced).

OK, that ends this soapbox, but there will be more (soapboxes), I promise.

 

Apache Spark

Src: Wikipedia

What is Apache Spark?

“Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.” (src: Apache Spark website)

While still exciting and new, it’s a somewhat more mature technology, and because you are ‘lean-in’ technologists, I’m going to assume a bit of familiarity here. If you’re not yet familiar, it would be time well spent to learn what Apache Spark is, what map-reduce is, and how Apache Spark’s in-memory implementation of map-reduce changed the proverbial game.

 

Why do we need it in our use case?

When we start working with CT scans, we find that they are big: the matrix inversion we attempt in the next post was estimated to need ~500GB of RAM on a single node in NumPy!

Some computers have that now (in 2022), but that’s a harsh requirement.
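To get a feel for where a number like that comes from, here is a back-of-the-envelope sketch. It assumes the memory is dominated by a single dense matrix of 8-byte floats (NumPy’s default float type). The post does not say which matrix size produced the ~500GB estimate, so the dimension below is derived purely for illustration, and the function names are made up for this example.

```python
import math

BYTES_PER_FLOAT64 = 8  # NumPy's default float dtype uses 8 bytes per value


def dense_matrix_bytes(rows: int, cols: int, bytes_per_value: int = BYTES_PER_FLOAT64) -> int:
    """Memory needed to hold one dense rows x cols matrix."""
    return rows * cols * bytes_per_value


def largest_square_dim(budget_bytes: int, bytes_per_value: int = BYTES_PER_FLOAT64) -> int:
    """Largest n such that a dense n x n matrix fits in budget_bytes."""
    return math.isqrt(budget_bytes // bytes_per_value)


# Assumption: read "~500GB" as 500 decimal gigabytes held by ONE dense matrix.
budget = 500 * 10**9
n = largest_square_dim(budget)
print(n)                                   # 250000
print(dense_matrix_bytes(n, n) / 10**9)    # 500.0
```

Under that assumption, a square matrix with roughly a quarter of a million rows and columns is already enough to hit the limit, and that is before counting any working copies a decomposition needs.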

Spark lets multiple smaller machines work together on a single job. Lots of smaller machines are easier to marshal (and less costly) than a single large machine. This removes a barrier to access because lots of smaller machines can be cheaply and easily rented in the cloud.
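The idea of many small workers sharing one job can be sketched without Spark at all. The snippet below is only an analogy: threads stand in for machines, a list of rows stands in for a distributed dataset, and every name in it is hypothetical. It shows the map-reduce shape (split the data, compute a partial result per chunk, combine the partials), not Spark’s actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def partition(rows, num_parts):
    """Split a list of rows into at most num_parts roughly equal chunks."""
    size = max(1, -(-len(rows) // num_parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum_of_squares(chunk):
    """Map step: a worker only ever sees its own chunk of rows."""
    return sum(value * value for row in chunk for value in row)


def distributed_sum_of_squares(rows, num_workers=4):
    """Reduce step: add up the partial results from every worker."""
    chunks = partition(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return reduce(add, partials, 0)


print(distributed_sum_of_squares([[1, 2], [3, 4], [5, 6]]))  # 91
```

No worker ever needs the whole matrix in memory, which is the property that makes a pile of small machines a workable substitute for one huge one.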

But note, we’re only using Spark for the layer that lets multiple nodes work together. For the actual matrix math, we’ll be using a library called…

 

Apache Mahout

What is Apache Mahout?

“Apache Mahout is a distributed linear algebra framework and Mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists implement their own algorithms.” (src: Apache Mahout website)

You might be thinking, “wait, I’ve heard of Mahout, but I was thinking of something else. Am I confused?”

Well, I happen to be the PMC Chair of the project (one of many leaders who is stuck making sure the quarterly reports get filed). As such let me take a moment to clarify:

Mahout got its start as an ML subproject of the Apache Lucene project (a project focused on search which gave rise to and is still used by Apache Solr, Elasticsearch, and others). It was spun off as its own separate entity in the late 00s. I mention this just so I have an excuse to show the original logo, which is one of my favorites in all of open source.

The spun-off Mahout was the original distributed Machine Learning Library — “Machine Learning on Hadoop” and Mahout are synonymous. It was in the first half of the 10s that Mahout rose to fame — this is the Mahout you’re probably thinking of.

But then Hadoop started losing popularity to Spark, and way led on to way, and in 2014/2015 a “new” Mahout was made—this one used Spark as the default back end—and introduced a Scala domain-specific language that made it look like R (that is, you could write R-like expressions in your Scala code, which are much easier to read mathematically). 

But now it has been another half a decade, Scala’s popularity is starting to wane, and several of the project’s leaders are working on blockchain-related issues, either for fun or as part of their day jobs (I’m doing it for fun; my day job is Kubeflow).

When Mahout started it was the first and only distributed machine learning library. Now we’re just one of a nameless rabble, so we’ve recently decided to extend the library for doing analytics on blockchain ledgers.

For more information (or to get involved), join the Apache Mahout mailing list here.

 

Why do we need it in our use case?

While Spark and SparkML have some built-in functionality for dealing with matrices, they have issues with matrices that won’t fit inside a single node. 

This isn’t a problem for most Spark users, because not many people want to do matrix algebra anymore; they prefer machine learning algorithms instead.

But Mahout has much better support, especially for distributed matrices. Specifically, we’re going to need to do a matrix inversion, and I’m purposefully hand-waving here since Advanced Linear Algebra wasn’t a prerequisite for this post, and even if it were, I don’t yet know how to make this interesting. The takeaway is:

Mahout has a method called Distributed Stochastic Singular Value Decomposition, which we’ll shorten to DS-SVD for the rest of the series, and our project needed DS-SVD.
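For the curious, here is a minimal single-machine sketch of the random-projection idea that stochastic SVD methods are generally built on. It is NumPy on one node, not Mahout’s distributed implementation, and the function name and parameters (rank, oversample, seed) are assumptions made for this example.

```python
import numpy as np


def randomized_svd(a, rank, oversample=5, seed=0):
    """Approximate the top `rank` singular triplets of `a` via random projection."""
    rng = np.random.default_rng(seed)
    n_cols = a.shape[1]
    # 1. Sketch the column space of `a` with a random Gaussian test matrix.
    omega = rng.standard_normal((n_cols, rank + oversample))
    y = a @ omega
    # 2. Orthonormalize the sketch to get a small basis.
    q, _ = np.linalg.qr(y)
    # 3. Project `a` onto that basis and take an exact SVD of the small result.
    b = q.T @ a
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    # 4. Map the left singular vectors back to the original space.
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank, :]


# A 200 x 80 matrix that is exactly rank 5, so a rank-5 approximation can be exact.
rng = np.random.default_rng(42)
a = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 80))
u, s, vt = randomized_svd(a, rank=5)
print(np.allclose((u * s) @ vt, a))  # True
```

The appeal for big matrices is that the only expensive steps are matrix products with the original data, which can be split across machines, while the exact SVD runs on a much smaller matrix. An SVD is also one standard route to a (pseudo-)inverse, since inverting the factors only requires taking reciprocals of the nonzero singular values.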

 

End of Part 1

The problem our band of heroes will hope to solve will be that of COVID screening. 

But how?! You might have noticed NONE of these projects have anything to do with COVID-19 or CT scans, and while Kubernetes and Kubeflow are CNCF ecosystem buddies, and Mahout and Spark are Apache Big Data ecosystem buddies, CNCF and Apache Big Data products are not renowned for their tendency to play nicely together. 

How will our heroes learn to overcome their differences and work together to solve a problem that would be way out of scope for each of them individually? Read part two to find out!