Denoising CT Scans with Kubeflow, Apache Spark, and Apache Mahout – Part 1

Part 1 (of 3): The Exposition

Welcome to part one of a three-part series that will show how to use Kubernetes, Kubeflow, Apache Spark, and Apache Mahout to denoise CT scans. We'll do this with a real-world use case of analyzing COVID-19 CT scans. Although we're publishing this blog post series now, the research that inspired it began in the spring of 2020 and was published in the fall of 2020. It's a solid example nonetheless of the data problem involved in denoising low-dose CT scans, and of the clever use of open source software to get to a quick and easily reproducible method for doing just that. As Fry said in Futurama, "Time makes fools of us all." As such, there are some bits that are broken now, and I will do my best to call them out with potential workarounds.

So we begin with a bit of exposition to introduce our open source characters and our problem.

The characters on this adventure will be Kubernetes, Kubeflow, Apache Spark, and Apache Mahout: a motley crew indeed.

They were born in different generations to solve similar but different problems. Can they overcome their differences to create something helpful for denoising low-dose CT scans, so that radiologists can diagnose COVID-19 early? Since this is the first time you're hearing about it (now, about two years into the pandemic), you can probably guess: no. BUT! There is a good story here, with lots of important takeaways, so pour yourself a glass of tea (or whiskey) and cozy up by the fire as we present Denoising CT Scans With Open Source Friends, a story in three parts.

An overview of the three parts (WARNING: SPOILERS)

  1. Exposition (getting to know our main characters – you are here)
  2. The problem (why not just use rapid tests?)
  3. How to get our open source characters to work together in an easily reproducible method for denoising low-dose CT scans

 

Kubernetes


What is Kubernetes?

"Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available." (src: Kubernetes website)

 

Why do we need it in our use case?

Src: Reddit Programmer Humor

This is my favorite "works on my machine" meme, but we're not just going to be using one machine (e.g. Docker, like the meme); we're going to be using multiple "machines" (really containers), and Kubernetes keeps track of them for us.

We're using Kubernetes to make this easily reproducible for others. Design once, run anywhere. Kubernetes is the de facto container orchestration layer (note I said de facto, not the only one). It works on prem or on any serious cloud: GCP, AWS, Azure, IBM Cloud, DigitalOcean, etc.

Note: I am grossly understating the power and usefulness of Kubernetes here. Please, if you're not familiar with Kubernetes, go check it out: read blogs, see talks, learn about it, and get familiar with it; it's an amazing technology.

And secondly, we're using Kubernetes because we want to use...

 

Kubeflow


What is Kubeflow?

"The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow." (src: Kubeflow website)

To learn more about Kubeflow, let me also direct you to the free courses at Arrikto's Kubeflow Academy.

 

Why do we need it in our use case?

More succinctly, Kubeflow lets us do all of this work, then post our pipeline, and then anyone else can easily reproduce our work by:

  1. Fetching the data we used
  2. Cloning our notebooks/pipeline
  3. Uploading and running on their deployment.

OR if they want to extend our work (for instance, doing some deep learning at the end), they do those three steps and then just add a step at the end for whatever magic they want to tack on.

OR if they want to use our system to denoise other CT scans, they just need to put said scans in a directory mimicking the structure of our input data.

OR something else entirely.

The point is that Kubeflow makes it so much easier to reproduce research, and there are lots of sources about why that is important.

 

Grumpy Cat doesn't like to reproduce research and neither do graduate students, but that is how science (and graduate programs) works.

I'm just tangentially touching on this; there's an entire other blog post here about:

  1. Why reproducible research is good
  2. Why Kubeflow is great for making findings reproducible and extendable
  3. Why people should not only post their code with papers, but also full pipelines (because preprocessing steps are another reason papers sometimes can't be reproduced).

OK, that ends this soapbox, but there will be more (soapboxes), I promise.

 

Apache Spark


What is Apache Spark?

"Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters." (src: Apache Spark website)

While still exciting and new, it's a bit more mature of a tech, and because you are 'lean-in' technologists, I'm going to assume a bit of familiarity here. If you're not yet familiar, spending a bit of time learning about what Apache Spark is, what map-reduce is, and how Apache Spark's in-memory implementation of map-reduce changed the proverbial game would be time well spent.

 

Why do we need it in our use case?

CT scans are big. Doing the matrix inversion we attempt in the next post on a single node in NumPy was estimated to need ~500 GB of RAM!

Some computers have that much now (in 2022), but that's a harsh requirement.

Spark lets multiple smaller machines work together on a single job. Lots of smaller machines are easier to marshal (and less costly) than a single large machine. This removes a barrier to access, because lots of smaller machines can be cheaply and easily rented in the cloud.
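To make that concrete, here is a minimal sketch of asking Spark for a pool of modest executors instead of one giant node. The app name and resource numbers are illustrative placeholders, not the settings used in this series:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: ask the cluster manager (e.g. Kubernetes) for several
// modest executors rather than one node with ~500 GB of RAM.
val spark = SparkSession.builder()
  .appName("ct-denoise-sketch")             // placeholder name
  .config("spark.executor.instances", "8")  // eight workers...
  .config("spark.executor.memory", "64g")   // ...with 64 GB of RAM each
  .config("spark.executor.cores", "4")
  .getOrCreate()
```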

But note, we're only using Spark for the layer that lets multiple nodes work together. For the actual matrix math, we'll be using a library called...

 

Apache Mahout

What is Apache Mahout?

"Apache Mahout is a distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists implement their own algorithms." (src: Apache Mahout website)

You might be thinking, "Wait, I've heard of Mahout, but I was thinking of something else. Am I confused?"

Well, I happen to be the PMC Chair of the project (one of many leaders, but the one stuck making sure the quarterly reports get filed). As such, let me take a moment to clarify:

Mahout got its start as an ML subproject of the Apache Lucene project (a search project that gave rise to, and is still used by, Apache Solr, Elasticsearch, and others). It was spun off as its own separate project in the late 00s. I mention this just so I have an excuse to show the original logo, which is one of my favorites in all of open source.

The spun-off Mahout was the original distributed machine learning library; "Machine Learning on Hadoop" and Mahout are synonymous. It was in the first half of the 10s that Mahout rose to fame, and this is the Mahout you're probably thinking of.

But then Hadoop started losing popularity to Spark, and way led on to way, and in 2014/2015 a "new" Mahout was made. This one used Spark as the default back end and introduced a Scala domain-specific language that made it look like R (that is, you could write R-like expressions in your Scala code, which are much easier to read mathematically).

But now it has been another half a decade, Scala's popularity is starting to wane, and several of the project's leaders are working on blockchain-related issues, either for fun or as part of our day jobs (I'm doing it for fun; my day job is Kubeflow).

When Mahout started, it was the first and only distributed machine learning library. Now we're just one of a nameless rabble, so we've recently decided to extend the library for doing analytics on blockchain ledgers.

For more information (or to get involved), join the Apache Mahout mailing list here.

 

Why do we need it in our use case?

While Spark and SparkML have some built-in functionality for dealing with matrices, they have issues with matrices that won't fit inside a single node.

This isn't a problem for Spark, because not that many people want to do matrix algebra anymore; most prefer higher-level machine learning algorithms.

But Mahout has much better support, especially for distributed matrices. Specifically, we're going to need to do a matrix inversion, and I'm purposefully hand-waving here, since Advanced Linear Algebra wasn't a prerequisite for this post, and even if it were, I don't yet know how to make this interesting. The takeaway is:

Mahout has a method called Distributed Stochastic Singular Value Decomposition, which we'll be shortening to DS-SVD for the rest of the series, and for our project we needed DS-SVD.
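To give a flavor of what that looks like, here is a minimal sketch of calling dssvd through Mahout's Samsara Scala DSL. The tiny in-core matrix and the parameter values are illustrative placeholders, not the actual CT-scan pipeline (that comes later in the series):

```scala
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.decompositions._
import org.apache.mahout.sparkbindings._

// Illustrative only: a local Mahout-on-Spark context. In a real deployment
// (or a Kubeflow notebook) you would wrap an existing SparkContext instead.
implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "dssvd-sketch")

// A toy distributed row matrix (DRM) standing in for the much larger CT-scan matrix.
val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)), numPartitions = 2)

// Distributed Stochastic SVD: decomposition rank k, oversampling p, power iterations q.
val (drmU, drmV, s) = dssvd(drmA, k = 2, p = 1, q = 1)

println(s) // in-core vector of the k singular values
```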

 

End of Part 1

The problem our band of heroes hopes to solve is that of COVID-19 screening.

But how?! You might have noticed that NONE of these projects has anything to do with COVID-19 or CT scans, and while Kubernetes and Kubeflow are CNCF ecosystem buddies, and Mahout and Spark are Apache Big Data ecosystem buddies, CNCF and Apache Big Data products are not renowned for their tendency to play nicely together.

How will our heroes learn to overcome their differences and work together to solve a problem that would be way out of scope for each of them individually? Read part two to find out!