DevOps and Data Science: The Wrong Way

Test Your Might

Machine learning engineering and data science: two teams often locked in combat over who knows best and which tools benefit them the most. Our previous blog post dove into the unforeseen costs of failing to negotiate a peaceful path forward. The ceasefire we are negotiating should not only solve the problem, but solve it in a way that leaves the teams you are engaging with better off than when you met them. If we fail to draft a treaty together that benefits both sides, our teams can become focused on self-preservation instead of delivering value to the business. The real threats that teams often overlook are budgets drying up and leadership needing to cut their losses for the health of the business. This blog section is all about how we can relay messages of understanding to the teams we are developing solutions for right from the get-go, along with a real-life cautionary tale of what happens when you convince yourself you know best. The blog section following this one will give us hope and show the ways we have gotten things right! This whole process is iterative, and all our work on negotiating and relaying understanding starts at the glass.

Everything Starts at the Glass

If you have ever had a conversation with Patrick Gryzan (another solution architect at Arrikto), he will tell you that “everything starts at the glass”. He will usually say this in an inspiring tone while staring up towards the stars, imagining a perfectly clean user interface and an excited end user ready to make magic happen. “The glass” he is referring to is the user interface you put in front of your customer. This initial point of contact is how you communicate your understanding of what they value as a professional. A good UX demonstrates a strong grasp of how a user wants to get their work done, and sends a message of “we hear you” from the development team. If a good UX signals partnership and comprehension, then what does a bad UX signal to our end users? What about a user experience that demands skills that aren’t in a user’s standard repertoire? What happens when you ask data scientists how often they plan to commit their code so you can push it to a build pipeline, or ask them to learn git, or to write secure, production-ready code? How do these demands negatively impact our critical mission to negotiate the best path forward together? More importantly, how do we heal the wounds we potentially just created by putting a pile of “not my problem, figure it out” in front of our data science teams, and vice versa?

I can tell you from experience that forcing patterns or platforms on unwilling end users will NOT improve your outcomes at all. Without a shared perception of the problem, both teams are set up for failure. I have seen this firsthand! I am far from immune to the “shoot first, ask questions later” mentality we sometimes fall into as technologists. I have pushed for a prototype when I didn’t have a clear understanding of the problem I was solving. One such example is when I was trying to help a customer with some cross-team collaboration efforts. I aimed hastily, shot quickly, and missed my mark. Had I spent more time looking down my sight, I would have noticed I never had a clear shot in the first place, all thanks to the fabled wall of confusion.

The Wall of Confusion

As alluded to in the previous section, I have witnessed firsthand the pain of a problematic prototype. During my early days at Arrikto I ran into a customer’s platform team that had a pretty common problem. Their data scientists had asked them to convert code from Jupyter notebooks into hermetically sealed, servable applications. The data science team had their own internal way of training the model as well as packaging their code. They threw the code over the infamous wall of confusion, and the platform team’s race was on to make the code palatable for production. The dream was to get both teams working within the same Kubeflow environment in order to break down the wall between engineering and the data scientists. First, we had to prove to the data science team we meant business. How do we do that? Well, as a Kubeflow-incentivized architect, I thought my Kubeflow-shaped hammer could solve any problem. I quickly used Kubeflow Pipelines, PodDefaults, custom scripts, and marshaled volumes to create a “container build notebook” the data science team could use (a sketch of what that looked like follows below).

During my prototyping process, I ended up creating more questions than answers. How do we version and tag the model image I just built? How do we update their production cluster with the “latest” model? What defines “latest”? Who has jurisdiction over what is allowed to be moved to and from production? Who will support Argo and store the credentials? Does a team of this size really benefit from all this automation?

What ended up happening was me handing off some code that “worked” but was incredibly complex and error prone. I created a supportability nightmare, and although it was “just a prototype” it still failed to show the team the art of the possible. It may have had a lot of strong DevOps patterns supporting it, but what about branching strategies and qualifying what a release actually is? None of these questions were going to be solved by my super special proof of concept. I spent valuable time creating a solution that no amount of refining would turn into value for the customer’s team of data scientists. Worse than that, the data science team wouldn’t even give us the time of day. They were unwilling to make any type of investment. We failed a crucial part of the negotiation. We failed to bring both teams to the table, and to hold off on a POC, until we really understood what we were up against. We needed to understand how the wall of confusion had stayed up this long and what we could do to tear it down.
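For concreteness, here is a minimal sketch, using the kfp v1 SDK, of the kind of “container build” step that prototype revolved around. This is not the customer’s actual code: the PVC name, registry, and image tag are hypothetical placeholders, and the registry-credential wiring (which the real prototype pushed into PodDefaults and custom scripts) is deliberately left out.

```python
# Minimal sketch (hypothetical names throughout) of a Kubeflow pipeline step
# that builds and pushes a container image from code sitting on a shared volume.
import kfp
from kfp import dsl


@dsl.pipeline(
    name="notebook-to-image",
    description="Package a notebook's exported code into a servable container image.",
)
def notebook_to_image(
    destination: str = "registry.example.com/team/model:v0.1",  # hypothetical registry/tag
):
    # The data scientists' code lives on a PVC marshaled out of their notebook
    # server; "workspace-pvc" is a placeholder name.
    workspace = dsl.PipelineVolume(pvc="workspace-pvc")

    # Kaniko builds the image in-cluster, no Docker daemon required.
    # Registry credentials (a docker config secret, injected in the prototype
    # via a PodDefault) are intentionally omitted from this sketch.
    dsl.ContainerOp(
        name="kaniko-build",
        image="gcr.io/kaniko-project/executor:latest",
        arguments=[
            "--context", "dir:///workspace/model",  # expects a Dockerfile in this directory
            "--destination", destination,
        ],
        pvolumes={"/workspace": workspace},
    )


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(notebook_to_image, "notebook_to_image.yaml")
```

Even in this stripped-down form you can see where the open questions creep in: the tag baked into destination is exactly the “version and tag” and “what defines latest” problem, and nothing here answers who is allowed to promote that image to production.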

The IKEA Effect

Maybe they didn’t need the velocity that an automated build and promotion system offered. Maybe there was no incentive or executive pressure on the data science team. I don’t know why the data science team didn’t give my prototype a chance. All I know is that my science experiment wasn’t well received and my ego was a bit bruised. Luckily, I eventually realized that I had been given a fantastic lesson in feedback loops and the need for negotiations. There is no such thing as no choice. The data science team chose to continue to make their custom process the engineering team’s problem. Without cross-team empathy and mission focus, no amount of fancy code or tools would help reduce strife.

I also learned that even if my POC did “work”, I had completely opted out of The IKEA Effect. If you are unaware of The IKEA Effect, it is “a cognitive bias in which consumers place a disproportionately high value on products they partially created”. That means that without understanding how the data science team wants to work, what makes them tick, and how they can contribute to the system we are building, I lost an enormous amount of influence. I wasn’t demonstrating a better way of life at all. I was throwing another vendor tool in front of their data science team and hoping for the best. Had I gotten another opportunity in front of their data science team, I would have worked with them to really understand what drives their decisions and set up an iterative process. I would’ve created a journey we could go on together and focused on a transformative effort versus a blind bet at the craps table. I let my ego get the best of me and thought I knew their problems better than they did. Without both teams’ initial investment, my heroic dose of Kubeflow meant nothing.

Fortunately enough, had I negotiated their needs better I would have had a very powerful framework to continue to move them towards improved productivity. A framework to enable a cultural shift of cross-team communication and iterative tooling improvements that, commit by commit, can help build a better “at the glass” moment. A framework that would demonstrate a deep understanding of that data science team’s objectives and enable them to build a culture of MLOps. That magical framework is DevOps. The DevOps framework, when applied correctly, has already done wonders for data scientists. Don’t believe me? You don’t have to! I will leave the convincing to Stefano in our next blog post, DevOps and Data Science: The Right Way.

About the author

At the time of writing, Chase is a solutions architect at Arrikto whose focus is helping people discover "what's in it for them" as they journey into the world of leveraging platforms to accelerate their development efforts. Chase's early career was oriented around manual tasks and testing. Since then, he has pursued the goal of reducing the amount of "job chores" professionals must partake in and giving them time to solve more interesting and valuable problems. Reducing toil aligns well with Chase's solutions architecture strategy of people's problems first and technology second, to avoid "yet another tool" being thrown into the already overencumbered toolbox that technology professionals are being pressured to keep up with.

About Kubeflow

Kubeflow is an open source, cloud-native MLOps platform originally developed by Google that aims to provide all the tooling that both data scientists and machine learning engineers need to run workflows in production. Features include model development, training, serving, AutoML, monitoring and artifact management. 

Kubeflow is the open source machine learning toolkit for Kubernetes.

About Arrikto

We are a Machine Learning platform powered by Kubeflow and built for Data Scientists. We make Kubeflow easy to adopt, deploy and use, having made significant contributions since the 0.4 release and continuing to contribute across multiple areas of the project and community. Our projects/products include:

  • Enterprise Kubeflow (EKF) is a complete MLOps platform that reduces costs, while accelerating the delivery of scalable models from laptop to production.
  • Kubeflow as a Service is the easiest way to get started with Kubeflow in minutes! It comes with a Free 7-day trial (no credit card required).
  • Rok is a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
  • Kale is an open source workflow tool for Kubeflow that orchestrates all of Kubeflow’s components seamlessly.