Kaggle’s Natural Language Processing with Disaster Tweets as a Kubeflow Pipeline

Welcome to the latest installment of Arrikto’s ongoing series of blog posts that demonstrate how to take popular Kaggle competitions and convert them into Kubeflow Pipelines. All the converted Kaggle competitions in the series are contributed to the Kubeflow project for others to use and distribute.

Wait, What’s Kubeflow?

Kubeflow is an open source, cloud-native MLOps platform originally developed by Google that aims to provide all the tooling that both data scientists and machine learning engineers need. Features include model development, training, serving, AutoML, monitoring, artifact management and experiment tracking. The latest 1.5 release features contributions from Google, Arrikto, IBM, Twitter and Rakuten. Want to try it for yourself? Get started in minutes with a free trial of Kubeflow as a Service, no credit card required.

About the NLP with Disaster Tweets Competition

Here’s the competition description from the Kaggle website:

“Twitter has become an important communication channel in times of emergency.

The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. For example:

The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we’ve created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

Acknowledgments

This dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website here.

You can download all the code, notebooks and data used in this Kaggle competition on GitHub.

Prerequisites for Building the Kubeflow Pipeline

Kubeflow

If you don’t already have Kubeflow up and running, we recommend signing up for a free trial of Kubeflow as a Service.

Pipeline Development

Skills that will be helpful:

  • Python basics, including working with pickle files
  • Familiarity with pandas, NLP libraries such as NLTK, and supporting tools like regex, string handling, tqdm, and TensorFlow/Keras functions
  • Data preprocessing for NLP models, including tokenizers and sequence padding (see the short sketch after this list)
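For a flavor of the preprocessing involved, here is a minimal sketch using the Keras Tokenizer and pad_sequences utilities. The vocabulary size, sequence length and example sentences are arbitrary placeholders, not the values used in the competition notebook.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy example sentences standing in for the tweet text
texts = ["a wildfire is spreading near the town", "this playlist is fire"]

tokenizer = Tokenizer(num_words=10000)   # keep the 10,000 most frequent tokens
tokenizer.fit_on_texts(texts)            # build the word index

sequences = tokenizer.texts_to_sequences(texts)               # words -> integer ids
padded = pad_sequences(sequences, maxlen=20, padding="post")  # equal-length model inputs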

Building the Kubeflow Pipeline

The steps for creating a Kubeflow Pipeline using KFP are as follows:

Step 1: Defining functions

Each function should be self-contained: every library the function uses must be imported inside the function body, because each pipeline component runs in its own container.
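As an illustration, a self-contained function for this pipeline might look like the following minimal sketch; the function name and cleaning logic are placeholders rather than the exact code from the competition notebook.

def preprocess_text(text: str) -> str:
    # Imports live inside the function because the component
    # runs in its own container, isolated from this notebook.
    import re
    import string

    text = re.sub(r"https?://\S+", "", text)                          # strip URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return text.lower()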

Step 2: Passing data between components

The best way to pass large data files between steps is to use the KFP InputPath() and OutputPath() parameter annotations, which hold the locations of input and output files (we generally use these for large files such as CSV or text files).

  • To download the initial data we’ll pass a URL
  • We’ll then use the URL inside a function to fetch the data
  • Next, we’ll store the output as a pickle file and save its location in an OutputPath() variable
  • That location is then passed as an InputPath() to the next component, which extracts the contents of the pickle file and uses them

This cycle, using the data, storing it in a pickle file, and passing the pickle file’s location as input to the next step, is repeated whenever large data files need to move between steps, as in the sketch below.
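As a rough sketch (the function names, the “target” column and the pickle layout are illustrative assumptions, not the exact notebook code), a pair of components passing a pickle file might look like this:

from kfp.components import InputPath, OutputPath

def load_data(url: str, data_path: OutputPath('PKL')):
    # Download the CSV and persist it as a pickle file at the
    # location KFP provides through OutputPath().
    import pandas as pd
    import pickle

    df = pd.read_csv(url)
    with open(data_path, "wb") as f:
        pickle.dump(df, f)

def eda_data(data_path: InputPath('PKL')):
    # Receive the location written by the previous step and
    # load the pickle file back into a DataFrame.
    import pickle

    with open(data_path, "rb") as f:
        df = pickle.load(f)
    print(df["target"].value_counts())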

Step 3: Converting the functions into components

In order to convert functions into components we will use:

kfp.components.create_component_from_func()

This function takes three main arguments:

  • The function to be converted into a component
  • The list of packages to install in the component’s container, passed via the packages_to_install=[] argument
  • output_component_file, the path of a .yaml file where the reusable component specification is written
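Continuing the sketch above (the package list and .yaml file names are assumptions for illustration), the conversion might look like this:

import kfp.components as comp

load_data_op = comp.create_component_from_func(
    load_data,                                          # function to convert
    packages_to_install=["pandas"],                     # packages the component's container needs
    output_component_file="load_data_component.yaml",   # reusable component spec
)

eda_data_op = comp.create_component_from_func(
    eda_data,
    packages_to_install=["pandas"],
    output_component_file="eda_data_component.yaml",
)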

Step 4: Defining the pipeline function

We are now ready to define the pipeline using the @dsl.pipeline decorator. We give the pipeline a name and a description, then define the pipeline function. Its arguments are used as inputs to the components created earlier, and the output of one component is passed as an input argument to the next.
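A minimal sketch, assuming the two components created above; the pipeline name, description and wiring are illustrative and do not reflect the full graph from the notebook.

import kfp.dsl as dsl

@dsl.pipeline(
    name="disaster-tweets-pipeline",
    description="Classify tweets as real disasters or not",
)
def disaster_tweets_pipeline(url: str):
    # The output of one step feeds the next, which also
    # defines the execution order of the components.
    load_data_task = load_data_op(url)
    eda_data_task = eda_data_op(load_data_task.output)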

Step 5: Running the pipeline

To run the pipeline, we create a kfp.Client() object and call its create_run_from_pipeline_func() method, passing it the pipeline function along with a dictionary of the arguments it requires as input.
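Roughly, this final step might look like the following sketch; the dataset URL is a placeholder, and kfp.Client() is assumed to be running inside the Kubeflow cluster so no endpoint needs to be configured.

import kfp

client = kfp.Client()  # connects to the Kubeflow Pipelines API from inside the cluster

client.create_run_from_pipeline_func(
    disaster_tweets_pipeline,                             # the pipeline function defined above
    arguments={"url": "https://example.com/train.csv"},   # placeholder dataset URL
)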

To run the trial notebook, follow these steps:

  • Go to the Kubeflow dashboard and on the left panel click on Notebooks.
  • Click on the + New Notebook button on the top right and create a notebook by giving it a name.
  • Change the workspace volume from 5 GB to 50 GB, and change the requested memory to 6 GB.
  • After the notebook setup is done, click on the Connect button next to the notebook you just created. It will automatically take you to the JupyterLab UI.
  • Inside of the JupyterLab launcher, start a new terminal session to clone the GitHub repo. In the terminal enter the following command:

$ git clone https://github.com/kubeflow/examples

  • After successfully cloning the repo, open up the examples/natural-language-processing-with-disaster-tweets-kaggle-competition directory
  • Open the notebook named “natural-language-processing-with-disaster-tweets-kfp.ipynb”
  • To run this notebook, click the button that restarts the kernel and re-runs the whole notebook (the “fast-forward” icon) in the notebook’s top menu
  • View run details immediately after submitting the pipeline

Is There an Easier Way to Create a Kubeflow Pipeline?

You bet! If you want to automate most of the steps illustrated in the previous example, then we recommend making use of Kale, an open source JupyterLab extension. Kale is built right into Kubeflow as a Service and provides a simple UI for defining Kubeflow Pipelines directly from your JupyterLab notebook, without the need to change a single line of code or build and push Docker images. In this next example, we’ll show you just how easy it is.

Understanding Kale Tags

With Kale you annotate cells (which are logical groupings of code) inside your Jupyter notebook with tags. These tags tell Kubeflow how to interpret the code contained in the cell, what dependencies exist and what functionality is required to execute the cell.

Prepare the environment

  • Launch a Notebook server
  • Bump the workspace volume to 60 GB and the memory to 12 GB
  • Open up a terminal session
  • Clone the kubeflow/examples GitHub repo

git clone https://github.com/kubeflow/examples

  • Navigate into the following directory:

examples/natural-language-processing-with-disaster-tweets-kaggle-competition

  • Open the natural-language-processing-with-disaster-tweets-kale.ipynb notebook

Step 1: Annotate the notebook with Kale tags

  • In the left-hand sidebar of your notebook, click on the Kale logo and enable it
  • After enabling Kale, give the pipeline a name and description
  • Next, click on the edit button on the right-hand side of each code cell, select the cell type, give the cell a name and select the pipeline step it depends on
  • Select pipeline_step as the cell type for all pipeline steps, and skip for cells you want to leave out of the pipeline
  • Note: To make sure the pipeline works correctly, don’t forget to specify the step each cell depends on

For example, in the screenshot below we annotate the code block with class_distribution and specify that it depends on the load_data step.

Here’s the complete list of annotations for the notebook, along with the steps they depend on:

Notebook Tag | Annotation Used | Step Dependency
Importing Libraries | imports | N/A
Load data | load_data | N/A
Exploratory Data Analysis of tweets | eda_data | load_data
Data Cleaning | preprocess_data | eda_data
Corpus Creation | corpus_creation | preprocess_data
Embedding Step | embedding_step | corpus_creation
Baseline Model | final_model | embedding_step
Training Model | train_model | final_model

Step 2: Run the Kubeflow Pipeline

Once you’ve tagged your notebook:

  • Install the necessary Python packages

pip install --user -r requirements.txt

  • Restart the notebook kernel when prompted in the output
  • Apply the “Skip Cell” annotation to the Python packages install section we just ran

Now click on the Compile and Run button in the Kale widget. Kale will perform the following tasks for you:

  • Validate the notebook
  • Take a snapshot
  • Compile the notebook
  • Upload the pipeline
  • Run the pipeline

In the Running pipeline output, click on the View hyperlink. This will take you directly to the runtime execution graph where you can watch your pipeline execute and update in real time.

That’s it! You now have Kaggle’s NLP with Disaster Tweets competition running as a reproducible Kubeflow Pipeline, with less code and fewer steps thanks to Kale.

By using Kale we eliminated the following steps from the previous example:

  • Defining complex functions and storing inputs and outputs
  • Volume creation and optimization
  • Defining libraries required inside each function
  • Passing data between the pipeline components
  • Writing KFP DSL code to define the pipeline

You can find the data, files and notebook for this example on GitHub here.

What’s Next?