• <xmp id="om0om">
  • Data Science

    Topic Modeling and Image Classification with Dataiku and NVIDIA Data Science

    Twitter topic model Dataiku diagram

    The Dataiku platform for everyday AI simplifies deep learning. Use cases are far-reaching, from image classification to object detection and natural language processing (NLP). Dataiku helps you with labeling, model training, explainability, model deployment, and centralized management of code and code environments.

    This post dives into high-level Dataiku and NVIDIA integrations for image classification and object detection. It also covers deep learning model deployment for real-time inference and how to use open source RAPIDS and cuML libraries for a customer support Tweet topic modeling use case. NVIDIA provides the hardware (NVIDIA A10 Tensor Core GPUs, in this case) and various OSS (CUDA, RAPIDS) to get the job done.   

    Note that all of the NVIDIA AI software featured in this post is available with NVIDIA AI Enterprise, a secure, end-to-end software suite for production AI, with enterprise support from NVIDIA. 

    Deep learning for image classification and object detection

    This section walks through the steps to train and evaluate a deep learning model for image classification or object detection using Dataiku and NVIDIA GPUs. 

    A no-code approach

    Starting with Dataiku 11.3, you can use visual, no-code tools to deliver on the core areas of an image classification or object detection workflow. You can label images, draw bounding boxes, and review/govern annotations using a native web app (Figure 1). Image labeling is key to training performant models: good data in → good model out.

    Screenshot of Dataiku’s image labeling tool.
    Figure 1. With Dataiku’s image labeling tool, you can label all cats as “cat,” or with more granularity to fit unique appearance or personality characteristics

    Dataiku enables you to train image classification and object detection models using transfer learning, which fine-tunes pretrained models on your custom images, labels, and bounding boxes. Data augmentation (recoloring, rotating, and cropping training images) is a common way to increase the size of the training set and expose a model to a variety of situations (Figure 2).

    Screenshot of Dataiku’s image augmentation options.
    Figure 2. Using image augmentation, you can account for what may be unanticipated in your model, like cat discos and upside-down camera shots
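
    To make the augmentation idea concrete, here is a minimal sketch in plain Python that treats a grayscale image as a list of pixel rows. It is not Dataiku's implementation; the function names, the 50% flip probability, and the brightness range are assumptions chosen for illustration.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Keep a height x width window whose corner is at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def adjust_brightness(image, delta):
    """Shift every pixel by delta, clamped to the 0-255 range."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in image]


def augment(image, rng):
    """Return a randomly flipped, rotated, and recolored copy of the image."""
    augmented = [row[:] for row in image]
    if rng.random() < 0.5:  # assumed flip probability
        augmented = flip_horizontal(augmented)
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        augmented = rotate_90_clockwise(augmented)
    return adjust_brightness(augmented, rng.randint(-20, 20))


image = [[10, 20, 30], [40, 50, 60]]
rng = random.Random(0)  # seeded so the variants are reproducible
training_variants = [augment(image, rng) for _ in range(4)]
```

    Each variant keeps the original label, so one labeled image can yield several training examples.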

    EfficientNet (image classification) and Faster R-CNN (object detection) neural networks can be used with pretrained weights in the model retraining user interface, out of the box.

    After training a model on custom image labels and bounding boxes, you can overlay a heat map on an image to see where the model focused, which helps explain its predictions (Figure 3).

    Screenshot of Dataiku’s image classification model interpretation tool.
    Figure 3. Heat maps show which parts of an image led to a particular prediction from the model
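
    The post does not say which method Dataiku uses to build these heat maps. As a rough intuition only, the sketch below uses occlusion sensitivity, a simple stand-in technique: mask one patch at a time and record how much the model's score drops. The `toy_cat_score` function is a hypothetical placeholder for a real model.

```python
def occlusion_heatmap(image, score_fn, patch=2, fill=0):
    """Map each pixel to the score drop seen when its patch is masked out."""
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    heatmap = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - score_fn(occluded)
            for r in rows:
                for c in cols:
                    heatmap[r][c] = drop
    return heatmap


def toy_cat_score(image):
    """Hypothetical model: responds only to brightness in the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
heatmap = occlusion_heatmap(image, toy_cat_score)
```

    Large values mark regions the prediction depends on; here only the top-left patch lights up, because that is the only region the toy model reads.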

    Once you are comfortable with the model’s performance, deploy the trained model as a containerized inference service to a Kubernetes cluster. This is managed by the Dataiku API Deployer tool.

    Where is the compute happening?

    Dataiku can push all the compute behind deep learning model training, explanations, and inference to NVIDIA GPUs (Figure 4). You can even leverage multiple GPUs for distributed training through the PyTorch DistributedDataParallel module and TensorFlow MirroredStrategy.

    Screenshot of Dataiku interface for activating GPUs for deep learning model training.
    Figure 4. Use the Dataiku interface to activate NVIDIA GPUs for deep learning model training
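
    For intuition about what data-parallel training across multiple GPUs does, the sketch below simulates it in plain Python: each worker computes a gradient on its own shard of the batch, and the gradients are averaged before the weight update. This illustrates the general idea only, not the PyTorch or TensorFlow API; the one-parameter model and all names are assumptions.

```python
def gradient(weight, batch):
    """Gradient of mean squared error for the one-parameter model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(weight, batch, num_workers, learning_rate):
    """One update step with the batch split evenly across simulated workers."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # On real hardware each shard's gradient is computed on a different GPU.
    local_gradients = [gradient(weight, shard) for shard in shards]
    # Averaging plays the role of the all-reduce that keeps replicas in sync.
    averaged = sum(local_gradients) / num_workers
    return weight - learning_rate * averaged


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
weight = 0.0
for _ in range(200):
    weight = data_parallel_step(weight, data, num_workers=2, learning_rate=0.01)
```

    With equal-size shards, the averaged gradient matches the full-batch gradient, so adding workers changes how fast a step runs rather than what it computes.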

    Pushing this compute to NVIDIA GPUs happens through the Dataiku Elastic AI integrations. First, connect your Dataiku instance to a Kubernetes cluster with NVIDIA GPU resources (managed through EKS, GKE, AKS, or OpenShift). Dataiku then builds the Docker image and deploys containers behind the scenes.

    Deep learning training and inference jobs can run on a Kubernetes cluster, as well as arbitrary Python code or Apache Spark jobs (Figure 5).

    Diagram of Dataiku’s ability to run the compute for different processes in containers on a Kubernetes cluster.
    Figure 5. Dataiku is the interface, or orchestration layer, that pushes deep learning and other workloads to a Kubernetes cluster, where the compute happens

    Coding model training scripts

    If you want to custom code your own deep learning models in Python, try wrapping a train function in an MLflow experiment tracker. Figure 6 shows a Python-based flow. See the machine learning tutorial in the Dataiku Developer Guide for an example. This approach provides the full flexibility of custom code, along with some of the out-of-the-box experiment tracking, model analysis visualizations, and point-and-click model deployment available for visually trained models in Dataiku.

    Screenshot of a Dataiku flow to train a custom python model with MLflow. The workflow uses Python recipes to train a custom image classification model using PyTorch and MLflow, then saves trained model versions into a folder, then imports the best one into a Dataiku green diamond model.
    Figure 6. A best-practice Dataiku workflow to train a custom Python model with MLflow
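
    A minimal sketch of wrapping a train function in an MLflow run might look like the following. The toy model, function names, and hyperparameters are assumptions for illustration, and the Dataiku-specific setup that points MLflow at a project is omitted; see the Dataiku Developer Guide for that part.

```python
from typing import Callable


def train(epochs: int, learning_rate: float,
          log_metric: Callable[[str, float, int], None]) -> float:
    """Toy training loop that fits y = w * x and reports the loss every epoch."""
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    w = 0.0
    for epoch in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        log_metric("train_loss", loss, epoch)
    return w


def train_with_mlflow(epochs: int = 20, learning_rate: float = 0.05) -> float:
    """Run the same loop inside an MLflow run so parameters and metrics are tracked."""
    import mlflow  # imported lazily so train() stays usable without MLflow installed

    with mlflow.start_run():
        mlflow.log_param("epochs", epochs)
        mlflow.log_param("learning_rate", learning_rate)
        w = train(epochs, learning_rate,
                  lambda key, value, step: mlflow.log_metric(key, value, step=step))
        mlflow.log_metric("final_weight", w)
    return w


# Dry run with an in-memory logger instead of MLflow.
history = []
final_weight = train(20, 0.05, lambda key, value, step: history.append((key, value, step)))
```

    Passing the logger in as a callable keeps the training code independent of the tracker, so the same function can be tested locally and tracked in production.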

    Custom Python deep learning models can leverage NVIDIA GPUs through containerized execution, just as visually trained deep learning models in Dataiku do (Figure 7).

    Screenshot of a Python script in Dataiku and containerized compute options.
    Figure 7. Any Python workload in Dataiku can be pushed to a Kubernetes cluster with NVIDIA GPU resources

    Model deployment for real-time inference

    Once the model is trained, it is time to deploy it for real-time inference. If you trained the model with Dataiku’s visual image classification or object detection tools, or custom coded it with MLflow and then imported it as a Dataiku model, all it takes is a few clicks to create a containerized inference API service on top of the trained model.

    First, connect the Dataiku API Deployer tool to a Kubernetes cluster to host these inference API services, again with NVIDIA GPUs available in the cluster nodes. Then deploy 1-N replicas of the containerized service behind a load balancer. From here, edge devices can send requests to the API service and receive predictions back from the model. Figure 8 shows this whole architecture.

    Diagram showing Dataiku trained model > create API service in the API Designer > push the API service to the Deployer > push the API Service to a K8S cluster with NVIDIA GPU resources. From there, edge devices can submit requests to the API service with data, images, and receive predictions back.
    Figure 8. Workflow from a model trained in Dataiku to an API service hosted on a Kubernetes cluster with NVIDIA GPUs for inference
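
    From the client side, a request to such a service could be built roughly as follows. This is a sketch under stated assumptions: the host, service ID, endpoint ID, URL path, and payload shape below are placeholders, so check the API Designer for the exact request format your endpoint expects.

```python
import json
import urllib.request


def build_prediction_request(base_url, service_id, endpoint_id, features):
    """Build a JSON POST request for a prediction endpoint (assumed URL layout)."""
    url = f"{base_url.rstrip('/')}/public/api/v1/{service_id}/{endpoint_id}/predict"
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(request, timeout=10.0):
    """Send the request and decode the JSON response (needs a reachable service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical host, service, endpoint, and payload.
request = build_prediction_request(
    "http://api-node.example.com:12000",
    "image_service",
    "classify",
    {"image": "<base64-encoded image>"},
)
```

    Calling `predict(request)` sends the request; it is left out here because it needs a live service behind the load balancer.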

    Tutorial: Accelerate topic modeling using BERT models with RAPIDS in Dataiku

    For a deeper dive, this section walks through how to set up a Python environment in Dataiku to use BERTopic with the GPU-accelerated cuML library from RAPIDS. It also highlights the performance gain from using cuML.

    This example uses the Kaggle Customer Support on Twitter dataset and identifies key customer complaint themes with topic modeling.

    Step 1. Prepare the dataset

    First, normalize the Tweet text by removing punctuation and stop words and by stemming the remaining words. Also filter the dataset to the complaints customers tweeted in English. All of this can be done using Dataiku visual recipes. Figure 9 shows the workflow in Dataiku.

    Flow in Dataiku is the Interface to build Project Pipelines. This is the image of the Data Preparation Flow, beginning with tweets on the left and ending with tm_train, tm_test, and tweet_response_by_org on the right.
    Figure 9. Building a project workflow using Dataiku Flow

    Use a split recipe to filter the company’s replies from initial user Tweets. Next, use Dataiku’s Text Preparation plugin recipe to detect language distribution across user Tweets. Figure 10 shows the distribution of Tweets by language.

    Distribution of Tweet data by language: English, Spanish, French, Portuguese, Dutch, Chinese, German, Turkish, Latin, Italian, Japanese.
    Figure 10. Distribution of Tweet data by language

    Use a filter recipe to filter out all non-English and blank Tweets. Then use a text preparation recipe to remove stop words, punctuation, URLs, emojis, and so on, and to convert the text to lowercase (Figure 11).

    Screenshot of Text Cleaning configuration with all list of Tokens to be filtered out, including punctuation, URL, email, and more.
    Figure 11. Configuration of the Dataiku plugin for text cleaning 
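
    For readers who prefer code to visual recipes, the cleaning steps above can be approximated in plain Python. This is a sketch, not the plugin's logic: the stop word list is a tiny illustrative subset, stemming is left out, and the regular expressions are assumptions.

```python
import re

# Tiny illustrative stop word list; a real pipeline would use a fuller one.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "for", "i", "in", "is", "it",
    "my", "of", "on", "that", "the", "this", "to", "was",
})
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHANUMERIC = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text):
    """Lowercase a Tweet and drop URLs, punctuation, emojis, and stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    # After lowercasing, anything outside a-z, 0-9, and whitespace is removed,
    # which covers punctuation and emojis.
    text = NON_ALPHANUMERIC.sub(" ", text)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_tweet("My phone is BROKEN!!! 😡 see https://t.co/abc123")
print(cleaned)  # phone broken see
```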

    Finally, use a split recipe to split the data for training and testing (simple 80% / 20% random split).
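
    The same 80% / 20% random split can be sketched in a few lines of Python. The seed and function name are assumptions, and Dataiku's split recipe may assign rows differently (for example, per-row random assignment rather than an exact cut).

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows reproducibly and cut them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


tweets = [f"tweet {i}" for i in range(10)]
train_rows, test_rows = train_test_split(tweets)
```

    Fixing the seed keeps the split stable between runs, which makes runtime and topic comparisons easier to reproduce.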

    Step 2. Set up Python environment with BERTopic and RAPIDS library

    Running the Python processes requires an elastic compute environment with NVIDIA GPUs, the Python BERTopic package (and its required packages), and a RAPIDS container image. This example uses an Amazon EKS cluster (instance type: g4dn with NVIDIA A10 Tensor Core GPUs), RAPIDS Release Stable 22.12, and BERTopic (0.12.0).

    First, launch an EKS Cluster in Dataiku. Once the cluster is set up, you can check its status and configuration in the Clusters Tab under Administration.

    Screenshot of the Admin Dashboard to manage EKS Clusters in Dataiku.
    Figure 12. Dataiku dashboard of configured clusters

    BERTopic

    Create a Dataiku code environment with BERTopic and its required packages using Dataiku’s managed virtual code environments.

    RAPIDS

    Build a container environment using the RAPIDS image from Docker Hub. In Dataiku, use either the Dataiku base image for your code environment, or download custom container images from Docker Hub or NGC. Then, attach your Dataiku code environment to it. Note that NVIDIA has released RAPIDS on PyPI, so you can now just use the default Dataiku base image.

    Step 3. Running BERTopic with default UMAP 

    Next, use BERTopic to source the top five topics from the Twitter complaints. To accelerate the UMAP process on GPU, use cuML UMAP. The code with the default UMAP is provided below:

    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    from bertopic import BERTopic
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Read the training dataset into a DataFrame; the project variable sample_size sets how many records to use
    # (cast to int in case the variable is returned as a string)
    sample_size = int(dataiku.get_custom_variables()["sample_size"])
    train_data = dataiku.Dataset("train_cleaned")
    train_data_df = train_data.get_dataframe(sampling='head', limit=sample_size)
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Create a BERTopic object (default CPU UMAP) and run fit_transform
    topic_model = BERTopic(calculate_probabilities=True, nr_topics=4)
    topics, probs = topic_model.fit_transform(train_data_df["Review Description_cleaned"])
    all_topics_df = topic_model.get_topic_info()
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Write the list of topics as a DSS dataset
    
    topic_model_dataset = dataiku.Dataset("Topic_Model")
    topic_model_dataset.write_with_schema(all_topics_df)
    
    
    RAPIDS cuML UMAP:
    
    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    from bertopic import BERTopic
    from cuml.manifold import UMAP  # GPU-accelerated UMAP from RAPIDS cuML
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Read the training dataset into a DataFrame; the project variable sample_size sets how many records to use
    # (cast to int in case the variable is returned as a string)
    sample_size = int(dataiku.get_custom_variables()["sample_size"])
    train_data = dataiku.Dataset("train_cleaned")
    train_data_df = train_data.get_dataframe(sampling='head', limit=sample_size)
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Create a cuML UMAP object, pass it to the BERTopic object, and run fit_transform
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    cu_topic_model = BERTopic(calculate_probabilities=True, umap_model=umap_model, nr_topics=4)
    cu_topics, cu_probs = cu_topic_model.fit_transform(train_data_df["Review Description_cleaned"])
    all_topics_rapids_df = cu_topic_model.get_topic_info()
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    # Write the list of topics as a DSS dataset
    
    topic_model_w_rapids = dataiku.Dataset("Topic_Model_w_Rapids")
    topic_model_w_rapids.write_with_schema(all_topics_rapids_df)

    UMAP is a substantial contributor to overall compute time. Running UMAP on an NVIDIA GPU with RAPIDS cuML resulted in a 4x performance speedup. Additional improvement can be achieved by running more of the algorithm on GPU, such as with cuML HDBSCAN.

    Topic modeling process without/with RAPIDS | Runtime
    Without RAPIDS                             | 12 minutes 21 seconds
    With RAPIDS                                | 2 minutes 59 seconds
    Table 1. Configuring with RAPIDS AI results in a 4x performance speedup
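
    As a quick check of the headline figure, the speedup can be computed directly from the two runtimes in Table 1. The helper name is illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds runtime to total seconds."""
    return minutes * 60 + seconds


without_rapids_seconds = to_seconds(12, 21)  # 741 seconds
with_rapids_seconds = to_seconds(2, 59)      # 179 seconds

speedup = without_rapids_seconds / with_rapids_seconds
saved_seconds = without_rapids_seconds - with_rapids_seconds

print(f"Speedup: {speedup:.2f}x")  # Speedup: 4.14x
print(f"Time saved: {saved_seconds // 60} min {saved_seconds % 60} s")  # Time saved: 9 min 22 s
```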

    Step 4. Complaint Clustering Dashboard

    Finally, you can build charts on the output datasets (with cleaned Tweet text and topics) in Dataiku and push them to a dashboard for executive team review (Figure 13).

    Dataiku dashboard displaying a variety of metrics, including a pie chart, bar graph, and word cloud.
    Figure 13. Dataiku dashboard displaying a variety of metrics in one central location

    Putting it all together

    If you are looking to use deep learning for an image classification, object detection, or NLP use case, Dataiku helps you with labeling, model training, explainability, model deployment, and centralized management of code and code environments. Tight integrations with the latest NVIDIA data science libraries and hardware for compute make for a complete stack.

    Additional resources

    Check out the resources below to learn more.
