
    Training Instance Segmentation Models Using Mask R-CNN on the NVIDIA TAO Toolkit

    To convert pixels to actionable insights, computer vision relies on deep learning to provide an understanding of the environment. Object detection is a commonly used technique to identify individual objects in a frame such as to identify people or cars. While object detection is beneficial for certain applications, it falls short when you want to understand objects at a pixel level.

    Instance segmentation is a popular computer vision technique that identifies each instance of multiple objects in a frame at the pixel level. In addition to bounding boxes, instance segmentation produces a fine-grained segmentation mask for each instance. Segmentation helps delineate objects from the background, for example in an AI-powered green screen, where you want to blur or replace the background of the frame, or when segmenting roads or sky. You might also use it to pinpoint manufacturing defects or to segment nuclei in microscopy images. Figure 1 shows an example of segmentation masks over detected objects.

    Figure 1. Instance segmentation: segmentation masks over detected objects (here, cars detected on a street).

    Transfer learning is a common practice in training specialized deep neural network (DNN) models. Transfer learning is made easier with the NVIDIA TAO Toolkit, a zero-coding framework for training accurate and optimized DNN models. With the release of TAO Toolkit 2.0, NVIDIA added training support for instance segmentation using Mask R-CNN. You can train Mask R-CNN models using one of several ResNet backbones. Pretrained weights for ResNet10/18/50/101 are provided on NGC and can be used as a starting point for transfer learning.

    In this post, I show you how to train a 90-class COCO Mask R-CNN model with TAO Toolkit and deploy it on the NVIDIA DeepStream SDK using TensorRT. You learn how to access and use pretrained models from NGC, train a Mask R-CNN model with minimal effort, and deploy it for inference on a GPU. These steps can be used to build any custom Mask R-CNN model.

    Mask R-CNN is natively integrated with the DeepStream SDK, a streaming analytics toolkit for building intelligent video analytics applications. For more information about how Mask R-CNN integrates with DeepStream, see Building Intelligent Video Analytics Apps Using NVIDIA DeepStream 5.0 (Updated for GA).

    Training a Mask R-CNN model using COCO

    Mask R-CNN is a two-stage, object detection and segmentation model introduced in 2017. It’s an excellent architecture due to its modular design and is suitable for various applications. In this section, I walk you through reproducible steps to take pretrained models from NGC and an open-source COCO dataset and then train and evaluate the model using TAO Toolkit.

    To get started, set up an NVIDIA NGC account and then pull the TAO Toolkit container:

    docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

    Next, download the pretrained model. Use the NGC command to list the available models:

    ngc registry model list nvidia/tlt_instance_segmentation:*

    To download the desired model, use the following command. For this post, I used the ResNet50 backbone, but you are free to use any of the supported backbones.

    ngc registry model download-version nvidia/tlt_instance_segmentation:resnet50 --dest $model_path

    The entire workflow consists of the following steps:

    1. Preparing the data.
    2. Configuring the spec file.
    3. Training the model.
    4. Evaluating the model.
    5. Verifying the model.
    6. Exporting the model.
    7. Deploying with DeepStream.

    Preparing the data

    Mask R-CNN expects a directory of images for training and validation, along with annotations in COCO format. TFRecords are used to manage the data and help you iterate faster. To download the COCO dataset and convert it to TFRecords, the Mask R-CNN iPython notebook in the TAO Toolkit container provides a script called download_and_preprocess_coco.sh. If you are using a custom dataset, you must convert the annotations to COCO format before using them with TAO Toolkit. For more information, see COCO data format.
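
    If you are preparing a custom dataset rather than COCO, the annotation file must follow the COCO schema before conversion. The following is a minimal, hypothetical sketch of a single-image COCO-style annotation file; the file names, IDs, and category values are placeholders for illustration and are not part of TAO Toolkit:

    import json

    # Minimal, illustrative COCO-style annotation file for a custom dataset.
    # Field names follow the COCO schema; all values here are placeholders.
    coco = {
        "images": [
            {"id": 1, "file_name": "frame_0001.jpg", "height": 832, "width": 1344}
        ],
        "annotations": [
            {
                "id": 1,
                "image_id": 1,
                "category_id": 1,
                "bbox": [100.0, 150.0, 200.0, 120.0],  # [x, y, width, height]
                "area": 200.0 * 120.0,
                "iscrowd": 0,
                # Polygon segmentation: [x1, y1, x2, y2, ...] per polygon
                "segmentation": [[100.0, 150.0, 300.0, 150.0,
                                  300.0, 270.0, 100.0, 270.0]],
            }
        ],
        "categories": [{"id": 1, "name": "car", "supercategory": "vehicle"}],
    }

    with open("instances_train_custom.json", "w") as f:
        json.dump(coco, f)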

    Download the COCO dataset and convert to TFRecords:

    bash download_and_preprocess_coco.sh $DATA_DIR

    This downloads the raw COCO17 dataset and converts it to TFRecords in $DATA_DIR.

    Configuring the spec file

    The next step is to configure the spec file for training. The experiment spec file is essential, as it compiles all the necessary hyperparameters for achieving a good model. The Mask R-CNN spec file has three major components: top-level experiment configs, data_config, and maskrcnn_config. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message.

    The top-level experiment configs include basic parameters for an experiment; for example, the learning rate, the number of iterations, and whether to use mixed-precision training. An encrypted checkpoint is saved every num_steps_per_eval steps, after which evaluation is run on the validation set.

    The init_learning_rate value is set here for an 8-GPU training job. If you are using a different number of GPUs, adjust the learning rate following the linear scaling rule.

    use_amp: False
    warmup_steps: 1000
    checkpoint: "$PRETRAINED_MODEL_PATH"
    learning_rate_steps: "[60000, 80000, 90000]"
    learning_rate_decay_levels: "[0.1, 0.01, 0.001]"
    total_steps: 100000
    train_batch_size: 3
    eval_batch_size: 8
    num_steps_per_eval: 10000
    momentum: 0.9
    l2_weight_decay: 0.00002
    warmup_learning_rate: 0.0001
    init_learning_rate: 0.02
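
    As a quick illustration of the linear scaling rule for init_learning_rate mentioned earlier, the sketch below derives a starting learning rate for a different GPU count from the 8-GPU value in the spec. This is only a back-of-the-envelope helper, not part of TAO Toolkit:

    # Linear scaling rule (sketch): scale the 8-GPU learning rate by the ratio
    # of your GPU count to the reference count of 8 used in the spec above.
    REFERENCE_GPUS = 8
    REFERENCE_LR = 0.02  # init_learning_rate tuned for 8 GPUs

    def scaled_learning_rate(num_gpus: int) -> float:
        return REFERENCE_LR * num_gpus / REFERENCE_GPUS

    print(scaled_learning_rate(4))  # 0.01   -> init_learning_rate for a 4-GPU job
    print(scaled_learning_rate(1))  # 0.0025 -> init_learning_rate for a 1-GPU job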

    The data_config value specifies the input data source and dimensions. Augmentation (augment_input_data) is applied only during training and is recommended for achieving higher accuracy. The num_classes value is the number of categories in your ground truth plus 1 for the background class. Input images are resized and padded to image_size while keeping the aspect ratio.

    data_config{
        image_size: "(832, 1344)"
        augment_input_data: True
        eval_samples: 5000
        training_file_pattern: "$DATA_DIR/train*.tfrecord"
        validation_file_pattern: "$DATA_DIR/val*.tfrecord"
        val_json_file: "$DATA_DIR/annotations/instances_val2017.json"
        num_classes: 91
        skip_crowd_during_training: True
    }
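
    If you are unsure what to set num_classes to for your own annotations, a quick sanity check is to inspect the categories in the COCO-format JSON and add 1 for the background class. The helper below is only a sketch (the annotation path is a placeholder); for COCO, category IDs run up to 90 even though only 80 classes are annotated, which is why the spec uses 91:

    import json

    # Count the highest category ID in a COCO-format annotation file and add 1
    # for the background class. Adjust the path to your own $DATA_DIR layout.
    with open("annotations/instances_val2017.json") as f:
        categories = json.load(f)["categories"]

    num_classes = max(c["id"] for c in categories) + 1  # 90 + 1 = 91 for COCO
    print(num_classes)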

    The maskrcnn_config value specifies the model structure and loss-function-related hyperparameters. Currently, Mask R-CNN supports all ResNet backbones in TAO Toolkit. In this experiment, you choose ResNet50 as the backbone, with its first two convolutional blocks and all batch normalization (BN) layers frozen, as specified by freeze_bn: True and freeze_blocks: "[0,1]". With a frozen convolutional layer, the weights do not change in the downstream task. This is especially helpful in transfer learning, where generic features are already captured in the shallow layers: you not only reuse the learned features but also reduce the training time. For more information about each field, see the TAO Toolkit Getting Started Guide.

    maskrcnn_config {
            nlayers: 50
            arch: "resnet"
            freeze_bn: True
            freeze_blocks: "[0,1]"
            gt_mask_size: 112
           
            # Region Proposal Network
            rpn_positive_overlap: 0.7
            rpn_negative_overlap: 0.3
            rpn_batch_size_per_im: 256
            rpn_fg_fraction: 0.5
            rpn_min_size: 0.
     
            # Proposal layer.
            batch_size_per_im: 512
            fg_fraction: 0.25
            fg_thresh: 0.5
            bg_thresh_hi: 0.5
            bg_thresh_lo: 0.
     
            # Faster-RCNN heads.
            fast_rcnn_mlp_head_dim: 1024
            bbox_reg_weights: "(10., 10., 5., 5.)"
     
            # Mask-RCNN heads.
            include_mask: True
            mrcnn_resolution: 28
     
            # training
            train_rpn_pre_nms_topn: 2000
            train_rpn_post_nms_topn: 1000
            train_rpn_nms_threshold: 0.7
     
            # evaluation
            test_detections_per_image: 100
            test_nms: 0.5
            test_rpn_pre_nms_topn: 1000
            test_rpn_post_nms_topn: 1000
            test_rpn_nms_thresh: 0.7
     
            # model architecture
            min_level: 2
            max_level: 6
            num_scales: 1
            aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
            anchor_scale: 8
     
            # localization loss
            rpn_box_loss_weight: 1.0
            fast_rcnn_box_loss_weight: 1.0
            mrcnn_weight_loss_mask: 1.0
    }
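
    To build some intuition for the anchor-related fields (min_level, max_level, num_scales, aspect_ratios, and anchor_scale), the sketch below enumerates the base anchor sizes that a feature pyramid would generate under the common convention that the anchor edge at level L is roughly anchor_scale x 2^L input pixels. This is an illustration of that convention, assuming the implementation follows it, not code taken from TAO Toolkit:

    # Illustrative FPN anchor sizes, assuming the common convention that the
    # base anchor edge at pyramid level L is anchor_scale * 2**L input pixels
    # and that each aspect ratio scales the (height, width) of that base box.
    MIN_LEVEL, MAX_LEVEL = 2, 6
    ANCHOR_SCALE = 8
    ASPECT_RATIOS = [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]

    for level in range(MIN_LEVEL, MAX_LEVEL + 1):
        base = ANCHOR_SCALE * 2 ** level
        anchors = [(round(base * h), round(base * w)) for h, w in ASPECT_RATIOS]
        print(f"level {level}: stride {2 ** level:3d}, anchors (h, w) = {anchors}")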

    Training the model

    When the data and spec file are ready, you can start the training with the following command:

    tlt-train mask_rcnn -e $spec_file_path -r $experiment_dir -k $KEY --gpus N

    Training with more GPUs allows networks to ingest more data faster, saving you precious time during the development process. TAO Toolkit supports multi-GPU training so that you can train the model with several GPUs in parallel. If automatic mixed-precision (AMP) training is enabled by setting use_amp to True in the spec, you can expect a 20–50% speed boost compared to FP32 training. During training, a detailed log records the training loss every five iterations, as well as the evaluation metrics on the validation set.

    throughput: 34.4 samples/sec
    ==================== Metrics =====================
    FastRCNN box loss: 0.27979
    FastRCNN class loss: 0.11633
    FastRCNN total loss: 0.39612
    L2 loss: 0.83087
    Learning rate: 0.00014
    Mask loss: 1.3277
    RPN box loss: 0.03868
    RPN score loss: 0.60576
    RPN total loss: 0.64443
    Total loss: 3.19912

    If, for any reason, the training process is interrupted, you can resume training by executing the same command. It automatically picks up from the last saved checkpoint.

    Evaluating the model

    To evaluate the model, use the following command:

    tlt-evaluate mask_rcnn -e $spec_file_path -m $model_path -k $KEY

    Mask R-CNN reports COCO's detection evaluation metrics. For example, AP50 is the average precision (AP) computed at an IoU threshold of 50%.

    The mAP metric shared by detection frameworks, adopted from Pascal VOC, is comparable to AP50. Classification models in TAO Toolkit report a different set of metrics, including top-K accuracy, precision, recall, and the confusion matrix.
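
    As a refresher on what the IoU threshold in these metrics means, here is a small, self-contained sketch of the intersection-over-union computation for two axis-aligned boxes; it is illustrative and not part of the TAO Toolkit evaluation code:

    # Intersection over union (IoU) for two axis-aligned boxes in
    # [x1, y1, x2, y2] format. A detection counts as a true positive for AP50
    # when its IoU with a matched ground-truth box is at least 0.5.
    def iou(box_a, box_b):
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    print(iou([0, 0, 100, 100], [50, 0, 150, 100]))  # ~0.333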

    After training for 100K iterations using 8 GPUs, you can observe the following metrics:

    =========== Metrics ===========
    AP: 0.334154785
    AP50: 0.539312243
    AP75: 0.358969182
    APl: 0.453923374
    APm: 0.354732722
    APs: 0.181649670
    ARl: 0.661920488
    ARm: 0.533207536
    ARmax1: 0.297426522
    ARmax10: 0.477609098
    ARmax100: 0.503548384
    ARs: 0.317135185
    mask_AP: 0.307278961
    mask_AP50: 0.505144179
    mask_AP75: 0.325496018
    mask_APl: 0.432014465
    mask_APm: 0.327025950
    mask_APs: 0.151430994
    mask_ARl: 0.626315355
    mask_ARm: 0.492682129
    mask_ARmax1: 0.281772077
    mask_ARmax10: 0.439913362
    mask_ARmax100: 0.461205393
    mask_ARs: 0.271702766

    These KPIs were obtained by fine-tuning the pretrained model from NGC, which was originally trained on a subset of the Open Images dataset. If you train from ImageNet pretrained weights or train for more iterations, the KPIs might differ.

    Verifying the model

    Now that you’ve trained the model, run inference and verify the predictions. To verify the model visually with TAO Toolkit, use the tlt-infer command. The tlt-infer command supports inference on the .tlt model, as well as the TensorRT engine. tlt-infer generates annotated images with bounding boxes rendered on them. Optionally, you can also visualize the segmentation masks or serialize the output metadata in COCO JSON format. For example, to run an inference with a .tlt file, run the following command:

    tlt-infer mask_rcnn -i $input_images_dir -o $annotated_images_dir -e $spec_file -m $tlt_model -l $json_label -t $threshold --include_mask

    Figure 2 shows an example of the original image compared to the annotated output shown in Figure 3. As you can see, the model is robust to an image that is different from the COCO training data.

    Figure 2. The original image: a street scene with cars, before inference.
    Figure 3. Annotated output of the original image: predicted bounding boxes and instance masks from the trained model.

    Exporting the model

    Inference throughput and how quickly you can create an efficient model are two key metrics for deploying deep learning applications because they directly affect the time to market and the cost of deployment. TAO Toolkit includes a tlt-export command to export and prepare TAO Toolkit models for deployment. The tlt-export command can optionally generate the calibration cache for running inference at INT8 precision. For more information, see Deploying Deep Neural Networks with NVIDIA TensorRT.

    The model is exported as a .etlt (encrypted TLT) file. The file is consumable by the DeepStream SDK, which decrypts the model and converts it to a TensorRT engine. Exporting the model decouples the training process from inference and allows conversion to TensorRT engines outside the TAO Toolkit environment. TensorRT engines are specific to each hardware configuration and should be generated for each unique inference environment.

    For example, to export the model in INT8, use the following command:

    tlt-export mask_rcnn -m $model_path -o $int8_etlt_file -e $spec_file -k $KEY --cal_image_dir $calibration_image_dir --batch_size N  --batches $num_cal_batches --cal_cache_file $calibration_table --cal_data_file $calibration_data_cache --data_type int8

    This generates an INT8 calibration table and a .etlt file. To quantize the model to INT8, you must provide a dataset to calibrate on, specified by the --cal_image_dir and --cal_data_file arguments. These arguments point to the image directory and the tensorfile needed for calibration. The number of batches in the tensorfile is derived from the --batches and --batch_size values. Make sure that the directory specified by --cal_image_dir has at least (batch_size * batches) images in it.
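
    Before launching the export, it is worth confirming that the calibration directory really does contain at least batch_size * batches images. The short sketch below performs that check; the directory path and values are placeholders that should match what you pass to tlt-export:

    import os

    # Placeholders: set these to the values passed to tlt-export.
    cal_image_dir = "/path/to/calibration_images"
    batch_size = 8
    batches = 10

    images = [f for f in os.listdir(cal_image_dir)
              if f.lower().endswith((".jpg", ".jpeg", ".png"))]
    required = batch_size * batches
    print(f"found {len(images)} calibration images, need at least {required}")
    assert len(images) >= required, "add more images to the calibration directory"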

    Deploying with DeepStream

    Integrating a Mask R-CNN model in DeepStream is straightforward, as DeepStream 5.0 supports instance segmentation networks by default. The configuration file and label file for the model are provided in the SDK. These files can be used with the generated model as well as with your own trained model. A sample Mask R-CNN model trained on a one-class dataset is provided on GitHub. The config and label file should work by default for that model. For the model that you trained in this post, minor modifications are required.

    Start by downloading and installing the DeepStream SDK. The instructions to run TAO Toolkit models with DeepStream are provided in the README:

    /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models

    Here are the key config files to run the Mask R-CNN model:

    /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/deepstream_app_source1_mrcnn.txt
    
    /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/config_infer_primary_mrcnn.txt

    The /deepstream_app_source1_mrcnn.txt file is the main config used by deepstream-app. This file configures the parameters for the entire video analytics pipeline. For more information, see Reference Application Configuration. For more information about the Mask R-CNN inference pipeline in DeepStream, see Building Intelligent Video Analytics Apps Using NVIDIA DeepStream 5.0 (Updated for GA).

    The /config_infer_primary_mrcnn.txt file is an inference config file that sets parameters for Mask R-CNN inference. This file is called by the main deepstream_app_source1_mrcnn.txt config. Here are the key parameters that you modify based on your model:

    tlt-model-key=<encryption key used during training>
    tlt-encoded-model=<path to the exported .etlt model>
    labelfile-path=<path to the label file>
    int8-calib-file=<path to the INT8 calibration cache>
    infer-dims=<c;h;w of the model input>
    num-detected-classes=<# of classes if different than default>

    Here is an example:

    [property]
    gpu-id=0
    net-scale-factor=0.017507
    offsets=123.675;116.280;103.53
    model-color-format=0
    tlt-model-key=<encryption key used during training>
    tlt-encoded-model=<path to the exported .etlt model>
    output-blob-names=generate_detections;mask_head/mask_fcn_logits/BiasAdd
    parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLT
    custom-lib-path=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infercustomparser.so
    network-type=3 ## 3 is for instance segmentation network
    labelfile-path=<path to the label file>
    int8-calib-file=<path to the INT8 calibration cache>
    infer-dims=<c;h;w of the model input>
    num-detected-classes=<# of classes if different than default>
    uff-input-blob-name=Input
    batch-size=1
    ## 0=FP32, 1=INT8, 2=FP16 mode
    network-mode=2
    interval=0
    gie-unique-id=1
    ## no cluster
    ## 0=Group Rectangles, 1=DBSCAN, 2=NMS, 3=DBSCAN+NMS Hybrid, 4=None (no clustering)
    ## MRCNN supports only cluster-mode=4; clustering is done by the model itself
    cluster-mode=4
    output-instance-mask=1

    This runs on the clip that is provided in the SDK. To try on your own source, modify [source0] in /deepstream_app_source1_mrcnn.txt.

    Figure 4 shows the end-to-end performance that you can expect with deepstream-app across various platforms. The performance is measured in frames per second (FPS) processed by deepstream-app.

    • All configurations run at an inference resolution of 1344×832.
    • On NVIDIA Jetson Nano and on DLAs, inference runs with a batch size of 1.
    • On Jetson AGX Xavier and Xavier NX, inference runs with a batch size of 2.
    • On T4, inference runs with a batch size of 4.
    Figure 4. Performance of the Mask R-CNN model with DeepStream.

    Conclusion

    In this post, you learned about training instance segmentation models using the Mask R-CNN architecture with TAO Toolkit. The post walked through taking an open-source COCO dataset and one of the pretrained models from NGC, training and optimizing the model with TAO Toolkit, and deploying it on the edge using the DeepStream SDK.

    You can apply these steps to train and deploy your own custom networks. Training can be done on multiple GPUs to parallelize the runs and train faster. You can also generate INT8 calibration files to run inference at INT8 precision. Running at INT8 precision improves inference performance on your edge devices.
