    Editor’s Note: This post has been updated. Here is the revised post.

    Training deep learning models with vast amounts of data is necessary to achieve accurate results. Data in the wild, or even prepared data sets, is usually not in a form that can be fed directly into a neural network. This is where NVIDIA DALI data preprocessing comes into play.

    There are various reasons for that:

    • Different storage formats
    • Compression
    • Data format and size may be incompatible
    • Limited amount of high quality data

    Addressing the above issues requires that your training pipeline provide extensive data preprocessing capabilities, such as loading, decoding, decompression, data augmentation, format conversion, and resizing. You may have used the native implementations in existing machine learning frameworks, such as TensorFlow, PyTorch, MXNet, and others, for these pre-processing steps. However, this creates portability issues due to framework-specific data formats, the set of available transformations, and their implementations. Training in a truly portable fashion requires that the data pipeline, including its augmentations, be portable as well.

    The CPU bottleneck

    Data preprocessing for deep learning workloads has garnered little attention until recently, eclipsed by the tremendous computational resources required for training complex models. As such, preprocessing tasks typically ran on the CPU due to simplicity, flexibility, and the availability of libraries such as OpenCV or Pillow.

    Recent advances introduced in the NVIDIA Volta and Turing GPU architectures have significantly increased GPU throughput in deep learning tasks. In particular, half-precision arithmetic and Tensor Cores accelerate certain types of FP16 matrix calculations useful for training DNNs. Dense multi-GPU systems like NVIDIA’s DGX-1 and DGX-2 can train a model much faster than data can be provided by the processing framework, leaving the GPUs starved for data.

    Today’s DL applications include complex, multi-stage data processing pipelines consisting of many serial operations. Relying on the CPU to handle these pipelines limits your performance and scalability.

    DALI to the Rescue

    NVIDIA Data Loading Library (DALI) is the result of our efforts to find a scalable and portable solution to the data pipeline issues mentioned above. DALI is a set of highly optimized building blocks plus an execution engine that accelerates input data pre-processing for deep learning applications, as diagrammed in figure 1. DALI provides both performance and flexibility for accelerating different data pipelines.

    DALI in the DL training pipeline diagram
    Figure 1. DALI overview and its presence in the DL training pipeline

    DALI currently supports computer vision tasks such as image classification, recognition, and object detection. It also supports H.264 and HEVC decoding for video data. Additional features, such as support for medical volumetric data and inference pre- and post-processing, may be added in future versions.

    Since new networks and augmentations appear every day, DALI’s plugin manager provides an easy way to extend the existing functionality. Custom operators can be implemented, compiled, and loaded into DALI separately, as sketched below.
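
    As a rough sketch, loading a separately compiled plugin might look like the following; the library path and the operator name are placeholders for your own plugin, not part of DALI itself:

    import nvidia.dali.plugin_manager as plugin_manager
    import nvidia.dali.ops as ops

    # Register the operators compiled into a custom shared library
    # (libcustomop.so is a hypothetical plugin of your own).
    plugin_manager.load_library("./libcustomop.so")

    # Once loaded, a custom operator can be instantiated like any built-in one,
    # e.g. ops.CustomOp(), assuming the plugin registers an operator named "CustomOp".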

    DALI provides portability of entire pipelines between different DL frameworks, as shown in figure 2. A range of reader operators lets you use data containers that are not natively supported by your chosen DL framework. For example, you can use an LMDB data set in MXNet- or TensorFlow-based networks.

    DALI interoperability diagram
    Figure 2. DALI can be fed by a variety of data sources

    DALI offers drop-in integration of your data pipeline into different deep learning frameworks: simple one-liner plugins wrapping the DALI pipeline are available for TensorFlow, MXNet, and PyTorch. In addition, you can reuse pre-processing implementations between these deep learning frameworks.

    Lastly, since DALI is open source, you can readily customize and adapt it to suit the data pre-processing needs of a variety of training pipelines.

    DALI Key features

    DALI offers a simple Python interface where you can implement a data processing pipeline in a few steps:

    1. Select operators from the extensive list of supported operators
    2. Define the operation flow as a symbolic graph in an imperative way (as in most current deep learning frameworks)
    3. Build an operation pipeline
    4. Run the graph on demand
    5. Integrate with your target deep learning framework through a dedicated plugin

    Let us now dive into the inner workings of DALI, followed by how to use it.

    DALI inner workings

    DALI defines the data pre-processing pipeline as a dataflow graph, with each node representing a data processing operator. DALI has three types of operators:

    • CPU: accepts and produces data on the CPU
    • Mixed: accepts data on the CPU and produces the output on the GPU
    • GPU: accepts and produces data on the GPU

    Although DALI is developed mostly with GPUs in mind, it also provides a variety of CPU operator variants. This enables utilizing available CPU cycles for use cases where the CPU-to-GPU ratio is high or where the network itself consumes all available GPU cycles. You should experiment with CPU/GPU operator placement to find the sweet spot, as in the sketch below.
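
    As a minimal sketch (the Resize operator and its parameters are used purely as an illustration), placement is controlled by the device argument when an operator is created, and a CPU output can be moved to the GPU explicitly:

    import nvidia.dali.ops as ops

    # The same operator can be instantiated as a CPU or a GPU variant
    # simply by changing the device argument.
    resize_cpu = ops.Resize(device="cpu", resize_x=224, resize_y=224)
    resize_gpu = ops.Resize(device="gpu", resize_x=224, resize_y=224)

    # Inside define_graph, a CPU output can be transferred to the GPU explicitly,
    # e.g. images_gpu = images.gpu(), before being passed to a GPU operator.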

    For performance reasons, DALI only transfers data in the CPU -> Mixed -> GPU direction, as shown in figure 3.

    DALI example pipeline diagram
    Figure 3. DALI example pipeline

    Existing frameworks offer prefetching, which prepares the data needed for upcoming batches before it is requested. DALI prefetches transparently and lets you define the prefetch queue length flexibly during pipeline construction, as shown in figure 4 and in the short sketch after it. This makes it straightforward to hide high variation in the batch-to-batch processing time.

    Data processing overlaps training diagram
    Figure 4. How data processing overlaps with training
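
    Jumping ahead slightly to the Python API described below, here is a minimal sketch of setting the queue length at construction time. It assumes the Pipeline constructor’s prefetch_queue_depth parameter; the value 2 is purely an example:

    from nvidia.dali.pipeline import Pipeline

    class PrefetchingPipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            # prefetch_queue_depth controls how many batches DALI prepares ahead
            # of the training step; a deeper queue hides batch-to-batch variance.
            super(PrefetchingPipeline, self).__init__(batch_size, num_threads, device_id,
                                                      prefetch_queue_depth=2)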

    How to use DALI

    As mentioned above, DALI follows a graph-based execution model. The following example shows you how to define, build, and run a simple pipeline using the Python API.

    DALI Python API

    The central feature of the DALI Python API is the Pipeline class. You create your own subclass of Pipeline by instantiating the desired operators and defining the connections between them.

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops
    import nvidia.dali.types as types

    image_dir = "data/images"  # placeholder: directory with one sub-directory per class

    class SimplePipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            super(SimplePipeline, self).__init__(batch_size, num_threads, device_id)
            self.input = ops.FileReader(file_root = image_dir)
            self.decode = ops.HostDecoder(output_type = types.RGB)

        def define_graph(self):
            jpegs, labels = self.input()
            images = self.decode(jpegs)
            return (images, labels)

    You only need to write two methods:

    • __init__: Choose the operators (you can find them in the nvidia.dali.ops module). This simple pipeline uses only two operators: FileReader to read files from the drive and HostDecoder to decode images to RGB format. You also need to pass the following parameters to super: the batch size (Pipeline handles batching data for you), the number of worker threads you wish to use, and the ID of the GPU device employed for the job.
    • define_graph: Define the computation by connecting operators together. Obtain images as JPEGs with their corresponding labels from FileReader, pass them to the decoder, and return the decoded data with labels as the output of the pipeline.

    The next step is to create a SimplePipeline object and build it to actually construct the graph of operations.

    batch_size = 8  # example value
    pipe = SimplePipeline(batch_size, 1, 0)
    pipe.build()

    At this point, the pipeline is ready to use. You can obtain a batch of data by calling the run method.

    images, labels = pipe.run()
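
    The outputs are DALI TensorList objects. As a small sketch (assuming the CPU TensorList’s at method, which exposes a single sample as a NumPy-compatible array), you can inspect the batch like this:

    # images and labels come from pipe.run() above
    first_image = images.at(0)
    print(first_image.shape)   # e.g. (height, width, 3) for an RGB image
    print(labels.at(0))        # label derived from the sub-directory the file came from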

    Random shuffling of the dataset is required to make it usable for neural network training. To do this, set the seed parameter of the super method and set random_shuffle to True in FileReader:

    def __init__(self, batch_size, num_threads, device_id):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
        self.input = ops.FileReader(file_root = image_dir, random_shuffle = True)
        self.decode = ops.HostDecoder(output_type = types.RGB)

    Now let’s add some actual data augmentation: we will rotate each image by a random angle. For the random angle generation, you can use the Uniform operator, and the Rotate operator for the rotation:

    class SimplePipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
            self.input = ops.FileReader(file_root = image_dir, random_shuffle = True)
            self.decode = ops.HostDecoder(output_type = types.RGB)
            self.rotate = ops.Rotate()
            self.rng = ops.Uniform(range = (-10.0, 10.0))

        def define_graph(self):
            jpegs, labels = self.input()
            images = self.decode(jpegs)
            angle = self.rng()
            rotated_images = self.rotate(images, angle = angle)
            return (rotated_images, labels)

    Figure 5 shows some examples of what occurs when applying these operations.

    Image rotation examples
    Figure 5. Here’s how the output might appear

    You can look at the Getting started example for more information.

    Framework integration

    Seamless interoperability with different deep learning frameworks represents one of the best features of DALI. For example, if you wish to use your pipeline with a PyTorch model, you can easily do so by wrapping it with DALIClassificationIterator.

    from nvidia.dali.plugin.pytorch import DALIClassificationIterator

    train_loader = DALIClassificationIterator(
        pipe,
        size=int(pipe.epoch_size("Reader")))

    During training, you can enumerate over train_loader and feed your model with data.

    for i, data in enumerate(train_loader):
        images = data[0]["data"]
        labels = data[0]["label"].squeeze().cuda().long()
        # model training

    If you need something more generic (such as more outputs), DALIGenericIterator has you covered. For more information and examples with other frameworks (MXNet and TensorFlow), take a look at the Framework integration section of the DALI docs.
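
    As a rough sketch, DALIGenericIterator maps each pipeline output to a name you choose; the output names "data" and "label" below correspond, in order, to the two outputs returned by define_graph and are otherwise arbitrary:

    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    # One entry in the output map per pipeline output, in define_graph order.
    train_loader = DALIGenericIterator(
        [pipe],
        ["data", "label"],
        size=int(pipe.epoch_size("Reader")))

    for i, data in enumerate(train_loader):
        images = data[0]["data"]
        labels = data[0]["label"]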

    Offloading computation to GPU

    The last thing we introduce to SimplePipeline is using the GPU to perform the augmentations. DALI makes this transition as smooth as possible. The only thing that changes in the __init__ method is the creation of the rotate op.

    self.rotate = ops.Rotate(device = "gpu")

    In define_graph, you need to make sure that the inputs to rotate reside on the GPU rather than the CPU.

    rotated_images = self.rotate(images.gpu(), angle = angle)

    That’s it. With those changes, SimplePipeline performs the rotations on the GPU. Keep in mind that the resulting images are also allocated in GPU memory, which is generally not a problem since you probably end up copying them to GPU memory anyway. If not, you can call asCPU on the images object after running the pipeline to copy them back.
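
    Putting the pieces together, a minimal sketch of the GPU-accelerated variant of the pipeline (identical to the earlier version except for the two changes above) looks like this:

    class SimplePipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed = 12)
            self.input = ops.FileReader(file_root = image_dir, random_shuffle = True)
            self.decode = ops.HostDecoder(output_type = types.RGB)
            self.rotate = ops.Rotate(device = "gpu")   # rotation now runs on the GPU
            self.rng = ops.Uniform(range = (-10.0, 10.0))

        def define_graph(self):
            jpegs, labels = self.input()
            images = self.decode(jpegs)
            angle = self.rng()
            # .gpu() moves the decoded images from CPU to GPU memory
            rotated_images = self.rotate(images.gpu(), angle = angle)
            return (rotated_images, labels)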

    For more information on how to use DALI with GPU, take a look at our augmentation gallery example.

    Other examples

    We prepared numerous examples and tutorials on using DALI in different contexts. For instance, if you would like to know how it can be integrated into a complete model training run, the ResNet50 training script in the DALI docs shows this. The documentation covers every step you need to take to use DALI in training ResNet50 on ImageNet. It also shows you how to spread training among multiple GPUs when using DALI.

    Another example shows you how to read data in a format unsupported by DALI by implementing a custom input with the ExternalSource operator, as sketched below.
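
    As a minimal sketch of that pattern (load_next_batch is a placeholder for your own loading code, not a DALI function), the pipeline feeds externally produced data into the graph in iter_setup:

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops

    class ExternalSourcePipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            super(ExternalSourcePipeline, self).__init__(batch_size, num_threads, device_id)
            self.input = ops.ExternalSource()

        def define_graph(self):
            self.batch = self.input()
            return self.batch

        def iter_setup(self):
            # load_next_batch() is a placeholder; it should return a list of
            # NumPy arrays, one per sample in the batch.
            data = load_next_batch()
            self.feed_input(self.batch, data)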

    DALI performance numbers

    NVIDIA showcases DALI in our implementations of SSD and ResNet-50, since it was one of the contributing factors in our MLPerf benchmark success.

    Figure 6 shows ResNet-50 (RN50) training performance with DALI across different GPU configurations:

    ResNet-50 MXNet training performance chart
    Figure 6. As the CPU core/GPU ratio becomes smaller (DGX-1V has 5 CPU cores per GPU, while DGX-2 has only 3), the performance improvement from DALI gets larger.

    Get started with DALI today

    You can download the latest version of the prebuilt and tested DALI pip packages. The NVIDIA GPU Cloud (NGC) containers for TensorFlow, PyTorch, and MXNet have DALI integrated. You can review the many examples and read the latest release notes for a detailed list of new features and enhancements.

    See how DALI can help you accelerate data pre-processing for your deep learning applications. The source code is available on GitHub. We welcome your feedback and code contributions.

    If you are interested in learning more about DALI, listen to our talk from GTC 2018.
