Data Management and Storage for Autonomous-Vehicle Developers

Dell EMC helps automotive companies pursue new data-driven business opportunities in the Digital Age by offering massively scalable, easily managed, high-performance storage systems that support both traditional workflows and the data-intensive emerging workflows required for ADAS projects, including AI/ML/DL.

Dell EMC chief technology officer Florian Baumann, Ph.D., discussed with SAE’s Autonomous Vehicle Engineering the important role of storage systems and data management in the face of growing data volumes and increasing performance requirements.

AVE: What benefits can automotive companies gain by using remote storage?

Baumann: A remote data lake increases engineering productivity and reduces both license and infrastructure costs. Automotive customers are storing data in a centralized, remotely located data lake instead of maintaining CPUs, GPUs and storage nodes at different sites, each with its own software licenses. Compute jobs can then be started at the data lake itself, close to the data.

AVE: How do GPUs and artificial intelligence change user requirements?

Baumann: GPUs massively enhance the training and inferencing performance of artificial intelligence (AI) systems and deep-learning algorithms. They also handle the vector math, offloading that workload from the CPU. Deploying AI brings new user requirements, including an end-to-end training toolchain. Efficient storage management is critical for gaining the full benefits that GPUs offer.

AVE: Can remote storage meet demanding access times?

Baumann: A well-designed centralized data lake architecture that’s located remotely can meet demanding access times and bandwidth requirements. Data ingestion from the vehicle to the R&D center and then to the remote data lake is a prominent and very challenging task in the development of ADAS and autonomous-driving systems.

A vehicle’s data collection can generate up to 100 terabytes per day. To meet the access times and counter limited bandwidth on wide area network (WAN) lines, a local cache serves as a buffer before data is moved to the remote data lake. Typical service-level agreements are that data must be offloaded from the vehicle in less than four hours and ingested into the remote data lake in less than 24 hours. Tools such as UDP file acceleration help to fully utilize the WAN line.
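
As a rough illustration of what these service levels imply, the sketch below computes the sustained throughput needed to move one day of data within each window. It assumes decimal terabytes (10^12 bytes) and a perfectly steady transfer; the function name and constants are illustrative and not part of any Dell EMC tooling.

```python
def required_gbit_per_s(terabytes: float, hours: float) -> float:
    """Sustained throughput (Gbit/s) needed to move `terabytes` within `hours`.

    Assumes decimal units (1 TB = 1e12 bytes) and no protocol overhead.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    bits = terabytes * 1e12 * 8      # terabytes -> bits
    seconds = hours * 3600           # hours -> seconds
    return bits / seconds / 1e9      # bits per second -> Gbit/s


DAILY_TB = 100  # upper bound per vehicle per day quoted in the interview

offload = required_gbit_per_s(DAILY_TB, hours=4)    # vehicle -> local cache
ingest = required_gbit_per_s(DAILY_TB, hours=24)    # local cache -> data lake

print(f"Vehicle offload in 4 h:    {offload:.1f} Gbit/s")
print(f"Data-lake ingest in 24 h:  {ingest:.1f} Gbit/s")
```

Under these assumptions, a 100-terabyte day needs roughly 55.6 Gbit/s to offload in four hours and about 9.3 Gbit/s to ingest in 24 hours, which suggests why the fast offload lands in a local cache rather than going straight over the WAN.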

A combination of on-premises infrastructure and public cloud architecture, also referred to as hybrid cloud, can help to absorb peak workloads.

AVE: Does the expansion from petabytes to exabytes impact data integrity?

Baumann: Moving from petabyte scale to exabyte scale requires a well-established process for data management as well as metadata management. Files, objects and sensor metadata (city, road surface, weather, light level, traffic level, etc.) should be registered and tracked in a database, together with their locations on the storage system. Further, it is crucial that the performance, storage and database capabilities all scale to meet larger data volumes without impacting legacy tests or information, and without adding infrastructure management complexity.
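
One way to picture such a registry is a small catalog table that stores each recording's storage location next to its sensor metadata. The following is a minimal sketch using Python's built-in sqlite3 module; the table name, column names, helper functions and example paths are assumptions made for illustration, and a production system at exabyte scale would use a database designed for that load.

```python
import sqlite3

# Metadata fields named in the interview; the column names are illustrative.
META_COLUMNS = ("city", "road_surface", "weather", "light_level", "traffic_level")


def create_catalog(conn: sqlite3.Connection) -> None:
    """Create the catalog table: one row per recording on the storage system."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS recordings (
               id INTEGER PRIMARY KEY,
               storage_path TEXT NOT NULL UNIQUE,
               city TEXT, road_surface TEXT, weather TEXT,
               light_level TEXT, traffic_level TEXT)"""
    )


def _check_columns(names) -> None:
    unknown = set(names) - set(META_COLUMNS)
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")


def register(conn: sqlite3.Connection, storage_path: str, **meta: str) -> None:
    """Record where a file lives together with its sensor metadata."""
    _check_columns(meta)
    values = [meta.get(col) for col in META_COLUMNS]
    conn.execute(
        f"INSERT INTO recordings (storage_path, {', '.join(META_COLUMNS)}) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        [storage_path, *values],
    )


def find_paths(conn: sqlite3.Connection, **filters: str) -> list:
    """Return storage paths of recordings matching every given metadata filter."""
    _check_columns(filters)
    # Column names come from the whitelist above; values are bound parameters.
    where = " AND ".join(f"{col} = ?" for col in filters) or "1"
    rows = conn.execute(
        f"SELECT storage_path FROM recordings WHERE {where} ORDER BY storage_path",
        list(filters.values()),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_catalog(conn)
register(conn, "/lake/run_0001.bag", city="Munich", weather="rain", light_level="night")
register(conn, "/lake/run_0002.bag", city="Munich", weather="clear", light_level="day")
print(find_paths(conn, city="Munich", weather="rain"))
```

Because the catalog holds only paths and metadata, a query such as "all rainy night drives in one city" can be answered without touching the sensor files themselves.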

AVE: What storage-management techniques are used to enhance data mining and analysis?

Baumann: To enhance data mining, analytics and the training of machine-learning algorithms, data-management systems must be in place. A data-management system is software that receives sensor data as well as metadata. Take object detection as an example: the sensor data is the image, while the metadata are the bounding boxes around the objects that should be detected. A data-management system serves as a central point to which developers can connect to locate and access the data and metadata needed to train the machine-learning algorithms.
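
The object-detection example can be sketched as a pair of record types: one for the image (the sensor data) and one for each bounding box (the metadata). The class names, fields and selection helper below are hypothetical and only illustrate how a developer might ask such a system for training samples.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """Metadata: a labeled rectangle in pixel coordinates."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


@dataclass
class LabeledImage:
    """Sensor data (the image location) plus its bounding-box metadata."""
    image_path: str
    boxes: List[BoundingBox] = field(default_factory=list)


def select_for_training(samples: List[LabeledImage], label: str) -> List[LabeledImage]:
    """Return the samples that contain at least one object with the given label."""
    return [s for s in samples if any(b.label == label for b in s.boxes)]


samples = [
    LabeledImage("/lake/frames/000001.png", [BoundingBox("pedestrian", 40, 60, 90, 200)]),
    LabeledImage("/lake/frames/000002.png", [BoundingBox("car", 10, 10, 300, 180)]),
    LabeledImage("/lake/frames/000003.png"),  # no labeled objects yet
]

for sample in select_for_training(samples, "pedestrian"):
    print(sample.image_path, [box.area() for box in sample.boxes])
```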

AVE: How can data be transferred to storage repositories?

Baumann: Data is usually transferred over WAN lines from the remote site to the central data lake. Alternatively, storage cartridges are physically shipped by postal service. The network transfer and ingest are done using UDP file acceleration; a 10-gigabyte file can be transferred from the U.S. to Europe through a 1 Gbit/s WAN line in 1-2 minutes.
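
The quoted transfer time can be checked with simple arithmetic. The sketch below assumes decimal gigabytes, a link rate of 1 gigabit per second, and a hypothetical `efficiency` parameter standing for the fraction of the line rate actually achieved; none of these names come from a specific transfer tool.

```python
def transfer_seconds(size_gigabytes: float, link_gbit_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Time to move a file over a link, given the usable fraction of the line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    size_gbit = size_gigabytes * 8                       # gigabytes -> gigabits
    return size_gbit / (link_gbit_per_s * efficiency)    # seconds


ideal = transfer_seconds(10, 1)           # full line rate
realistic = transfer_seconds(10, 1, 0.7)  # assumed 70% link utilization

print(f"At 100% utilization: {ideal:.0f} s")
print(f"At  70% utilization: {realistic:.0f} s")
```

At full line rate the 10-gigabyte file takes 80 seconds, so finishing within two minutes requires keeping the link at least about two-thirds utilized, which is the kind of utilization that file-acceleration tools aim for on long-distance links.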

Additionally, data can be compressed, or data-cleaning methods can be applied, before it is transferred. Data-cleaning methods are algorithms applied to the data to identify the meaningful portion that can be used to train the final system.
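
A minimal sketch of the idea, assuming frames arrive as equally sized byte strings: near-duplicate frames are dropped with a simple mean-absolute-difference test, and what remains is compressed with Python's standard zlib module. The threshold and the similarity test are illustrative stand-ins, not the specific cleaning algorithms used in practice.

```python
import zlib
from typing import List


def drop_near_duplicates(frames: List[bytes], min_mean_abs_diff: float = 2.0) -> List[bytes]:
    """Keep a frame only if it differs enough from the last frame that was kept."""
    kept: List[bytes] = []
    for frame in frames:
        if kept and frame and len(frame) == len(kept[-1]):
            # Mean absolute per-byte difference against the last kept frame.
            diff = sum(abs(a - b) for a, b in zip(frame, kept[-1])) / len(frame)
            if diff < min_mean_abs_diff:
                continue  # too similar: adds little new information
        kept.append(frame)
    return kept


def pack(frames: List[bytes]) -> bytes:
    """Concatenate the kept frames and compress them losslessly before transfer."""
    return zlib.compress(b"".join(frames))


frames = [bytes([10] * 16), bytes([11] * 16), bytes([200] * 16)]
kept = drop_near_duplicates(frames)
payload = pack(kept)
print(f"kept {len(kept)} of {len(frames)} frames, payload is {len(payload)} bytes")
```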