OS2G Pre
DPU
![[attachments/Pasted image 20251012142019.png]]
Key components of a typical DPU:
- A PCIe switch enables the DPU to access host resources, including host memory and GPUs.
- A network interface enables the DPU to manage network traffic efficiently at high throughput.
- Multi-core processors and onboard memory
- Hardware accelerators for specialized functions such as encryption, decryption, data compression, and storage.
Summary: a DPU can offload various data processing tasks, such as:
- Network Offload
- Storage services
- Encryption & Decryption
- NIC functions, like DMA and GPUDirect/RDMA
Object Storage Clients For Deep Learning
Deep learning applications usually read data through POSIX file system interfaces, while object storage systems use HTTP RESTful APIs.
Two Main Approaches:
- POSIX-to-REST translation: translate file-system operations (e.g., open, read, write) into HTTP requests, so applications can access remote data as if it were local files. Tools like s3fs.
- Direct integration: applications directly use object storage SDKs or libraries that abstract the REST API calls, such as Boto3 or the Amazon S3 Connector for PyTorch (see the sketch below).
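A minimal sketch of the direct-integration approach using Boto3; the bucket and key names are placeholders and credentials are assumed to be configured. Note that every GET request and response parse here runs on the host CPU, which is the cost OS2G later targets.

```python
# Hedged sketch: direct integration via boto3. Bucket/key are placeholders.
import boto3

s3 = boto3.client("s3")

def load_object(bucket: str, key: str) -> bytes:
    # One HTTP GET per object; TLS, TCP, and response parsing all run on the host CPU.
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

sample = load_object("training-data", "images/000001.jpg")
print(len(sample), "bytes fetched")
```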
FUSE & Virtiofs
![[attachments/Pasted image 20251013125518.png]] ![[attachments/Pasted image 20251013125528.png]]
Under FUSE, when an application initiates a file system request, the kernel encapsulates it as a FUSE request containing details such as the file path, operation type, and data. The kernel then forwards this request to the FUSE program running in user space. The FUSE program processes the request (for example, reading from a remote source) and sends a response back to the kernel, which finally delivers the result to the application.
Application
↓
Linux Kernel
↓ (FUSE kernel module)
User-space FUSE program (e.g., s3fs)
↓
Real data source (e.g., S3 object storage)
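As a concrete illustration of the user-space side of this flow, here is a minimal read-only FUSE file system using the third-party fusepy package. The mount point and file contents are made up; a real client such as s3fs would fetch the bytes from object storage inside read() instead of serving an in-memory string.

```python
# Toy user-space FUSE program (the "FUSE program" box in the flow above).
# Assumes the fusepy package is installed and /mnt/toy exists.
import errno
import stat

from fuse import FUSE, FuseOSError, Operations

DATA = {"/hello.txt": b"data served from user space\n"}

class ToyFS(Operations):
    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path in DATA:
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(DATA[path])}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", ".."] + [p.lstrip("/") for p in DATA]

    def read(self, path, size, offset, fh):
        # The kernel forwards the application's read() here as a FUSE request.
        return DATA[path][offset:offset + size]

if __name__ == "__main__":
    FUSE(ToyFS(), "/mnt/toy", foreground=True)
```

Running this and then reading /mnt/toy/hello.txt exercises exactly the kernel → user-space round trip shown in the flow above.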
Virtiofs is a file system for virtualized environments that builds on the FUSE framework.
[VM App]
↓ POSIX calls
[FUSE kernel module in VM]
↓
[Virtio queue (shared memory link)]
↓
[Virtiofs daemon on host]
↓
[Host file system (ext4, xfs, etc.)]
Motivation
- Object storage clients use the HTTP/HTTPS protocol over TCP, which consumes host CPU resources.
- RDMA can bypass the CPU and kernel network stack, greatly reducing overhead. However, RDMA does not support HTTP/HTTPS, which are the protocols used by object storage systems (like S3).
- The object storage protocol also needs to be parsed into a usable file or stream format, which adds extra CPU cost.
In summary, using object storage in deep learning applications leads to significant CPU consumption due to network and storage operations. This motivates the authors to offload the object storage client to the DPU.
Naive Offloading
![[attachments/Pasted image 20251012162420.png]]
A straightforward way to offload the object storage client to the DPU is by using Virtiofs.
However, this design suffers from significant performance degradation because each operation involves frequent context switching between kernel and user space, as well as data copying between the Virtiofs handler and other components.
To eliminate this context switching on the DPU, the authors designed a new OS2G Client that interacts with the virtiofs handler directly, bypassing the kernel FUSE interface.
![[attachments/Pasted image 20251012164647.png]]
Data Transfer
For GPU-based DL applications, if data is not transferred from DPU memory to GPU memory directly, it must first be copied from the DPU to host CPU memory and only then to the GPU.
Therefore, GPUDirect Storage technology can be used to transfer data from DPU memory to GPU memory through the DMA engine; here the DPU memory works as a data buffer.
Note: NVMe devices can transfer data directly to GPUs, and data from distributed file systems can be DMA-transferred from an RDMA-based NIC to the GPU. Object storage systems, however, require the HTTP/HTTPS protocol, so no existing solution supports direct transfer without staging the data in a DPU-side cache.
![[attachments/Pasted image 20251013130333.png]]
Traditional:
Remote Object Storage → Host CPU Memory → GPU
↑ TCP/HTTP + FUSE
Using DPU:
Remote Object Storage → DPU Memory → (DMA) → GPU
↑ HTTP Parse & S3 Protocol
Using DPU without using DMA:
Remote Object Storage → DPU Memory → Host CPU Memory → GPU
↑ HTTP Parse & S3 Protocol
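For contrast, here is a minimal sketch of the "Traditional" path above, assuming Boto3 and PyTorch; the bucket and key are placeholders. The object is pulled over HTTP into host CPU memory and only then copied to the GPU, which is exactly the extra hop (and CPU work) that the DPU + DMA path removes.

```python
# Hedged sketch of the traditional path: object storage -> host CPU memory -> GPU.
import boto3
import torch

s3 = boto3.client("s3")

def load_to_gpu(bucket: str, key: str) -> torch.Tensor:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()          # host memory
    host_tensor = torch.frombuffer(bytearray(body), dtype=torch.uint8)   # still host memory
    return host_tensor.cuda()                                            # extra host -> GPU copy
```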
Summary:
- The authors offload the storage client to the DPU to reduce CPU pressure.
- Naive offloading results in performance degradation and a data movement problem (DPU → CPU → GPU).
- They designed a new storage client that runs on the DPU without kernel involvement, avoiding context switches.
- Data is transferred from the DPU to the GPU directly via DMA, without CPU involvement.
Overall Architecture
![[attachments/Pasted image 20251013025713.png]]
- The host initiates a request to read object storage data.
- Data processing and data transfer are handled entirely by the DPU.
- The data is transferred from DPU memory to the GPU without host intervention.
GDD Driver:
- The GDD driver translates the GPU virtual address into a host physical address.
- It adds the host physical address to the request data passed to the DPU.
- The DPU uses this address to drive its DMA engine.
OS2G Driver:
- OS2G-FS Driver is a file-system driver registered under the host VFS that reuses the virtiofs framework. It encapsulates the host’s file-system requests for object-storage access into FUSE messages and forwards them to the DPU via the virtiofs device queue.
OS2G Client:
- It is an object storage client that abandons the kernel FUSE interface.
- The OS2G client's data path is optimized using asynchrony, pre-reading, and concurrency strategies.
![[attachments/Pasted image 20251013094815.png]]
I/O Data Path
- The DL application allocates GPU memory and sends a read request with the GPU memory address.
- The OS2G-FS driver encapsulates the file operation request into the FUSE format and enqueues it to the virtiofs queue. (The virtiofs queue is shared memory accessible by both the host and the DPU; requests and results are kept in a ring buffer to support continuous, asynchronous requests.)
- The virtiofs handler polls the virtiofs device for new requests and parses the FUSE protocol.
- The FUSE request is forwarded to the OS2G Client.
- The OS2G client initiates a request to fetch data from the server, parses the response into its data buffer, and returns the processed data to the virtiofs handler, which knows the GPU address where the data should be placed.
- The virtiofs handler configures the DMA engine, which transfers the data to GPU memory.
- The virtiofs device in the DPU triggers a PCIe interrupt to notify the host (a sketch of this DPU-side loop follows).
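A minimal, self-contained sketch of the DPU-side loop described above. This is not the authors' implementation: a Python queue.Queue stands in for the virtiofs device queue, a plain dict for the FUSE request, and fetch_from_object_store()/dma_write() are placeholders for the OS2G client's HTTP fetch and the DPU DMA engine.

```python
# Hedged simulation of the DPU-side request loop; all interfaces are stand-ins.
import queue
import threading

virtiofs_queue: "queue.Queue" = queue.Queue()  # stand-in for the shared virtiofs queue

def fetch_from_object_store(path: str, offset: int, size: int) -> bytes:
    # Placeholder for the OS2G client's HTTP GET + response parsing into the data buffer.
    return b"\x00" * size

def dma_write(gpu_addr: int, data: bytes) -> None:
    # Placeholder for programming the DPU DMA engine with the host-provided GPU address.
    print(f"DMA {len(data)} bytes -> GPU address {gpu_addr:#x}")

def virtiofs_handler() -> None:
    while True:
        req = virtiofs_queue.get()          # poll for a FUSE-formatted request
        if req is None:                     # shutdown signal for this toy example
            break
        data = fetch_from_object_store(req["path"], req["offset"], req["size"])
        dma_write(req["gpu_addr"], data)    # data never passes through host CPU memory
        # In the real system the virtiofs device would now raise a PCIe interrupt
        # to notify the host that the request completed.

handler = threading.Thread(target=virtiofs_handler)
handler.start()
virtiofs_queue.put({"path": "/bucket/object", "offset": 0,
                    "size": 128 * 1024, "gpu_addr": 0x7F0000000000})
virtiofs_queue.put(None)
handler.join()
```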
![[attachments/Pasted image 20251013095433.png]]
High-Performance OS2G Client
- Blocking: the full request is divided into multiple small block requests, each 128 KB in size.
- Asynchrony: the ring buffer ensures non-blocking interaction between the DPU and the host.
- Pre-reading: upon receiving a read request for a file, the OS2G client loads the entire file into its data buffer.
- Reading concurrency: the OS2G client divides the read request into several blocks and reads them concurrently (see the sketch below).
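A rough sketch of the blocking + concurrent-reading idea, assuming S3 ranged GET requests via Boto3. The 128 KB block size comes from the text; the worker count, bucket, and key are illustrative, and this is not the authors' code (which runs on the DPU against its own data buffer).

```python
# Hedged sketch: split one object read into 128 KB blocks and fetch them concurrently.
from concurrent.futures import ThreadPoolExecutor

import boto3

BLOCK = 128 * 1024
s3 = boto3.client("s3")

def read_block(bucket: str, key: str, start: int, end: int) -> bytes:
    # Ranged GET: "bytes=start-end" is inclusive on both ends.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

def concurrent_read(bucket: str, key: str, size: int, workers: int = 8) -> bytes:
    ranges = [(off, min(off + BLOCK, size) - 1) for off in range(0, size, BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: read_block(bucket, key, *r), ranges)
    return b"".join(parts)
```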
GDD (GPU Direct DPU)
- Allocate GPU memory and obtain GPU_VA.
- Allocate a host VA that maps to GPU_VA.
- Issue read(VA); the PA mapped to this VA is resolved.
- Pass the VA (and PA) to the OS2G-FS Driver (see the sketch below).
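A hedged sketch of the application-visible part of this flow using PyTorch: allocate GPU memory and obtain its virtual address via data_ptr(). The host-VA mapping and the handoff to the OS2G-FS driver are kernel-side work, represented here only by a hypothetical placeholder function.

```python
# Hedged sketch: only the GPU allocation / GPU_VA step is real PyTorch;
# submit_os2g_read() is a hypothetical stand-in for the driver interface.
import torch

def submit_os2g_read(path: str, gpu_va: int, nbytes: int) -> None:
    # Placeholder: the real GDD/OS2G-FS drivers map gpu_va to a host physical
    # address and enqueue the FUSE-formatted request via the virtiofs queue.
    raise NotImplementedError

def read_into_gpu(path: str, nbytes: int) -> torch.Tensor:
    buf = torch.empty(nbytes, dtype=torch.uint8, device="cuda")  # allocate GPU memory
    gpu_va = buf.data_ptr()                                      # GPU_VA seen by the app
    submit_os2g_read(path, gpu_va, nbytes)
    return buf
```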
Evaluation
![[attachments/Pasted image 20251013090452.png]]
Wait time: waiting for data transfer (I/O/DMA)
Load data: CPU time spent fetching, parsing, and moving training data from object storage to GPU memory
OS2G-Host: a variant that runs the OS2G Client on the host, used to isolate the effects of offloading and GDD (GPU Direct DPU).
OS2G reduces the execution time of deep learning applications due to GDD.
The data loading task is offloaded from the CPU.
![[attachments/Pasted image 20251013092253.png]]
This shows that:
- Offloading the OS2G Client from the host to the DPU does not result in performance loss.
- GDD (GPU Direct DPU) optimizes the data path, enhancing execution efficiency.
![[attachments/Pasted image 20251013091610.png]]
- This shows that the optimization strategies (asynchrony, pre-reading, concurrent data reading) enable efficient processing.
- When running multiple applications, OS2G demonstrates higher execution performance and significantly lower CPU consumption than the host-mode solution.