Accelerating Uploads and Downloads on the Hub: Optimizing Data Transfer with Xet-backed Content-Defined Chunking

2025-02-12

In the world of machine learning and AI development, efficiently managing large datasets is a key challenge. Uploading, downloading, and transferring massive repositories can cause delays and slow down productivity. To address these challenges, Hugging Face’s Xet team has introduced content-defined chunking (CDC) to enhance the speed of data transfers on the Hub, significantly improving the user experience. This article dives into how CDC works and how it’s helping AI builders iterate faster and collaborate more efficiently.

From Chunks to Blocks: Enhancing the Speed of Data Movement on the Hub

Content-defined chunking (CDC) underpins deduplication in Xet-backed repositories. Files are split into smaller chunks at boundaries determined by the data itself rather than at fixed offsets, so an edit only changes the chunks around it, and only unique chunks are stored. In theory, this saves space and speeds up transfers; in practice, it is more complex. The goal is not deduplication for its own sake but a faster upload and download experience that speeds up collaboration and experimentation for AI teams. The challenge is balancing deduplication efficiency against the need to scale without overwhelming network and infrastructure resources.
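To make the idea concrete, here is a minimal sketch of a content-defined chunker in Python. It is not the Xet implementation: the gear-style rolling hash, the size limits, and the names chunk_boundaries and chunk_hashes are assumptions chosen for illustration. The point it shows is that a boundary depends only on the most recent bytes, so inserting data near the start of a file leaves most later chunks unchanged.

```python
import hashlib
import random

# Hypothetical parameters for illustration; real systems use much larger chunks.
MIN_SIZE = 256                      # never cut before this many bytes
MAX_SIZE = 4096                     # always cut at this many bytes
MASK = ((1 << 10) - 1) << 54        # top 10 bits: a cut about every 1024 bytes

# Fixed pseudo-random table so that boundaries are reproducible.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear-style rolling hash: bytes older than 64 positions shift out.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        if (h & MASK) == 0 or size >= MAX_SIZE:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)      # trailing chunk, possibly short


def chunk_hashes(data: bytes):
    """Hash each chunk; equal hashes mean the chunk is already stored."""
    return [hashlib.sha256(data[s:e]).hexdigest()
            for s, e in chunk_boundaries(data)]
```

Comparing chunk_hashes(data) with chunk_hashes(b"hello" + data) on a large input should show that only the first chunk or two differ, which is what makes deduplication across file versions possible.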

Scaling Deduplication with Aggregation

Uploading large repositories, like a 200GB model, can result in millions of chunk entries in a content-addressed store (CAS). At Hugging Face, managing this scale is challenging, as it can lead to network and infrastructure overheads, such as a high volume of requests and rising costs. A purely chunk-based approach, where each chunk is tracked individually, would generate billions of requests and be unsustainable.

The solution, therefore, is not merely to minimize chunks but to optimize how they are transferred and stored by introducing blocks and shards. Rather than transferring and storing each chunk individually, chunks are bundled into blocks of up to 64MB, reducing the number of entries in the CAS by a factor of 1,000. Shards map files to chunks, making it possible to track which parts of a file have changed without redundant data transfers.
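The following Python sketch illustrates the aggregation step under stated assumptions. The 64MB block cap comes from the text; the 64 KiB chunk size, the Block class, the shard layout, and the file name used in the example are hypothetical and only serve to show how bundling shrinks the number of tracked entries.

```python
import hashlib
from dataclasses import dataclass, field

MAX_BLOCK_BYTES = 64 * 1024 * 1024   # block cap stated in the article
ASSUMED_CHUNK_BYTES = 64 * 1024      # assumption: the article gives no chunk size


@dataclass
class Block:
    chunk_hashes: list = field(default_factory=list)
    size: int = 0

    @property
    def block_id(self) -> str:
        # Hypothetical block identifier derived from the chunk hashes it holds.
        return hashlib.sha256("".join(self.chunk_hashes).encode()).hexdigest()


def pack_into_blocks(chunks, max_block_bytes=MAX_BLOCK_BYTES):
    """Greedily bundle (chunk_hash, size) pairs into blocks under the cap."""
    blocks = []
    current = Block()
    for chunk_hash, size in chunks:
        if current.chunk_hashes and current.size + size > max_block_bytes:
            blocks.append(current)
            current = Block()
        current.chunk_hashes.append(chunk_hash)
        current.size += size
    if current.chunk_hashes:
        blocks.append(current)
    return blocks


def build_shard(file_name, blocks):
    """Map a file to (block_id, first_chunk, end_chunk) ranges (hypothetical layout)."""
    return {file_name: [(b.block_id, 0, len(b.chunk_hashes)) for b in blocks]}


def entries_needed(total_bytes, unit_bytes):
    """Number of CAS entries if data is tracked in units of unit_bytes."""
    return -(-total_bytes // unit_bytes)   # ceiling division
```

With these assumed sizes, entries_needed(200 * 2**30, ASSUMED_CHUNK_BYTES) gives about 3.3 million chunk entries for a 200 GiB upload, while entries_needed(200 * 2**30, MAX_BLOCK_BYTES) gives 3,200 block entries, which matches the roughly 1,000-fold reduction described above.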

To further optimize this system, Hugging Face introduced key chunks, a small subset of chunks that is indexed globally. Restricting the global index to this subset improves deduplication across repositories while reducing unnecessary network queries.
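The article does not say how key chunks are picked. One plausible rule, shown here purely as an assumption, is to select chunks whose hash satisfies a simple modulo condition, so that any client can decide locally whether a chunk is a key chunk. The modulus and the function names below are hypothetical.

```python
KEY_CHUNK_MODULUS = 1024   # hypothetical: about 1 in 1024 chunks is a key chunk


def is_key_chunk(chunk_hash_hex: str, modulus: int = KEY_CHUNK_MODULUS) -> bool:
    """Decide from the hash alone whether a chunk belongs in the global index."""
    return int(chunk_hash_hex, 16) % modulus == 0


def build_global_index(shards):
    """Build a key-chunk index from {shard_id: [chunk_hash_hex, ...]}.

    Only key chunks are indexed, so the index stays small even when the
    shards together list a very large number of chunks.
    """
    index = {}
    for shard_id, hashes in shards.items():
        for chunk_hash in hashes:
            if is_key_chunk(chunk_hash):
                index.setdefault(chunk_hash, shard_id)   # first shard wins
    return index
```

Because the rule depends only on the hash, uploaders and the index agree on which chunks are worth a global lookup without exchanging any extra state.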

Optimizing Uploads: A Real-World Example

For instance, when a repository like gemma-2-9b-it-GGUF (which contains multiple quantizations of the same model) is uploaded, the system takes advantage of overlap between chunks to avoid redundant storage. Instead of uploading 191GB of data, the optimized version, after deduplication, only requires 97GB. This is a massive reduction in storage requirements and a major speed improvement, cutting upload time by nearly half.
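As a rough illustration of where such savings come from, the Python sketch below counts each unique chunk once across several files. The file names and chunk lists are made up; only the 191GB and 97GB figures come from the text.

```python
def upload_plan(files):
    """Return (raw_bytes, unique_bytes) for {name: [(chunk_hash, size), ...]}."""
    seen = set()
    raw_bytes = 0
    unique_bytes = 0
    for chunks in files.values():
        for chunk_hash, size in chunks:
            raw_bytes += size
            if chunk_hash not in seen:       # each unique chunk is sent once
                seen.add(chunk_hash)
                unique_bytes += size
    return raw_bytes, unique_bytes


def savings_fraction(raw, unique):
    """Fraction of the raw size that deduplication avoids transferring."""
    return (raw - unique) / raw


# Toy example: two hypothetical quantizations sharing two of three chunks.
toy_files = {
    "model-q4.gguf": [("a", 10), ("b", 10), ("c", 10)],
    "model-q8.gguf": [("a", 10), ("b", 10), ("d", 10)],
}
toy_raw, toy_unique = upload_plan(toy_files)   # 60 raw bytes, 40 unique bytes

# Figures from the article: 191GB raw, 97GB after deduplication.
article_savings = savings_fraction(191, 97)    # about 0.49
```

A savings fraction of about 0.49 is consistent with the claim that upload time is cut by nearly half, assuming upload time scales with the number of bytes sent.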

Similarly, the

What Undercode Says:

The approach taken by Hugging Face’s Xet team demonstrates a shift in focus from simply storing data efficiently to creating a seamless and efficient experience for developers. While deduplication is still an important part of the strategy, it’s now seen as one tool in a broader goal to speed up the entire workflow—from uploading to downloading and managing models.

The decision to move beyond a strict chunk-based model and introduce block aggregation is an important one: it reduces the number of entries that must be tracked and minimizes the network requests and infrastructure strain that would otherwise arise. This combination of blocks, shards, and key chunks keeps data transfer scalable as the number of models and datasets on the Hub continues to grow.

Why Aggregation is Crucial for Scalability

Aggregation—grouping chunks into larger blocks—is a fundamental design choice that directly addresses the limitations of a purely chunk-based approach. If each chunk required its own network query, it would be impossible to scale this system to the millions of repositories currently hosted on the Hub. By bundling data into blocks, Hugging Face achieves a significant reduction in the number of requests made to the storage layer, which in turn lowers both the computational load and the financial cost associated with managing such large-scale data.

Another key aspect of the aggregation strategy is the use of shards to efficiently map file changes. Shards not only reduce the need for redundant uploads but also make it easy to track modifications in individual files, providing a streamlined experience for developers. By reducing the amount of data that has to be uploaded and downloaded, the system minimizes the time spent managing large datasets, enabling AI teams to focus on innovation rather than logistics.

Key Chunks and Local Deduplication: The Future of Data Transfers

The use of key chunks as a subset of data that is indexed globally is another innovative step in reducing the time and complexity of data transfer. This feature works hand-in-hand with the idea of local deduplication, which ensures that AI builders don’t have to repeatedly download the same chunks, thus dramatically improving download speeds.
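A minimal sketch of local deduplication on the download side might look like the following. The cache is a plain dictionary and fetch is a caller-supplied function; both are hypothetical stand-ins, not part of any Hugging Face client API.

```python
def plan_download(needed_hashes, local_cache):
    """List the chunk hashes that still have to be fetched, each at most once."""
    missing = []
    seen = set()
    for chunk_hash in needed_hashes:
        if chunk_hash in local_cache or chunk_hash in seen:
            continue
        seen.add(chunk_hash)
        missing.append(chunk_hash)
    return missing


def download(needed_hashes, local_cache, fetch):
    """Reassemble a file, calling fetch(chunk_hash) only for uncached chunks."""
    for chunk_hash in plan_download(needed_hashes, local_cache):
        local_cache[chunk_hash] = fetch(chunk_hash)
    return b"".join(local_cache[h] for h in needed_hashes)
```

If a file needs chunks a, b, a, c, b and the cache already holds a, only b and c are fetched, each exactly once, and the file is rebuilt from the cache.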

By leveraging the principle of spatial locality, where chunks that sit next to a matching chunk are likely to be found in the same shard, Hugging Face improves deduplication even further. Chunks that appear in different repositories or versions of a model are reused rather than re-uploaded or re-downloaded, which reduces the cost of maintaining multiple versions of a model.
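The interplay between key chunks and spatial locality can be sketched as a two-pass upload plan. Everything here is an illustrative assumption: the global index is a dictionary from key-chunk hash to shard id, shards is a dictionary from shard id to the chunk hashes it lists, and is_key is any predicate that marks key chunks.

```python
def plan_upload(chunk_hashes, global_index, shards, is_key):
    """Return (chunks_to_upload, global_queries) for a new file.

    Pass 1 queries the global index for key chunks only. On a hit, the whole
    shard is pulled in, on the assumption that its neighbouring chunks are
    likely to match too. Pass 2 uploads whatever is still unknown.
    """
    known = set()
    queries = 0
    for chunk_hash in chunk_hashes:
        if is_key(chunk_hash) and chunk_hash not in known:
            queries += 1
            shard_id = global_index.get(chunk_hash)
            if shard_id is not None:
                known.update(shards[shard_id])

    to_upload = []
    for chunk_hash in chunk_hashes:
        if chunk_hash not in known:
            known.add(chunk_hash)        # avoid uploading a repeat twice
            to_upload.append(chunk_hash)
    return to_upload, queries
```

In this sketch a file of six chunks that shares one key chunk with an existing shard needs only as many global queries as it has key chunks, while the non-key chunks listed in that shard are deduplicated without any query of their own.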

The Power of Optimization in AI Development

The real impact of these optimizations becomes clear when considering the workflows of AI developers. Every second saved in uploading or downloading models directly contributes to productivity, and reducing infrastructure overhead allows developers to dedicate more time to model creation, training, and experimentation. With Xet-backed repositories, Hugging Face ensures that the technology is designed not just for storage efficiency, but to foster an environment where rapid experimentation and collaboration can thrive.

As the field of AI continues to evolve, managing the scale and speed of data transfers will be increasingly important. Hugging Face’s approach to optimizing data flow on the Hub, through a combination of block aggregation, key chunk indexing, and local deduplication, provides a blueprint for how large-scale data repositories can be managed efficiently without sacrificing performance. The benefits of these optimizations are not only seen in storage but also in the real-time development cycle, where minimizing bottlenecks can lead to faster iteration and quicker deployment of models and datasets.

In conclusion, as more Xet-backed repositories are rolled out across the Hub, these improvements promise to make file transfers feel virtually invisible, enabling AI builders to focus on what matters most—creating and refining innovative models.

References:

Reported By: https://huggingface.co/blog/from-chunks-to-blocks