Xet Revolutionizes Storage on Hugging Face Hub: A Major Step Forward in AI Collaboration

The Hugging Face Hub, a platform that enables AI developers to access and share models and datasets, has undergone a significant transformation. In a groundbreaking move, the Hugging Face team introduced Xet, a new storage solution that promises to enhance the speed and efficiency of working with massive models and datasets. This shift marks a major milestone in the team’s vision to improve collaboration and optimize performance for AI builders. Let’s dive into the details of how Xet is making a difference and what challenges the team faced during this migration.

Xet Storage Migration: A New Era for Hugging Face Hub

Over recent weeks, Hugging Face has been migrating repositories on the Hub from Git LFS to Xet storage, moving real models and datasets onto the new backend and validating it against live traffic.

In this article, we’ll explore the journey of getting Xet on the Hub, including the challenges the team encountered, the innovations behind the Xet system, and the real-world results that were achieved.

The Xet Difference: Redefining Large File Storage

Traditionally, Hugging Face repositories used Git LFS (Large File Storage), which works by keeping large files in separate object storage and recording lightweight pointers in the repository. However, LFS had limitations, particularly around deduplication and handling large, multi-gigabyte files: even a small change to a file meant re-uploading the entire file, which was inefficient and time-consuming.

Xet, on the other hand, uses content-defined chunking (CDC) to deduplicate data at the chunk level, breaking files into chunks of roughly 64KB. When a small change is made to a file, only the altered chunks are uploaded, saving bandwidth and significantly improving upload speeds.
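To make the mechanism concrete, here is a minimal sketch of content-defined chunking in Python, using a simple gear-style rolling hash to decide chunk boundaries. The ~64KB target matches the figure above, but the hash function, mask, and size limits are illustrative assumptions rather than Xet's actual implementation.

```python
import hashlib

TARGET_CHUNK = 64 * 1024           # ~64 KB target, as described above
MASK = TARGET_CHUNK - 1            # boundary where the low bits of the hash are zero
MIN_CHUNK = 16 * 1024              # guard rails keep chunk sizes reasonable
MAX_CHUNK = 256 * 1024

# A toy "gear" table: one fixed pseudo-random 64-bit value per possible byte value.
GEAR = [
    int.from_bytes(hashlib.blake2b(bytes([b]), digest_size=8).digest(), "big")
    for b in range(256)
]

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks.

    Boundaries depend only on nearby byte content, so an edit in the middle
    of a file shifts the chunks around it, not every chunk that follows.
    """
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)
```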

For instance, if a 5GB SQLite database is updated by just 1MB, LFS would require re-uploading the entire 5GB file, whereas Xet uploads only the chunks covering the changed ~1MB of data. This makes large file uploads up to 130 times faster compared to the previous LFS system.
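Deduplication then reduces to comparing chunk hashes against what the store already holds and sending only the unknown ones. The sketch below assumes a chunker like the one above; the known_hashes set standing in for server-side state is hypothetical.

```python
import hashlib
from typing import Iterable

def chunks_to_upload(chunks: Iterable[bytes], known_hashes: set[str]) -> list[bytes]:
    """Return only the chunks whose content hash the store has not seen before.

    In the 5GB-database example, nearly every chunk hash is already known,
    so only the handful of chunks covering the edited ~1MB region (plus any
    chunks whose boundaries shifted nearby) need to travel over the network.
    """
    new_chunks = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in known_hashes:
            new_chunks.append(chunk)
            known_hashes.add(digest)
    return new_chunks
```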

The Migration Process: A Complex But Successful Transition

On February 20th, the Hugging Face team embarked on a large-scale migration to move repositories from LFS to Xet. The migration was meticulously planned, with internal tooling built to facilitate the process and the ability to roll back to LFS if needed. The team successfully migrated 4.5 TB of data by the end of the day without major disruptions.

The first step was validating the system with real-world usage, ensuring that repositories could be accessed by a variety of platforms, libraries, and development environments. This migration shifted about 6% of the Hub’s download traffic to Xet infrastructure, providing invaluable insight into how the new system performed under real user conditions.

However, as with any major migration, there were challenges. These included download overhead from the new block format and unexpected load imbalances in the content-addressed store (CAS) cluster. Despite these hurdles, the Xet team continued to iterate on the system, making architectural improvements to optimize performance.

Overcoming Post-Migration Hurdles: Lessons Learned

In the aftermath of the migration, the team identified several performance bottlenecks. One significant issue was related to the download overhead from the new block format. Requests for partial data from a block resulted in inefficient data transfer, as the system would read entire blocks rather than just the requested range. To resolve this, the team updated the block format to store chunk-length metadata, which reduced the download latency by 35%.
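As a rough illustration of what chunk-length metadata buys, the sketch below shows how a per-block index of chunk lengths lets a reader translate a request for a few chunks into a single byte range instead of fetching the entire block. The field names and sizes here are assumptions for illustration, not the actual Xet block format.

```python
from dataclasses import dataclass

@dataclass
class BlockIndex:
    """Per-block metadata: the stored length of each chunk, in order.

    Offsets are recovered by prefix-summing the lengths, so storing lengths
    alone is enough to locate any chunk inside the block.
    """
    chunk_lengths: list[int]

    def byte_range(self, first_chunk: int, last_chunk: int) -> tuple[int, int]:
        """Map a run of chunk indices to (start, end) byte offsets within the block."""
        start = sum(self.chunk_lengths[:first_chunk])
        end = start + sum(self.chunk_lengths[first_chunk : last_chunk + 1])
        return start, end

# Example: fetch only chunks 100-103 instead of the whole (hypothetical) 64MB block.
index = BlockIndex(chunk_lengths=[64 * 1024] * 1024)
start, end = index.byte_range(100, 103)
# The client can now issue a ranged read, e.g. an HTTP "Range: bytes=<start>-<end-1>" request.
```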

Another challenge came from pod load imbalances in the CAS cluster, where some pods experienced spikes in active uploads while others were underutilized. The root cause was traced to unflushed page cache writes, which caused memory pressure and throughput throttling. To mitigate this, the team implemented limits on concurrent uploads per pod and adjusted the load balancing algorithm to better distribute the traffic.
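The mitigation described above boils down to capping in-flight uploads on each pod; a minimal asyncio sketch of that idea follows. The cap value and function names are hypothetical, not Hugging Face's actual CAS code.

```python
import asyncio

MAX_CONCURRENT_UPLOADS = 8                      # hypothetical per-pod cap
upload_slots = asyncio.Semaphore(MAX_CONCURRENT_UPLOADS)

async def handle_upload(payload: bytes) -> None:
    """Admit an upload only when a slot is free, so no single pod can buffer
    enough unflushed writes to build up page-cache memory pressure."""
    async with upload_slots:
        await write_to_storage(payload)

async def write_to_storage(payload: bytes) -> None:
    # Stand-in for writing and flushing the payload to the backing store.
    await asyncio.sleep(0)
```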

The Takeaways: Scaling for Success

The migration to Xet was not without its bumps, but it led to significant improvements in the system’s design and performance. The lessons learned during this process will continue to benefit the Hugging Face community by ensuring that future byte transfers on the Hub are faster and more reliable. Real-world load testing proved invaluable in exposing edge cases and guiding system improvements, ensuring that Xet storage can handle the demands of the entire Hugging Face ecosystem.

With the initial migration now complete, the Xet storage system is fully integrated into the Hugging Face Hub. Users can now expect faster upload and download speeds, smoother collaboration on large models and datasets, and a more efficient way to manage AI workflows.

What Undercode Says: A Deeper Analysis

The introduction of Xet on the Hugging Face Hub signifies a major shift in how large-scale AI models and datasets are handled. The need for efficient file storage and transfer has been a critical pain point for AI developers working with enormous datasets. Xet’s use of content-defined chunking is a smart solution to the problem of inefficiency in data uploads and downloads. This approach not only speeds up the process but also significantly reduces the amount of data being transferred, which is a major win for developers working with limited bandwidth or large-scale models.

What stands out is the careful, iterative approach the Hugging Face team took to test and refine the system. Migrating 4.5 TB of data in a single day without major disruptions is a testament to the team’s preparation and expertise. The post-migration adjustments, particularly to the block format and pod load balancing, demonstrate a strong commitment to optimizing the system’s performance in real-world conditions.

However, the article also highlights an important truth: even with rigorous planning, the challenges of scaling such a system can only truly be understood once real users begin interacting with it. Hugging Face’s approach to incremental migration allowed them to address issues as they arose without causing significant downtime for users.

Additionally, the improvements to Xet will likely have long-term benefits for AI development. As more users adopt the platform, the efficiency of the system will only increase, reducing bottlenecks in the collaborative process. For AI researchers and developers, this means faster iterations, fewer frustrations with slow uploads or downloads, and more time spent on building and testing their models.

Overall, Xet represents a step toward more scalable, reliable, and efficient AI infrastructure, and its integration into the Hugging Face Hub will undoubtedly be a game-changer for the AI community.

Fact Checker Results

  1. The migration of 4.5 TB of data was completed without any major disruptions, demonstrating the effectiveness of Xet storage for real-world usage.
  2. Post-migration adjustments led to a 35% reduction in download latency by optimizing the block format and streamlining data retrieval.
  3. The Xet system’s architecture has been refined to address challenges like load imbalances and inefficient data transfer, ensuring better scalability and performance moving forward.

References:

Reported By: https://huggingface.co/blog/xet-on-the-hub