Teaching Open AI Agents to Write CUDA Kernels: How Claude Upskilled Smaller Models

Writing optimized CUDA kernels, a specialized GPU programming skill, is hard even for advanced models. A new approach built on the Upskill tool shows that a large model such as Claude Opus 4.5 can generate reusable “agent skills” that let smaller or cheaper models handle such tasks effectively. The approach lowers compute cost and makes domain-specific knowledge easier to share across AI systems.

Understanding Agent Skills

Agent skills are structured files that bundle instructions, documentation, and code so that a model can carry out a complex task more reliably. Because the context is stored as markdown and scripts, a skill is portable, shareable, and reviewable across different models. Skills pay off most on tasks that models handle poorly unaided, such as CUDA kernel development, rather than on routine coding.
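As a rough illustration of what "context stored as markdown" can look like, the sketch below writes a minimal skill file to disk. The `SKILL.md` file name and the `name`/`description` frontmatter fields follow the common Agent Skills convention and are assumptions here, as are the helper names and the example content; the article does not show the files Upskill actually emits.

```python
import tempfile
from pathlib import Path


def render_skill(name: str, description: str, body: str) -> str:
    """Return skill text: a small frontmatter header followed by markdown."""
    frontmatter = f"---\nname: {name}\ndescription: {description}\n---\n"
    return frontmatter + "\n" + body.strip() + "\n"


def write_skill(root: Path, name: str, description: str, body: str) -> Path:
    """Create <root>/<name>/SKILL.md (assumed layout) and return its path."""
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(render_skill(name, description, body), encoding="utf-8")
    return skill_file


# Hypothetical skill content, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    path = write_skill(
        Path(tmp),
        "cuda-kernels",
        "Notes for writing CUDA kernels.",
        "## Target hardware\n\nH100: compute capability 9.0.",
    )
    skill_text = path.read_text(encoding="utf-8")
```

Because the result is plain text in a directory, it can be reviewed in a pull request and copied between tools like any other file.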

Claude as the Teacher Model

The Upskill workflow begins with a large model, Claude Opus 4.5, acting as the teacher. Claude is instructed to build a CUDA kernel interactively, documenting each step and producing a trace. The trace records the model’s decisions, errors, and optimizations, and it becomes the foundation for the skill. Iterating on draft skills improves performance and also provides a way to evaluate the smaller models that will eventually use the skill.

Creating the Skill File

Once the teacher completes the task, the next step is skill generation. Several methods exist: creating a skill directly within the session, using Anthropic’s “skill creator,” or using Upskill to produce a skill from the agent trace. Upskill also generates test cases to validate the skill. If the teacher model performs as well with the skill as it did in the original session, that parity suggests the skill captures the task knowledge accurately.

Transferring Skills to Smaller Models

With the skill validated, it can now be deployed to smaller, local, or cost-efficient models. Skills adhere to a standardized directory structure, making integration straightforward. Upskill evaluates the skill on these models by running test cases both with and without the skill. Results often show dramatic improvements in task performance or efficiency, though some models may require further iterations. This teacher-to-student approach ensures that domain expertise can be leveraged widely without needing high-cost computation at every step.
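The with-and-without comparison described above can be sketched as follows. This is not Upskill's implementation: `pass_rate`, `compare`, and the two stand-in runners are hypothetical, and the outcomes are made up to show the shape of the calculation.

```python
from typing import Callable, Dict, Sequence


def pass_rate(run_case: Callable[[str], bool], cases: Sequence[str]) -> float:
    """Fraction of test cases for which run_case returns True."""
    if not cases:
        raise ValueError("need at least one test case")
    return sum(1 for case in cases if run_case(case)) / len(cases)


def compare(
    baseline: Callable[[str], bool],
    with_skill: Callable[[str], bool],
    cases: Sequence[str],
) -> Dict[str, float]:
    """Run the same cases both ways and report the gain in percentage points."""
    base = pass_rate(baseline, cases)
    skilled = pass_rate(with_skill, cases)
    return {
        "baseline": base,
        "with_skill": skilled,
        "lift_points": round((skilled - base) * 100, 1),
    }


# Stand-ins for "ask the model, then check its answer"; outcomes are invented.
cases = ["case-1", "case-2", "case-3", "case-4", "case-5"]
passes_without_skill = {"case-1", "case-2"}
passes_with_skill = {"case-1", "case-2", "case-3", "case-4"}

report = compare(
    lambda case: case in passes_without_skill,
    lambda case: case in passes_with_skill,
    cases,
)
```

Running both arms on an identical case list is what makes the difference attributable to the skill rather than to a change in the test set.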

Optimizing Performance Beyond Accuracy

Performance evaluation is not limited to accuracy. Some models reach the same result with fewer tokens when a skill is applied, which reduces cost and latency. For example, the MoonshotAI Kimi-K2-Thinking model improved in both accuracy and token efficiency, whereas Claude Opus 4.5 used more tokens without a performance gain. A skill therefore needs to be evaluated, and refined where necessary, on each model it is meant to serve.
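One way to turn the two measurements into a decision is sketched below. The `RunStats` type, the `verdict` rule, and every number are illustrative assumptions rather than anything Upskill reports; the point is only that accuracy and token cost are judged together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    accuracy: float    # fraction of test cases passed, 0.0 to 1.0
    avg_tokens: float  # mean tokens consumed per test case


def verdict(baseline: RunStats, skilled: RunStats, tol: float = 1e-9) -> str:
    """Classify a skill by its effect on accuracy and on token usage."""
    gain = skilled.accuracy - baseline.accuracy
    cheaper = skilled.avg_tokens < baseline.avg_tokens
    if gain > tol:
        return "keep" if cheaper else "keep, but watch token cost"
    if abs(gain) <= tol and cheaper:
        return "keep for efficiency"
    return "iterate or drop"


# Made-up numbers mirroring the two outcomes described in the text.
better_and_cheaper = verdict(RunStats(0.60, 4000), RunStats(0.90, 3000))
costlier_no_gain = verdict(RunStats(0.90, 3000), RunStats(0.90, 4200))
```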

Real-World Application: CUDA Kernel Development

Upskill was tested on the challenging task of building CUDA kernels with Hugging Face’s kernel-builder library. The skill generated from Claude’s trace captured detailed domain knowledge: GPU architecture targeting, project structure, memory optimizations, and PyTorch bindings. For instance, it records that the H100 has compute capability 9.0, along with memory alignment rules and asynchronous memory copy requirements. By compressing hours of research into a few hundred tokens, the skill allows smaller models to build production-ready kernels with minimal guidance.
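To make "GPU architecture targeting" concrete, the sketch below maps a GPU name to the `nvcc` code-generation flag for its compute capability. Only the H100 value (9.0) comes from the article; the A100 and T4 entries are added for contrast, and the helper name is hypothetical. It is not taken from the skill itself or from kernel-builder.

```python
# Compute capability as (major, minor). Only H100 is stated in the article;
# the other two entries are well-known values added for comparison.
COMPUTE_CAPABILITY = {
    "H100": (9, 0),
    "A100": (8, 0),
    "T4": (7, 5),
}


def gencode_flag(gpu: str) -> str:
    """Return the nvcc -gencode flag that targets the given GPU."""
    major, minor = COMPUTE_CAPABILITY[gpu]
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


h100_flag = gencode_flag("H100")
```

A model that lacks this fact may target an older architecture or guess a wrong one, which is the kind of mistake a short skill entry can prevent.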

Step-by-Step: Using Upskill

Install Upskill:

    pip install upskill
    # or run it without installing, via uvx
    uvx upskill --help

Generate a Skill from a Trace:

    upskill generate "write NVIDIA kernels" --from ./trace.md

Evaluate Models:

    upskill eval ./skills/my-skill/ --model haiku --model sonnet

Iterate and Refine: Skills can be regenerated and tested to improve both accuracy and token efficiency.
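When the generate and evaluate steps are scripted, it helps to build the commands as argument lists so that a multi-word task description stays a single argument. The sketch below uses only the subcommands and flags shown in the steps above; the helper names are hypothetical, and nothing is executed.

```python
import shlex
from typing import List, Sequence


def generate_command(task: str, trace_path: str) -> List[str]:
    """Argument list for generating a skill from a trace file."""
    return ["upskill", "generate", task, "--from", trace_path]


def eval_command(skill_dir: str, models: Sequence[str]) -> List[str]:
    """Argument list for evaluating a skill, one --model flag per model."""
    if not models:
        raise ValueError("at least one model is required")
    cmd = ["upskill", "eval", skill_dir]
    for model in models:
        cmd += ["--model", model]
    return cmd


# shlex.join renders a shell-safe string, quoting the task description.
generate_line = shlex.join(generate_command("write NVIDIA kernels", "./trace.md"))
eval_line = shlex.join(eval_command("./skills/my-skill/", ["haiku", "sonnet"]))
```

The lists can be passed to `subprocess.run` as they are, which avoids shell quoting problems altogether.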

Performance Results

Evaluation shows impressive results when transferring skills from Claude to smaller models:

Sonnet: 60% → 95% (+35 percentage points)

Local GLM-4.7: 40% → 85% (+45 percentage points)

These improvements highlight that high-performing teacher models can unlock substantial gains for less capable or more affordable student models, enabling cost-effective scaling of complex AI workflows.
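The gains quoted above are differences between two pass rates, so they are percentage points; the relative improvement over each baseline is larger. A short worked calculation on the reported figures (the `lift` helper is only for illustration):

```python
from typing import Tuple


def lift(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    points = after_pct - before_pct
    relative = points / before_pct * 100
    return points, round(relative, 1)


sonnet = lift(60, 95)  # 35 points; 35 / 60 is about 58.3% relative
glm = lift(40, 85)     # 45 points; 45 / 40 is 112.5% relative
```

In relative terms the local GLM-4.7 model slightly more than doubled its pass rate, while Sonnet improved by a little under three fifths.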

What Undercode Says:

Democratizing High-End AI Expertise

Using Upskill, even models running on a laptop can now handle tasks previously limited to high-end AI. The process effectively democratizes technical expertise like CUDA programming, previously the domain of highly specialized engineers.

Cost-Effective Scaling of Domain Knowledge

By separating skill generation from execution, expensive models only need to be used for teaching. Cheaper or local models then inherit these skills, significantly reducing cloud compute costs while maintaining performance.

Iterative Refinement as a Key Strategy

Skill creation is inherently iterative. Initial performance may vary across models, requiring repeated evaluation and improvement. Upskill’s automated test case generation is crucial in streamlining this cycle.

Token Efficiency Matters

Beyond accuracy, token usage impacts operational cost. Some skills can dramatically reduce token consumption for recurring tasks, making them more practical for production systems.

Cross-Model Portability

Skills are standardized and portable across tools like Claude Code, Codex, Cursor, and more. This interoperability ensures that once a skill is created, it can be broadly deployed without rewriting the core logic.

Building a Knowledge Library

Upskill encourages the creation of a reusable skill library. Teams can capture tribal knowledge, internal processes, and specialized expertise, turning AI models into more reliable, consistent contributors to technical workflows.

Enhanced Productivity for Developers

Automating repetitive, specialized tasks like kernel development frees human developers to focus on higher-level problems, accelerating project timelines and innovation.

Practical Implications for AI Teams

Small companies or teams without access to top-tier models can still leverage advanced capabilities. By using Upskill, they gain a competitive edge without prohibitive costs.

🔍 Fact Checker Results

✅ Upskill genuinely allows skill transfer from large to smaller models.
✅ CUDA kernel-building knowledge, including H100 optimizations, is accurately encoded.
⚠️ Skill effectiveness varies across models; some, such as Claude Opus 4.5 in the reported results, use more tokens with no performance gain.

📊 Prediction

The Upskill methodology is likely to become a standard approach for specialized AI tasks. Within 12 months, we can expect broader adoption in industries requiring domain-specific AI expertise, such as scientific computing, autonomous systems, and large-scale ML deployments. By combining high-cost teacher models with low-cost students, AI teams will drastically reduce operational expenses while improving technical accuracy. Agent skill libraries may evolve into core organizational knowledge assets, transforming AI from a tool into a reliable collaborator.

This approach marks a turning point: AI isn’t just executing tasks—it’s teaching itself and its smaller counterparts, enabling faster, smarter, and more cost-effective solutions for highly specialized challenges.

References:

Reported By: huggingface.co