Learning to Build AI Agents with HuggingFace
Over the past few weeks, I took the HuggingFace AI Agents course and completed its final assignment: building an agent that passes a GAIA benchmark test. In this post I'll document what I learned from working on this project.
Acknowledgement
First of all, a huge thank you to Joffrey Thomas, Ben Burtenshaw, Thomas Simonini, Sergio Paniego, and the rest of the HuggingFace team that authored this free course. For anyone looking to get into building AI agents, I would highly recommend checking it out. It starts with the fundamental concepts of AI agents, then gets hands-on with three different frameworks (smolagents, LlamaIndex, LangGraph), and goes on to more advanced topics like retrieval-augmented generation (RAG), multi-agent architectures, fine-tuning, and observability and evaluation of agents.
About the Final Assignment
The final assignment asks us to build an AI agent that can correctly answer at least 30% of a subset of 20 questions taken from the GAIA benchmark's level 1 question set. An answer only counts as correct when it matches the ground truth exactly.
About the GAIA benchmark, to quote directly from the course page:
GAIA is a benchmark designed to evaluate AI assistants on real-world tasks that require a combination of core capabilities—such as reasoning, multimodal understanding, web browsing, and proficient tool use.
The benchmark features 466 carefully curated questions that are conceptually simple for humans, yet remarkably challenging for current AI systems.
To illustrate the gap:
Humans: ~92% success rate
GPT-4 with plugins: ~15%
Deep Research (OpenAI): 67.36% on the validation set
GAIA highlights the current limitations of AI models and provides a rigorous benchmark to evaluate progress toward truly general-purpose AI assistants.
GAIA is carefully designed around the following pillars:
Real-world difficulty: Tasks require multi-step reasoning, multimodal understanding, and tool interaction.
Human interpretability: Despite their difficulty for AI, tasks remain conceptually simple and easy to follow for humans.
Non-gameability: Correct answers demand full task execution, making brute-forcing ineffective.
Simplicity of evaluation: Answers are concise, factual, and unambiguous—ideal for benchmarking.
The project is set up with Gradio and runs on a HuggingFace Space. The questions are retrieved from an API, the agent processes them, and the answers are sent back to the same API for submission and grading.
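That interaction boils down to two calls: fetch the questions, run the agent, and post the answers back. Here is a minimal sketch in Python; the base URL, endpoint paths, and payload fields are assumptions based on my recollection of the course's template app, so double-check them against the actual template.

```python
import requests

# Assumption: the course's scoring API exposes /questions and /submit endpoints.
API_BASE = "https://agents-course-unit4-scoring.hf.space"

def fetch_questions() -> list[dict]:
    # Each item should contain at least a task_id and the question text.
    resp = requests.get(f"{API_BASE}/questions", timeout=30)
    resp.raise_for_status()
    return resp.json()

def submit_answers(username: str, agent_code_url: str, answers: list[dict]) -> dict:
    # answers: [{"task_id": "...", "submitted_answer": "..."}, ...]
    payload = {"username": username, "agent_code": agent_code_url, "answers": answers}
    resp = requests.post(f"{API_BASE}/submit", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()  # includes the overall score
```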
The 30% passing score was actually more challenging than I initially anticipated, but it is achievable without relying on the latest and most powerful models.
Here’s the link to my agent: HuggingFace Space
My Approach & Observations
From the get-go, my goal was to pass this assignment without spending money on inference API usage. That meant trying out local hosting with Ollama and seeking out free inference APIs as much as possible. Of course, the tradeoff here is the performance of the models available.
Another thing I found out when checking the assignment leaderboard was that a lot of the submissions with perfect scores were just doing RAG over a vector database containing the entire set of 20 questions and their ground truths from the GAIA benchmark, which in my opinion defeats the purpose. So I decided to steer clear of RAG and fine-tuning for this assignment.
Setting Up Inference API
My first attempt was local hosting with Ollama, but I quickly realized my GPU's limited VRAM wouldn't allow me to run any models beyond 3B parameters. I then tried HuggingFace's InferenceClient, but quickly hit the monthly credit limit.
After some further searching, I switched to TogetherAI. They offered $1.00 of free credit upon sign-up and free access to several models, including the Llama-3.3-70B-Instruct-Turbo-Free model that my agent mostly relied on. It does have a limited context length and rate limits, but it sufficed for this assignment.
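Since Together's API is OpenAI-compatible, pointing a LangChain chat model at it takes only a few lines. This is a sketch rather than my exact configuration; the base URL and model id follow Together's public docs.

```python
import os
from langchain_openai import ChatOpenAI

# Together exposes an OpenAI-compatible endpoint, so ChatOpenAI can talk to it directly.
llm = ChatOpenAI(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
    temperature=0,
)
```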
Agent Tools
Once the main inference API was settled, I started building the tools for my agent based on the types of tasks in the question set. These are the ones I came up with:
web_search: searches the web with Brave Search
wiki_search: searches Wikipedia
python_repl: Python code interpreter
get_youtube_transcript: retrieves the transcript of a YouTube video
speech_recognition: transcribes an audio file using whisper-tiny
reverse_string: reverses the character order of the input string
query_image: analyzes a given image using Qwen2.5-VL-3B-Instruct
query_video: analyzes a given video using Qwen2.5-VL-3B-Instruct
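To give a sense of how these are wired up, here's a rough sketch of two of the simpler tools using LangChain's @tool decorator. The real implementations differ in details like error handling, and the whisper call just follows the standard transformers ASR pipeline usage.

```python
from langchain_core.tools import tool
from transformers import pipeline

@tool
def reverse_string(text: str) -> str:
    """Reverse the character order of the input string."""
    return text[::-1]

# Load whisper-tiny once so repeated transcriptions don't reload the model.
_asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

@tool
def speech_recognition(audio_path: str) -> str:
    """Transcribe an audio file using whisper-tiny."""
    return _asr(audio_path)["text"]
```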
Out of all the tools here, the reverse_string tool is the most specific; in fact, it was added just because one question had its characters in reverse order. The LLM by itself was smart enough to realize the sentence was reversed, but struggled to reverse it back correctly without this tool.
Choice of Models and Agent Architecture
For the agent itself, my experiment started with a single ReAct agent built with LangGraph, running Llama-3.3-70B-Instruct-Turbo-Free, with access to the above tools. I found that while it was able to correctly make tool calls more than half of the time and solve tasks like transcribing audio, retrieving video transcripts, and reversing strings, it struggled with the web and Wikipedia research tasks that require a longer context length and extended reasoning capability.
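For reference, that single-agent version is only a few lines with LangGraph's prebuilt ReAct agent. This is a sketch, not my exact code; llm is the Together-hosted Llama 3.3 client from earlier and the tool names refer to the list above.

```python
from langgraph.prebuilt import create_react_agent

# llm: the Llama-3.3-70B-Instruct-Turbo-Free client; tools: those described above.
agent = create_react_agent(llm, tools=[
    web_search, wiki_search, python_repl, get_youtube_transcript,
    speech_recognition, reverse_string, query_image, query_video,
])

result = agent.invoke({"messages": [("user", question)]})
final_answer = result["messages"][-1].content
```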
To tackle the research tasks, I chose Qwen3-235B-A22B-fp8-tput from TogetherAI, which is priced at $0.20 per 1M input tokens and $0.60 per 1M output tokens. At this point, I decided to switch to a multi-agent setup and restrict the usage of the Qwen3 model to just the research agent to minimize credit usage.
With the multi-agent setup, I initially created several ReAct agents that each specialize in research, coding, vision, or video transcription, plus a supervisor agent that takes in the queries, delegates the tasks to the appropriate specialized agents, then processes their responses to form the final answers.
When I used Llama-3.3-70B-Instruct-Turbo-Free for all the agents other than the research agent, I discovered that at the specialized-agent level these agents somehow started failing tool calls more often. The problem went away when I switched all agents to Qwen3-235B-A22B-fp8-tput, but I still wanted to minimize credit usage.
To fix this problem, I scrapped all the specialized agents, kept only the supervisor and the research agent, and gave the supervisor all the tools except web_search and wiki_search. It turns out this simpler setup gave the best overall function-calling performance without adding to the credit cost.
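One way to wire up that two-agent layout is to expose the research agent to the supervisor as a callable tool. I'm not claiming this is exactly how my graph is built, but it captures the structure; research_llm and supervisor_llm stand for the Qwen3 and Llama 3.3 clients respectively.

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

# research_llm: Qwen3-235B-A22B-fp8-tput; supervisor_llm: Llama-3.3-70B-Instruct-Turbo-Free.
research_agent = create_react_agent(research_llm, tools=[web_search, wiki_search])

@tool
def research(question: str) -> str:
    """Delegate a web/Wikipedia research task to the research agent."""
    result = research_agent.invoke({"messages": [("user", question)]})
    return result["messages"][-1].content

# The supervisor keeps every other tool and calls `research` for research tasks.
supervisor = create_react_agent(supervisor_llm, tools=[
    research, python_repl, get_youtube_transcript, speech_recognition,
    reverse_string, query_image, query_video,
])
```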
When it comes to the vision tasks involving images and videos, Qwen2.5-VL-3B-Instruct was run locally rather than through an API, so my hardware ruled out any larger model. It was able to generate descriptions of the visual content, even recognizing some bird species, but struggled to answer the questions correctly. I think the Google Gemini models might be worth a try here, but I haven't gotten around to it.
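For completeness, running Qwen2.5-VL locally follows the usual transformers pattern from the model card. Treat this as a sketch: the function signature is illustrative, and the qwen_vl_utils helper is an extra package.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def query_image(image_path: str, question: str) -> str:
    # Build a chat-style message containing the image and the question.
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": question},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens so only the generated answer is decoded.
    trimmed = output_ids[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```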
Other Things to Note
A tricky part of this assignment is that the answers need to match the ground truths exactly. For example, if the ground truth of a question has its first letter in uppercase and the agent answers in lowercase, it doesn't get the score. This also varies from question to question, so a hardcoded fix wouldn't work. Details like this need to be addressed in the system prompt, and in this regard Llama-3.3-70B-Instruct-Turbo-Free seemed to follow the system prompt better than Qwen3-235B-A22B-fp8-tput did.
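I won't reproduce my exact prompt here, but the formatting instructions were along these lines, modeled on GAIA's recommended answer format; consider this an illustrative example rather than a copy of what I used.

```python
SYSTEM_PROMPT = """You are a general AI assistant answering benchmark questions.
Finish your reply with a line of the form: FINAL ANSWER: [YOUR FINAL ANSWER]
The final answer should be a number, as few words as possible, or a
comma-separated list of numbers and/or strings. Do not use units or thousand
separators in numbers unless asked to. Do not use articles or abbreviations
in strings, and match the expected capitalization exactly."""
```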
Tool names and tool parameter names are important; in general they should be simple but precise and logical. I noticed the agent sometimes used the wrong parameter name in a function call, causing it to fail, even though the tools were listed in the system prompt. Changing the parameter name to the one the model was looking for could resolve the problem, for example renaming the input parameter of query_video from url to video_url.
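In tool terms, that fix is nothing more than a rename in the signature; the body is elided in this sketch since only the parameter name matters here.

```python
from langchain_core.tools import tool

@tool
def query_video(video_url: str) -> str:  # parameter renamed from `url` to `video_url`
    """Analyze the video at video_url using the local vision model."""
    # Body elided; the point is that the parameter name matches what the
    # model tends to write in its tool calls.
    raise NotImplementedError
```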
This brings me to tracing tools for tracking agent execution; I highly recommend incorporating one early in agent development. I used LangSmith: it's easy to set up and shows each step the agent took, the JSON of each function call, the time taken, exceptions and errors, and the token usage for each task execution.
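Getting LangSmith tracing going is mostly a matter of environment variables; here's a minimal sketch (the project name is arbitrary).

```python
import os

# With these set, LangChain/LangGraph runs are traced to LangSmith automatically.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "gaia-agent"  # arbitrary project name
```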
Lastly, the course has a Discord server where fellow learners help each other; it's the first place to look when running into technical problems with the course material.
Next Steps
For me, this assignment has been a great way to practice building competent AI agents. Out of curiosity, I also added a LlamaIndex agent in the same HuggingFace Space, just to explore a different framework. I’ll be continuing on my learning journey, and I’m excited about the idea of building agents for more practical real-life automation applications.
If you've read up to this point, thank you, and I hope this was helpful in some way :)