
What I learned from the Mastering LLMs course


I recently signed up for and completed the Mastering LLMs course on Maven, and this post shares what I learned from it.

Mastering LLMs course certificate (My certificate can be viewed online at https://maven.com/certificate/BBBPHevx)

This course was initially focused on LLM fine-tuning, but as more instructors joined, it somehow evolved into an LLM conference. Being based in Singapore, I usually could only watch the recordings later as the live sessions were often at inconvenient hours. That said, the advantage of watching the recordings is I could speed up playback by 1.5 to 2 times (depending on the speakers), pause when needed, rewind, and rewatch parts for better understanding.

Do not fine-tune until you prove that Prompt Engineering and RAG don't work for you

A common approach is to start with Prompt Engineering, follow with Retrieval Augmented Generation (RAG), and only then consider fine-tuning. The reason is obvious: fine-tuning costs more, takes more work, and has a slower feedback loop than the other two approaches. I already knew this before the course; what I learned from it is the added benefit of using the Prompt Engineering and RAG results as baselines for fine-tuning.

Fine-tuning may yield worse results

Beware: fine-tuning is not a guaranteed improvement! For instance, one session shared an example where a model fine-tuned on Slack chat data actually performed worse, because it kept trying to emulate the messaging style of the Slack users rather than answering questions based on the past conversations. (I tried to track down the specific example after the course but couldn't find it.)

Axolotl + Cloud infrastructure make fine-tuning very approachable

The course participants received in total $3,500 worth of compute credits for various providers (JarvisLabs, Modal, OpenAI, HuggingFace, Weights and Biases, etc.)

Thanks to the compute credits, I had the chance to experiment with fine-tuning a model using Axolotl on JarvisLabs and Modal. Modal is great for running code remotely with minimal setup. However, for fine-tuning, I preferred JarvisLabs as it offered me the control I needed without unnecessary complexity. I really don't need the bells and whistles that Modal provides.

Evals

This is interesting as the instructors have different opinions.

One basic method involves writing pytest unit tests to assert expected behaviors. This is unconventional for a software engineer like myself because unit tests usually avoid external calls, but here they recommend running such tests in production to act as guardrails.
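As a rough illustration of that style of test, here is a minimal sketch. The `generate_reply` function is a hypothetical stand-in for a real LLM call (it returns a canned string so the example runs offline), and the two checks are made-up examples of "expected behaviors", not ones taken from the course.

```python
import re


def generate_reply(question: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned answer."""
    return "You can reset your password from the Settings page."


def test_reply_is_not_empty():
    # The most basic expectation: the model said something.
    reply = generate_reply("How do I reset my password?")
    assert reply.strip(), "model returned an empty reply"


def test_reply_does_not_leak_email_addresses():
    # A guardrail-style check: fail if the reply contains an email address.
    reply = generate_reply("How do I reset my password?")
    assert not re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", reply)
```

Saved in a file such as `test_llm.py` (the filename is arbitrary), these run with `pytest` like any other unit test. In a real setup, `generate_reply` would call the model, which is exactly the external call that conventional unit tests try to avoid.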

There were also insights on using LLMs for evaluations, showcasing different perspectives on best practices.

A great insight I learned from one of the instructors is to always try to turn what you're evaluating into a binary classification. This makes it a lot easier to implement, reason about and evaluate in isolation.
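To make the binary framing concrete, here is a small sketch under my own assumptions: the check (`mentions_refund_policy`) and the sample outputs are invented for illustration. The idea is that each criterion becomes a function returning `True` or `False`, so a whole batch of outputs reduces to a single pass rate.

```python
from typing import Callable


def mentions_refund_policy(output: str) -> bool:
    """Hypothetical binary check: does the output mention refunds at all?"""
    return "refund" in output.lower()


def pass_rate(outputs: list[str], check: Callable[[str], bool]) -> float:
    """Fraction of outputs for which the binary check returns True."""
    if not outputs:
        return 0.0
    return sum(check(output) for output in outputs) / len(outputs)


# Made-up model outputs, purely for demonstration.
sample_outputs = [
    "Refunds are available within 30 days.",
    "Please contact support for help.",
    "Our refund policy covers unused items.",
    "You can request a refund from your account page.",
]

print(pass_rate(sample_outputs, mentions_refund_policy))  # 0.75
```

Because each check answers one yes/no question, a failing case is easy to inspect on its own, and new criteria can be added as separate functions.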

Just use APIs (or No GPU before PMF)

While it’s intellectually stimulating to learn about setting up infrastructure for training and inference, many instructors emphasized the practicality of using cloud providers like Replicate and OpenAI for customer projects. Using APIs is great for prototyping and trying things out. Setting up your own infrastructure makes sense only if you have strict data privacy and security requirements.

Conclusion

This course was incredibly insightful and practical, providing a broad range of perspectives and hands-on experiences in the world of LLMs. The private Discord channel is a goldmine. That said, the course lacks structured, guided homework assignments that we could use to practice and assess our learning.

Controlled Agentic Workflows Are All You Need


Understanding AI Agents and Agentic Workflows

My introduction to the concept of "AI agents" began with Reinforcement Learning (RL), a field where agents learn by interacting with and observing their environment to maximize a reward function. Some of the most prominent examples of RL-based AI agents include AlphaGo and self-driving cars.

Despite the success of these applications, developing a reward function that effectively guides RL agents towards their objectives is a significant challenge. Recent advancements in Large Language Models (LLMs) and Large Multi-modal Models (LMMs) have shifted the focus towards LLM/LMM-powered agents. Lilian Weng from OpenAI provides an excellent overview of LLM-powered agents in her blog post. For further motivation, Andrej Karpathy explained why you should work on AI agents in his talk.

Andrew Ng offers a compelling analogy for understanding LLMs: generating the next token is like writing an essay in one pass, whereas AI agents iteratively refine the output in a loop. He elaborates on this in his talk. Ng is optimistic about AI agents, as highlighted in his recent LinkedIn post, where he notes a preference for discussing agentic workflows over AI agents due to the reduced likelihood of marketing jargon.

From Unbounded AI Agents to Bounded Agentic Workflows

The initial surge in popularity for AI agent frameworks such as AutoGPT and BabyAGI, driven by the rise of LLMs, was short-lived. These frameworks struggled with overly general and open-ended tasks, leading to a decline in interest.

In a latent.space podcast, Mike Conover of Brightwave articulated the limitations of unbounded agentic behaviors:

"I don't think that unbounded agentic behaviors are useful. Instead, a useful LLM system is more like a finite state machine where the behavior of the system occupies one of many different behavioral regimes, making decisions about which state to occupy next to achieve the goal."

This perspective underscores the non-deterministic nature of LLM systems, which contrasts with the predictability of traditional coded systems. In real-world applications, reliability is paramount, necessitating controllable AI agents.
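The finite-state-machine framing can be sketched in a few lines. Everything below is an illustrative assumption rather than anything from the podcast: the state names, the transition table, and `choose_next`, which stands in for the LLM's decision about which state to occupy next. The point is that the surrounding code, not the model, decides which transitions are legal.

```python
from enum import Enum, auto


class State(Enum):
    PLAN = auto()
    RETRIEVE = auto()
    ANSWER = auto()
    DONE = auto()


# The only transitions the workflow permits (hypothetical example).
ALLOWED = {
    State.PLAN: {State.RETRIEVE, State.ANSWER},
    State.RETRIEVE: {State.ANSWER},
    State.ANSWER: {State.DONE},
    State.DONE: set(),
}


def choose_next(state: State, context: dict) -> State:
    """Hypothetical stand-in for an LLM choosing the next state."""
    if state is State.PLAN:
        return State.RETRIEVE if context["needs_lookup"] else State.ANSWER
    if state is State.RETRIEVE:
        return State.ANSWER
    return State.DONE


def run(context: dict, max_steps: int = 10) -> list[State]:
    """Run the workflow, rejecting any transition outside ALLOWED."""
    state = State.PLAN
    visited = [state]
    for _ in range(max_steps):  # hard bound on the number of steps
        if state is State.DONE:
            break
        proposed = choose_next(state, context)
        if proposed not in ALLOWED[state]:
            raise ValueError(f"illegal transition {state} -> {proposed}")
        state = proposed
        visited.append(state)
    return visited


print([s.name for s in run({"needs_lookup": True})])
# ['PLAN', 'RETRIEVE', 'ANSWER', 'DONE']
```

Even if the model's choice is non-deterministic, the transition table and the step limit keep the overall behavior within a known set of paths.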

The Spectrum of Autonomy

LangChain illustrates the varying levels of autonomy for AI agents and agentic workflows, emphasizing the importance of controllability and reliability in practical applications.

LangChain's Levels of Autonomy (source: https://blog.langchain.dev/what-is-an-agent/)

2024 is the Year of AI Agents as Finite State Machines and Workflows

As a professional AI engineer, I've been closely following various startups in the AI agent space. A common trend I've noticed is the modeling of AI agents as Finite State Machines or through more deterministic workflows. Here are a few notable examples:

LangChain is also heading in this direction, with the launch of LangGraph Cloud. They are actively educating the industry on creating reliable agents. Lance Martin's presentation on Building and Testing Reliable Agents provides excellent insights into these developments. I wish this had been available when I first started out.

TL;DW YouTube video summarizer with Python and Ollama


Problem

I personally subscribe to many educational channels on YouTube and every day there are so many new videos to watch. While I appreciate the authors putting out long-form videos, it's really hard for me to watch them all and internalize the knowledge, even if I try to increase the playback speed and multitask. I just want to be able to extract and learn the gist of the YouTube videos without staring at the screen for hours.

Idea

I have been playing around with Large Language Models (LLMs) and thought I could build a small app to summarize YouTube videos. There are already some YouTube video summarizers out there but I want something that is completely free (in both speech and beer) and gives me better privacy. I call this project "TL;DW" (Too long; didn't watch).

Tech stack

Since this is just a side project, I would like to keep it simple and avoid frameworks (nothing against LLM frameworks, though), using just Python and a few libraries. I will use Ollama to run LLMs locally.

High level flow

Youtube summarizer flow

Here's the high-level flow of the YouTube summarizer. We first download a YouTube video with Pytube. Then we transcribe the downloaded video to text. Lastly, we summarize the text using an LLM that runs locally with Ollama. After the initial setup, everything except the YouTube video download should run locally.

Pre-requisites

Install Ollama on your machine.

I use OpenAI's Whisper for transcribing videos. It requires the command line tool ffmpeg to be installed. I am using macOS Sonoma on an Apple Silicon (M1 Pro) chip so I installed it with my favorite package manager, brew.

brew install ffmpeg

If you don't use macOS, you can find the command for the other mainstream operating systems and package managers in Whisper's GitHub README.

Python packages

Here's my requirements.txt.

ollama
openai-whisper
pytube
rich

I have mentioned all these packages earlier except rich which adds beautiful output formatting and I like it better than Python's built-in Pretty Print.

Install the dependencies:

pip install -r requirements.txt

Import statements

As with any non-trivial Python code, we need to import the names that we will use later on:

from pytube import YouTube
from rich import print
import whisper

Download a YouTube video

Next we will download a YouTube video with Pytube. The code is pretty self-explanatory. By default, Pytube will set the download filename. We could extract the output filename but for simplicity's sake we will explicitly set the output filename so that we can easily load the filename into the transcriber later.

youtube_link = "https://www.youtube.com/watch?v=fqVLjtvWgq8" # Andrej Karpathy on Why you should work on AI agents
output_filename = "video.mp4" # you can name this anything you want

# Download an MP4 version of the video
YouTube(youtube_link).streams.filter(
    progressive=True, file_extension="mp4"
).first().download(filename=output_filename)

Transcribe the downloaded video

Whisper makes it very easy to transcribe a video. Here I'm using the base model. You can refer to https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages for the list of available models. Generally, the bigger the model, the more accurate the transcription, but it runs slower and needs more VRAM.

model = whisper.load_model("base")
result = model.transcribe(output_filename)
print(result["text"])

Depending on the video length, this may take a while to run. The YouTube video above is 6 minutes and 30 seconds long and the transcription took about 40 seconds on my local machine.
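If you want to measure this on your own machine, a small timing helper is enough. This is a generic sketch, not part of Whisper or of the original script; `timed` is a name I made up, and the demo call uses `sum` so the snippet runs without any model installed.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the example is self-contained.
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total} took {seconds:.3f}s")
```

With the transcription code above, the equivalent call would be `result, seconds = timed(model.transcribe, output_filename)`.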

Summarize the transcript

Ollama is not an LLM per se. It is a piece of software that lets you run LLMs locally. Therefore you will need to pull a model before you can start using it for generation.

You can find all the models supported by Ollama at https://ollama.com/library. I generally use 7B or 8B models (i.e. models with 7 or 8 billion parameters) on my local machine. You can experiment with the bigger models if you have better GPUs with more VRAM. At the time of writing (June 2024), here are some of the most popular open 7B/8B LLMs (in no particular order):

  1. Meta's Llama 3 8B
  2. Mistral AI's Mistral 7B
  3. Alibaba's Qwen 2 7B
  4. Google's Gemma 7B

Let's use Meta's Llama 3 8B (you can easily substitute it with your favorite LLM). You can pull the model using the ollama CLI:

ollama pull llama3

Meta's Llama 3 comes with 8B and 70B models and unfortunately my MacBook Pro is not powerful enough to run a 70B model. Note that we don't need to specify 8b in the command because Ollama by default installs the 8b version for llama3.

However, I generally prefer to specify the model version to make it explicit and avoid confusion.

ollama pull llama3:8b

Alternatively, you can use Ollama's Python API to pull the model:

import ollama

model_name = 'llama3:8b'
ollama.pull(model_name)

My recommendation is to use Ollama CLI if you're just getting started. Once you need to pull models dynamically, you can then try to use the Ollama Python API for that purpose.

Now that we have the llama3 model on our local machine, we can start using it to generate a summary of the video transcription.

chat = ollama.chat(
    model=model_name,
    messages=[
        {
            'role': 'user',
            'content': 'Summarize the following text in bullet points.\n\n' + result['text']
        }
    ]
)

This asks the llama3 model to summarize the transcription in bullet points. It is a very simple prompt for demonstration purposes, but the point is that you are now free to craft your own.

On my machine it took approximately 25 seconds.

And this is what I got:

print(chat['message']['content'])
* The speaker was recruited to talk about AI agents, which they have a personal connection to due to an early
project at OpenAI.
* In 2016, the speaker worked on a project called World of Bits, which aimed to create AI agents that could perform
tasks using computers and keyboards. However, the technology wasn't mature enough, and the project didn't yield the
desired results.
* The speaker suggests that the right approach at the time would have been to focus on building language models
instead.
* Today, AI agents are "cool again," but with a different team and approach.
* The speaker thinks that AGI will involve many AI agents working together, rather than just one.
* They also caution that some demos may be easy to build, but turning them into products can take years (e.g.,
self-driving cars, PR).
* The speaker draws inspiration from neuroscience, suggesting that AI agents could benefit from understanding how
the brain works. For example, they propose that AI agents might need a "hippocampus" equivalent to record and
retrieve memories.
* The speaker believes that building AI agents is at the forefront of capability today, with many large labs like
OpenAI focused on other areas (e.g., training methods).
* They conclude by expressing excitement about what people are building and encouraging them to continue pushing
the boundaries.

Even with a very simple prompt, I was already able to generate the summary above and I didn't have to watch the video at all.
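If you want to go beyond the one-line prompt used here, one option is to wrap the prompt in a small function so it is easy to tweak. This is only a sketch: `build_summary_prompt`, its parameters, and the wording of the instructions are my own illustrative choices, not something Ollama requires.

```python
def build_summary_prompt(
    transcript: str,
    max_bullets: int = 5,
    audience: str = "a busy engineer",
) -> str:
    """Build a summarization prompt around a video transcript."""
    return (
        f"Summarize the following video transcript in at most {max_bullets} "
        f"bullet points for {audience}. "
        "Only use information that appears in the transcript.\n\n"
        f"Transcript:\n{transcript.strip()}"
    )


# Demo with a dummy transcript so the snippet runs on its own.
print(build_summary_prompt("Today we talk about AI agents.", max_bullets=3))
```

The returned string can then be passed as the `content` of the user message in the `ollama.chat` call, for example `build_summary_prompt(result['text'])`.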