GPIO stands for General-Purpose Input/Output. It's a set of programmable pins that you can use to connect and control various electronic components.
You can set each pin to either read signals (input) from things like buttons and sensors or send signals (output) to things like LEDs and motors. This lets you interact with and control the physical world using code!
A great resource for understanding pin numbering can be found at pinout.beagley.ai
This ongoing Docker Labs GenAI series explores the exciting space of AI developer tools. At Docker, we believe there is a vast scope to explore, openly and without the hype. We will share our explorations and collaborate with the developer community in real-time. Although developers have adopted autocomplete tooling like GitHub Copilot and use chat, there is significant potential for AI tools to assist with more specific tasks and interfaces throughout the entire software lifecycle. Therefore, our exploration will be broad. We will be releasing software as open source so you can play, explore, and hack with us, too.
Using new tools on the command line can be frustrating. Even if we are confident that we've found the right tool, we might not know how to use it.
Telling an agent to RT(F)M
A typical workflow might look something like the following.
Install tool.
Read the documentation.
Run the command.
Repeat.
Can we improve this flow using LLMs?
Install tool
Docker provides us with isolated environments to run tools. Instead of requiring that commands be installed, we have created minimal Docker images for each tool so that using the tool does not impact the host system. Leave no trace, so to speak.
Read the documentation
Man pages are one of the ways that authors of tools ship content about how to use that tool. This content also comes with standard retrieval mechanisms (the man tool). A tool might also support a command-line option like --help. Let's start with the idealistic notion that we should be able to retrieve usage information from the tool itself.
In this experiment, we've created two entry points for each tool. The first entry point is the obvious one. It is a set of arguments passed directly to a command-line program. The OpenAI-compatible description that we generate for this entry point is shown below. We are using the same interface for every tool.
{"name": "run_my_tool",
"description": "Run the my_tool command.",
"parameters":
{"type": "object",
"properties":
{"args":
{"type": "string",
"description": "The arguments to pass to my_tool"}}},
"container": {"image": "namespace/my_tool:latest"}}
The second entry point gives the agent the ability to read the man page and, hopefully, improve its ability to run the first entry point. The second entry point is simpler because it does only one thing: it asks a tool how to use it.
{"name": "my_tool_manual",
"description": "Read the man page for my_tool",
"container": {"image": "namespace/my_tool:latest", "command": ["man"]}}
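As a rough illustration of how an agent runtime might act on these two definitions, the sketch below turns a tool definition plus the model-supplied arguments into a docker run invocation. The function name build_docker_command, the use of --rm, and the choice to split the args string with shell-style quoting rules are all assumptions made for this example; the definitions above do not say how the command is assembled.

```python
import shlex


def build_docker_command(tool, arguments=None):
    """Return an argv list that runs a tool definition in a throwaway container.

    Assumes `tool` is a dict shaped like the definitions above and that
    `arguments` is the dict of parameters supplied by the model (hypothetical).
    """
    container = tool["container"]
    # --rm removes the container on exit, so nothing is left behind on the host.
    argv = ["docker", "run", "--rm", container["image"]]
    # An optional "command" (e.g. ["man"]) is passed right after the image name.
    argv += container.get("command", [])
    # The single "args" string is split the way a POSIX shell would split it.
    argv += shlex.split((arguments or {}).get("args", ""))
    return argv


run_tool = {
    "name": "run_my_tool",
    "container": {"image": "namespace/my_tool:latest"},
}
manual_tool = {
    "name": "my_tool_manual",
    "container": {"image": "namespace/my_tool:latest", "command": ["man"]},
}

print(build_docker_command(run_tool, {"args": "-o out.png 'my content'"}))
print(build_docker_command(manual_tool))
```

Returning a list rather than a single string means the command can be handed to a process launcher without going through a shell, which avoids a second round of quoting.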
Run the command
Let's start with a simple example. We want to use a tool called qrencode to generate a QR code for a link. We have used our image generation pipeline to package this tool into a minimal image for qrencode. We will now pass this prompt to a few different LLMs; we are using LLMs that have been trained for tool calling (e.g., GPT-4, Llama 3.1, and Mistral). Here's the prompt that we are testing:
Generate a QR code for the content https://github.com/docker/labs-ai-tools-for-devs/blob/main/prompts/qrencode/README.md. Save the generated image to qrcode.png.
If the command fails, read the man page and try again.
Note the optimism in this prompt. Because it's hard to predict what different LLMs have already seen in their training sets, and many command-line tools use common names for arguments, it's interesting to see what an LLM will infer before the man page is added to the context.
The output of the prompt is shown below. Grab your phone and check it out.
Repeat
When an LLM generates a description of how to run something, it will usually format that output in such a way that it will be easy for a user to cut and paste the response into a terminal:
qrencode -o qrcode.png 'my content'
However, if the LLM is generating tool calls, we'll see output that is instead formatted to be easier to run:
We respond to this by spinning up a Docker container.
Running the tool as part of the conversation loop is useful even when the command fails. In Unix, there are standard ways to communicate failures. For example, we have exit codes and stderr streams. This is how tools create feedback loops and correct our behavior while we're iterating at the terminal. This same mechanism can be used in a conversation loop involving an LLM.
To illustrate, here is another simple example. We'll try running a tool that writes ASCII art with the following prompt.
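To make the feedback loop concrete, here is a minimal sketch of how a failed command could be reported back into a conversation. The function name run_and_report and the simplified message shape (a dict with role and content) are assumptions for illustration, not the format used in the experiment. A child Python process stands in for a containerized tool so that the example runs without Docker.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command and describe the outcome in a form an LLM can react to."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode == 0:
        content = result.stdout
    else:
        # On failure, pass along the exit code and stderr so the model can
        # decide whether to retry, change arguments, or read the manual.
        content = (
            f"Command failed with exit code {result.returncode}.\n"
            f"stderr:\n{result.stderr}"
        )
    return {"role": "tool", "content": content}


# Stand-in for a containerized tool: a child process that fails the way a
# command-line tool would, writing to stderr and exiting non-zero.
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('could not load font\\n'); sys.exit(1)",
]
message = run_and_report(failing)
print(message["content"])
```

The point of the design is that the failure is not swallowed: the same signals a person would read at the terminal are placed in the context for the next model turn.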
Write ascii art with the message "Docker" using the font "helvetica". Read the man page for figlet if you detect any errors and then try again.
In our test, this did fail. However, it also described the apparent issue on the stderr output stream.
Error: error: could not load font helvetica
By including this message in the conversation loop, the assistant can suggest different courses of action. Different LLMs produced different results here. For example, Llama 3.1 gives instructions for how to install the missing font. On the other hand, GPT-4 re-ran the command, but only after having made the "executive" decision to try a different font.
I'm sorry, it seems the font Helvetica is not available. Let's try it with the default font.
Here is the output of the figlet command with the default font (term):
 ____             _
|  _ \  ___   ___| | _____ _ __
| | | |/ _ \ / __| |/ / _ \ '__|
| |_| | (_) | (__|   <  __/ |
|____/ \___/ \___|_|\_\___|_|
We are very early in understanding how to take advantage of this apparent capacity to try different approaches. But this is another reason why quarantining these tools in Docker containers is useful. It limits their blast radius while we encourage experimentation.
Results
We started by creating a pipeline to produce minimal Docker images for each tool. The set of tools was selected based on whether they have outputs useful for developer-facing workflows. We continue to add new tools as we think of new use cases. The initial set is listed below.
There was a set of initial problems with context extraction.
Missing manual pages
Only about 60% of the tools we selected have man pages. However, even in those cases, there are usually other ways to get help content. The following steps show the final procedure we used:
Try to run the man page.
Try to run the tool with the argument --help.
Try to run the tool with the argument -h.
Try to run the tool with --broken args and then read stderr.
Using this procedure, every tool in the list above eventually produced documentation.
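The fallback procedure above can be sketched as a short loop. This is an illustration under stated assumptions rather than the pipeline's actual code: find_help_text is a hypothetical name, and run is a caller-supplied function that executes an argv list (for example, inside the tool's container) and returns a (stdout, stderr) pair. The sketch also assumes help text arrives on stdout for the first three attempts, which is a simplification, since some tools print --help output to stderr.

```python
def find_help_text(tool, run):
    """Try each documentation source in order and return the first non-empty one.

    `run` is a caller-supplied function (hypothetical) that takes an argv list
    and returns a (stdout, stderr) tuple of strings.
    """
    attempts = [
        (["man", tool], "stdout"),
        ([tool, "--help"], "stdout"),
        ([tool, "-h"], "stdout"),
        # Deliberately invalid arguments: many tools print usage to stderr.
        ([tool, "--broken"], "stderr"),
    ]
    for argv, stream in attempts:
        stdout, stderr = run(argv)
        text = stdout if stream == "stdout" else stderr
        if text.strip():
            return argv, text
    return None, ""


def fake_run(argv):
    # Simulates a tool with no man page that only documents itself via -h.
    if argv == ["my_tool", "-h"]:
        return ("usage: my_tool [options]", "")
    return ("", "error")


print(find_help_text("my_tool", fake_run))
```

Passing the runner in as a parameter keeps the ordering logic separate from how commands are executed, so the same loop works against a container, a local process, or a test double like fake_run.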
Long manual pages
Limited context lengths impacted some of the longer manual pages, so it was still necessary to employ standard RAG techniques to summarize verbose man pages. Our tactic was to focus on descriptions of command-line arguments and sections that had sample usage. These had the largest impact on the quality of the agent's output. The structure of Unix man pages helped with the chunking, because we were able to rely on standard sections to chunk the content.
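One way to chunk on standard sections is sketched below. It assumes the man page has already been rendered to plain text, and that section headings are unindented, upper-case lines, which is the usual layout but not guaranteed for every page. The function names and the default list of sections to keep (SYNOPSIS, OPTIONS, EXAMPLES) are assumptions for illustration; the exact sections used in the experiment are not listed here.

```python
import re

# An unindented line made of upper-case letters, digits, spaces, or hyphens.
SECTION_HEADING = re.compile(r"^[A-Z][A-Z0-9 -]*$")


def split_sections(man_text):
    """Split rendered man page text into {heading: body}."""
    sections = {}
    current = None
    for line in man_text.splitlines():
        if SECTION_HEADING.match(line):
            current = line.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def select_sections(man_text, wanted=("SYNOPSIS", "OPTIONS", "EXAMPLES")):
    """Keep only the sections most likely to describe arguments and usage."""
    sections = split_sections(man_text)
    return {name: sections[name] for name in wanted if name in sections}


sample = """NAME
       my_tool - does something

OPTIONS
       -o FILE
              write output to FILE

SEE ALSO
       other_tool(1)
"""
print(select_sections(sample))
```

Filtering by heading before any summarization step keeps the argument descriptions and sample usage intact while dropping sections such as SEE ALSO that add little to the agent's context.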
Subcommands
For a small set of tools, it was necessary to traverse a tree of help menus. However, these were all relatively popular tools, and the LLMs we deployed already knew about this command structure. It's easy to check this out for yourself. Ask an LLM, for example: "What are the subcommands of Git?" or "What are the subcommands of Docker?" Maybe only popular tools get big enough that they start to be broken up into subcommands.
Summary
We should consider the active role that agents can play when determining how to use a tool. The Unix model has given us standards such as man pages, stderr streams, and exit codes, and we can take advantage of these conventions when asking an assistant to learn a tool. Beyond distribution, Docker also provides us with process isolation, which is useful when creating environments for safe exploration.
Whether or not an AI can successfully generate tool calls may also become a metric for whether or not a tool has been well documented.