This ongoing Docker Labs GenAI series explores the exciting space of AI developer tools. At Docker, we believe there is a vast scope to explore, openly and without the hype. We will share our explorations and collaborate with the developer community in real time. Although developers have adopted autocomplete tooling like GitHub Copilot and use chat, there is significant potential for AI tools to assist with more specific tasks and interfaces throughout the entire software lifecycle. Therefore, our exploration will be broad. We will be releasing software as open source so you can play, explore, and hack with us, too.
How do you turn a PDF into a set of images? A modern suggestion could be to get Claude or GPT to write you a CLI command for something like ImageMagick. Let’s follow that train of thought.
If we enter a prompt like “How do I turn a PDF into a bunch of images using ImageMagick,” we will indeed get something that looks like it might be a solution.
Imagemagick includes a command-line tool called convert which you can use to convert PDF files into images. Here's how you can do it:
1. Open your terminal
2. Change directory to where your PDF is located:
```
cd /path/to/your/pdf
```
3. Use the following command to convert the PDF into images (in this case JPEG):
```
convert -density 300 -quality 100 your_file.pdf output_name.jpg
```
The agent will then explain how convert works and finish with a suggestion to “please make sure that you have installed ImageMagick before running these commands.”
Instead of worrying about whether ImageMagick is installed, however, let’s use Docker as our tool provider and prompt it to go ahead and run the command. If you’re following along in this series, you’ll have seen that we are using Markdown files to mix together tools and prompts. Here’s the first prompt we tried:
---
tools:
- name: imagemagick
---
# prompt user
Use Imagemagick to convert the family.pdf file into a bunch of jpg images.
After executing this prompt, the LLM generated a tool call, which we executed in the Docker runtime, and it successfully converted family.pdf into nine .jpg files (my family.pdf file had nine pages).
We have given enough context to the LLM that it is able to plan a call to this ImageMagick binary. And, because this tool is available on Docker Hub, we don’t have to “make sure that ImageMagick is installed.” This would be the equivalent command if you were to use docker run directly:
# family.pdf must be located in your $PWD
docker run --rm -v "$PWD":/project --workdir /project vonwig/imagemagick:latest convert -density 300 -quality 100 family.pdf family.jpg
The tool ecosystem
How did this work? The process relied on two things:
Tool distribution and discovery (pulling tools into Docker Hub for distribution to our Docker Desktop runtime).
Automatic generation of Agent Tool interfaces.
When we first started this project, we expected that we’d begin with a small set of tools because the interface for each tool would take time to design. We thought we were going to need to bootstrap an ecosystem of tools that had been prepared to be used in these agent workflows.
However, we learned that we can use a much more generic approach. Most tools already come with documentation, such as command-line help, examples, and man pages. Instead of treating each tool as something special, we are using an architecture where an agent responds to failures by reading documentation and trying again (Figure 2).
We see a process of experimenting with tools that is not unlike what we, as developers, do on the command line. Try a command line, read a doc, adjust the command line, and try again.
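The try, read, adjust, retry loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind our prompt runner: `plan_command`, `run_tool`, and `read_docs` are hypothetical callables standing in for the LLM, the container runtime, and the documentation lookup.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToolResult:
    exit_code: int
    stdout: str
    stderr: str


def run_with_doc_retry(
    task: str,
    plan_command: Callable[[str, List[str]], str],  # hypothetical: asks the LLM for a command line
    run_tool: Callable[[str], ToolResult],          # hypothetical: runs the command in a container
    read_docs: Callable[[], str],                   # hypothetical: returns man page or --help text
    max_attempts: int = 3,
) -> Optional[ToolResult]:
    """Try a command; on failure, add the error and the docs to the context and retry."""
    context: List[str] = []
    for attempt in range(max_attempts):
        command = plan_command(task, context)
        result = run_tool(command)
        if result.exit_code == 0:
            return result
        # Feed the failure back, the way a developer rereads an error message.
        context.append(f"command failed: {command}\nstderr: {result.stderr}")
        if attempt == 0:
            # Read the documentation once, after the first failure.
            context.append(read_docs())
    return None
```

The function returns `None` when every attempt fails, which leaves the decision about what to do next (for example, rebuilding the image) to the caller.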
The value of this kind of looping has changed our expectations. Step one is simply pulling the tool into Docker Hub and seeing whether the agent can use it with nothing more than its out-of-the-box documentation. We are also pulling open source software (OSS) tools directly from nixpkgs, which gives us access to tens of thousands of different tools to experiment with.
Docker keeps our runtimes isolated from the host operating system, while the nixpkgs ecosystem and maintainers provide a rich source of OSS tools.
As expected, packaging agents still run into issues that force us to re-plan how tools are packaged. For example, the prompt we showed above might have generated the correct tool call on the first try, but the ImageMagick container failed on the first run with this terrible-looking error message:
function call failed call exited with non-zero code (1): Error: sh: 1: gs: not found
Fortunately, feeding that error back into the LLM resulted in the suggestion that convert needs another tool, called Ghostscript, to run successfully. Our agent was not able to fix this automatically today. However, we adjusted the image build slightly, and the latest version of the vonwig/imagemagick image no longer has this issue. This is an example of something we only need to learn once.
The LLM figured out convert on its own. But its agency came from the addition of a tool.
At Docker Labs, we’ve been exploring how LLMs can connect different parts of the developer workflow, bridging gaps between tools and processes. A key insight is that LLMs excel at fixing code issues when they have the right context. To provide this context, we’ve developed a process that maps out the codebase using linting violations and the structure of top-level code blocks.
By combining these elements, we teach the LLM to construct a comprehensive view of the code, enabling it to fix issues more effectively. By leveraging containerization, integrating these tools becomes much simpler.
Previously, my linting process felt a bit disjointed. I’d introduce an error, run Pylint, and receive a message that was sometimes cryptic, forcing me to consult Pylint’s manual to understand the issue. When OpenAI released ChatGPT, the process improved slightly. I could run Pylint, and if I didn’t grasp an error message, I’d copy the code and the violation into GPT to get a better explanation. Sometimes, I’d ask it to fix the code and then manually paste the solution back into my editor.
However, this approach still required several manual steps: copying code, switching between applications, and integrating fixes. How might we improve this process?
Docker’s AI Tools for Devs prompt runner is an architecture that allows us to integrate tools like Pylint directly into the LLM’s workflow through containerization. By containerizing Pylint and creating prompts that the LLM can use to interact with it, we’ve developed a system where the LLM can access the necessary tools and context to help fix code issues effectively.
Understanding the cognitive architecture
For the LLM to assist effectively, it needs a structured way of accessing and processing information. In our setup, the LLM uses the Docker prompt runner to interact with containerized tools and the codebase. The project context is extracted using tools such as Pylint and Tree-sitter that run against the project. This context is then stored and managed, allowing the LLM to access it when needed.
By having access to the codebase, linting tools, and the context of previous prompts, the LLM can understand where problems are, what they are, and which code fragments it needs to fix them. This setup replaces the manual process of finding issues and feeding them to the LLM with something automatic and more engaging.
Streamlining the workflow
Now, within my workflow, I can ask the assistant about code quality and violations directly. The assistant, powered by an LLM, has immediate access to a containerized Pylint tool and a database of my code through the Docker prompt runner. This integration allows the LLM to use tools to assist me directly during development, making the programming experience more efficient.
This approach helps us rethink how we interact with our tools. By enabling a conversational interface with tools that map code to issues, we’re exploring possibilities for a more intuitive development experience. Instead of manually finding problems and feeding them to an AI, we can turn the tools themselves into conversational partners that automatically detect issues, understand the context, and provide solutions.
Walking through the prompts
Our project is structured around a series of prompts that guide the LLM through the tasks it needs to perform. These prompts are stored in a Git repository, so they can be versioned, tracked, and shared. They form the backbone of the project, allowing the LLM to interact with tools and the codebase effectively. Each prompt corresponds to a specific task in the workflow, and Docker containers ensure a consistent environment for running tools and scripts.
Workflow steps
An immediate and existential challenge we encountered was that this class of problem has a lot of opportunities to overwhelm the context of the LLM. Want to read a source code file? It has to be small enough to read. Need to work on more than one file? Your realistic limit is three to four files at once. To solve this, we can instruct the LLM to automate its own workflow with tools, where each step runs in a Docker container.
The first four steps prepare the agent to extract the right context for fixing violations. Once the agent has that context, the LLM can fix the code issues in step 5.
1. Generate violations report using Pylint:
Run Pylint to produce a violation report.
2. Create a SQLite database:
Set up the database schema to store violation data and code snippets.
3. Generate and run INSERT statements:
Decouple violations from the range they represent.
Use a script to convert every violation and range from the report into SQL insert statements.
Run the statements against the database to populate it with the necessary data.
4. Index code in the database:
Generate an abstract syntax tree (AST) of the project with Tree-sitter (Figure 1).
Find all second-level nodes (Figure 2). In Python’s grammar, second-level nodes are statements inside of a module.
Index these top-level ranges into the database.
Populate a new table to store the source code at these top-level ranges.
5. Fix violations based on context:
Once the agent has gathered and indexed the necessary context, use prompts to instruct the LLM to query the database and fix the code issues (Figure 3).
Each step from 1 to 4 builds the foundation for step 5, where the LLM, with the proper context, can effectively fix violations. The structured preparation ensures that the LLM has all the information it needs to address code issues with precision.
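To make step 4 concrete, here is a minimal sketch of indexing top-level code blocks into SQLite. It uses Python's built-in `ast` module as a stand-in for Tree-sitter, and the table name `TOP_LEVEL_RANGES` and its columns are assumptions for illustration; the post does not show the real schema.

```python
import ast
import sqlite3

# Hypothetical schema: one row per top-level statement of a source file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS TOP_LEVEL_RANGES (
    ID INTEGER PRIMARY KEY,
    PATH TEXT NOT NULL,
    START_LINE INTEGER NOT NULL,
    END_LINE INTEGER NOT NULL,
    SOURCE TEXT NOT NULL
);
"""


def index_top_level_blocks(conn, path, source):
    """Store one row per top-level statement (function, class, import, ...)."""
    conn.executescript(SCHEMA)
    lines = source.splitlines()
    tree = ast.parse(source)
    count = 0
    for node in tree.body:  # direct children of the module node
        start, end = node.lineno, node.end_lineno
        snippet = "\n".join(lines[start - 1:end])
        conn.execute(
            "INSERT INTO TOP_LEVEL_RANGES (PATH, START_LINE, END_LINE, SOURCE) "
            "VALUES (?, ?, ?, ?)",
            (path, start, end, snippet),
        )
        count += 1
    conn.commit()
    return count
```

Storing the source text next to the line range means a later query can hand the LLM a whole function without reading the file again.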
Refining the context for LLM fixes
To understand how our system improves code fixes, let’s consider a specific violation flagged by Pylint. Say we receive a message that there’s a violation on line 60 of our code file block_listed_name.py:
From this Pylint violation, we know that the variable foo is a disallowed name. However, if we tried to ask the LLM to fix this issue based solely on this snippet of information, the response wouldn’t be as effective. Why? The LLM lacks the surrounding context — the full picture of the function in which this violation occurs.
This is where indexing the codebase becomes essential
Because we’ve mapped out the codebase, we can now ask the LLM to query the index and retrieve the surrounding code that includes the do_front function. The LLM can even generate the SQL query for us, thanks to its knowledge of the database schema. Once we’ve retrieved the full function definition, the LLM can work with a more complete view of the problem:
def do_front(front_filename, back_filename):
    """
    Front strategy: loop over front image,
    detect blue pixels there,
    substitute in pixels from back.
    Return changed front image.
    """
    foo = SimpleImage(front_filename)
    back = SimpleImage(back_filename)
    for y in range(foo.height):
        for x in range(foo.width):
            pixel = foo.get_pixel(x, y)
            # Detect blue pixels in front and replace with back pixels
            if pixel[2] > 2 * max(pixel[0], pixel[1]):
                back_pixel = back.get_pixel(x, y)
                foo.set_pixel(x, y, back_pixel)
    return foo
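The lookup that retrieves a block like the one above might resemble the following sketch. The table `TOP_LEVEL_RANGES` and its columns are hypothetical names chosen for illustration; in our workflow the LLM writes the query itself from the real schema.

```python
import sqlite3

# Hypothetical table holding one row per top-level code block.
SCHEMA = """
CREATE TABLE IF NOT EXISTS TOP_LEVEL_RANGES (
    PATH TEXT, START_LINE INTEGER, END_LINE INTEGER, SOURCE TEXT
);
"""

# Find the block whose line range contains the line of the violation.
FIND_ENCLOSING_BLOCK = """
SELECT SOURCE FROM TOP_LEVEL_RANGES
WHERE PATH = ? AND START_LINE <= ? AND END_LINE >= ?
ORDER BY START_LINE DESC
LIMIT 1
"""


def enclosing_block(conn, path, line):
    """Return the source of the top-level block containing `line`, or None."""
    conn.executescript(SCHEMA)
    row = conn.execute(FIND_ENCLOSING_BLOCK, (path, line, line)).fetchone()
    return row[0] if row else None
```

For the violation discussed here, the call would be `enclosing_block(conn, "naming_conventions/block_listed_name.py", 60)`.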
Now that the LLM can see the whole function, it’s able to propose a more meaningful fix. Here’s what it suggests after querying the indexed codebase and running the fix:
def do_front(front_filename, back_filename):
    """
    Front strategy: loop over front image,
    detect blue pixels there,
    substitute in pixels from back.
    Return changed front image.
    """
    front_image = SimpleImage(front_filename)
    back_image = SimpleImage(back_filename)
    for y in range(front_image.height):
        for x in range(front_image.width):
            pixel = front_image.get_pixel(x, y)
            # Detect blue pixels in front and replace with back pixels
            if pixel[2] > 2 * max(pixel[0], pixel[1]):
                back_pixel = back_image.get_pixel(x, y)
                front_image.set_pixel(x, y, back_pixel)
    return front_image
Here, the variable foo has been replaced with the more descriptive front_image, making the code more readable and understandable. The key step was providing the LLM with the correct level of detail — the top-level range — instead of just a single line or violation message. With the right context, the LLM’s ability to fix code becomes much more effective, which ultimately streamlines the development process.
Remember, all of this information is retrieved and indexed by the LLM itself through the prompts we’ve set up. Through this series of prompts, we’ve reached a point where the assistant has a comprehensive understanding of the codebase.
At this stage, not only can I ask for a fix, but I can even ask questions like “what’s the violation at line 60 in naming_conventions/block_listed_name.py?” and the assistant responds with:
On line 60 of naming_conventions/block_listed_name.py, there's a violation: Disallowed name 'foo'. The variable name 'foo' is discouraged because it doesn't convey meaningful information about its purpose.
Although Pylint has been our focus here, this approach points to a new conversational way to interact with many tools that map code to issues. By integrating LLMs with containerized tools through architectures like the Docker prompt runner, we can enhance various aspects of the development workflow.
We’ve learned that combining tool integration, cognitive preparation of the LLM, and a seamless workflow can significantly improve the development experience, because the LLM can use tools to help directly while you are developing.
To follow along with this effort, check out the GitHub repository for this project.
We are thrilled to announce that Granite models, IBM’s family of open source and proprietary models built for business, as well as Red Hat InstructLab model alignment tools, are now available on Docker Hub.
Now, developer teams can easily access, deploy, and scale applications using IBM’s AI models specifically designed for developers.
This news will be officially announced during the AI track of the keynote at IBM TechXchange on October 22. Attendees will get an exclusive look at how IBM’s Granite models on Docker Hub accelerate AI-driven application development across multiple programming languages.
Why Granite on Docker Hub?
With a principled approach to data transparency, model alignment, and security, IBM’s open source Granite models represent a significant leap forward in natural language processing. The models are available under an Apache 2.0 license, empowering developer teams to bring generative AI into mission-critical applications and workflows.
Granite models deliver superior performance in coding and targeted language tasks at lower latencies, all while requiring a fraction of the compute resources and reducing the cost of inference. This efficiency allows developers to experiment, build, and scale generative AI applications both on-premises and in the cloud, all within departmental budgetary limits.
Here’s what this means for you:
Simplified deployment: Pull the Granite image from Docker Hub and get up and running in minutes.
Scalability: Docker offers a lightweight and efficient method for scaling artificial intelligence and machine learning (AI/ML) applications. It allows you to run multiple containers on a single machine or distribute them across different machines in a cluster, enabling horizontal scalability.
Flexibility: Customize and extend the model to suit your specific needs without worrying about underlying infrastructure.
Portability: By creating Docker images once and deploying them anywhere, you eliminate compatibility problems and reduce the need for configurations.
Community support: Leverage the vast Docker and IBM communities for support, extensions, and collaborations.
In addition to the IBM Granite models, Red Hat also made the InstructLab model alignment tools available on Docker Hub. Developers using InstructLab can adapt pre-trained LLMs using far less real-world data and computing resources than alternative methodologies. InstructLab is model-agnostic and can be used to fine-tune any LLM of your choice by providing additional skills and knowledge.
With IBM Granite AI models and InstructLab available on Docker Hub, Docker and IBM enable easy integration into existing environments and workflows.
Getting started with Granite
You can find the following images available on Docker Hub:
InstructLab: Ideal for desktop or Mac users looking to explore InstructLab, this image provides a simple introduction to the platform without requiring specialized hardware. It’s perfect for prototyping and testing before scaling up.
Granite-7b-lab: This image is optimized for model serving and inference on desktop or Mac environments, using the Granite-7B model. It allows for efficient and scalable inference tasks without needing a GPU, perfect for smaller-scale deployments or local testing.
How to pull and run IBM Granite images from Docker Hub
Follow these steps to pull and run an IBM Granite image using Docker and the CLI. You can follow similar steps for the Red Hat InstructLab images.
Authenticate to Docker Hub
Run docker login and enter your Docker username and password when prompted.
Pull the IBM Granite Image
Pull the IBM Granite image from Docker Hub with docker pull. The following image is available:
redhat/granite-7b-lab-gguf: For Mac/desktop users with no GPU support
Run the Image in a Container
Start a container with the IBM Granite image. The container can be started in two modes: CLI (default) and server.
To start the container in CLI mode, run the following command:
docker run --ipc=host -it redhat/granite-7b-lab-gguf
This command opens an interactive bash session within the container, allowing you to use the tools.
To run the container in server mode, run the following command:
docker run --ipc=host -it redhat/granite-7b-lab-gguf -s
Granite on Docker Hub will be officially announced at the IBM TechXchange Conference, which will be held October 21-24 in Las Vegas. Our head of technical alliances, Eli Aleyner, will show a live demonstration at the AI track of the keynote during IBM TechXchange. Oleg Šelajev, Docker’s staff developer evangelist, will show how app developers can test their GenAI apps with local models. Additionally, you’ll learn how Docker’s collaboration with Red Hat is improving developer productivity.
The availability of Granite on Docker Hub marks a significant milestone in making advanced AI models accessible to all. We’re excited to see how developer teams will harness the power of Granite to innovate and solve complex challenges.
Stay anchored for more updates, and as always, happy coding!
LLMs are now quite good at transforming data. For example, we were recently working with some data generated by the Pylint tool. This tool generates big arrays of code violations.
Here’s an example showing the kind of data that gets returned.
During this session with our AI assistant, we decided that it would be helpful to create a database and insert the data to make it easier for the AI to analyze (LLMs are very good at writing SQL). As is now our habit, we wrote a quick prompt to see if the assistant could generate the SQL:
1. Read the json data from /thread/violations.json
2. For each element in the array, transform each element into two SQL INSERT statements.
* The first should insert columns PATH, START_LINE, END_LINE, START_COLUMN, END_COLUMN into a table named RANGES.
* The second should insert columns MESSAGE, TYPE, RANGE, and VIOLATION_ID into a table named VIOLATIONS.
LLMs are obviously good at tasks of this kind, and this was no exception. Our prompt engine had been augmented with a function to read local files but, besides that, this was a pretty straightforward prompt (we used GPT-3.5). The LLM responded with the correct INSERT statements.
INSERT INTO RANGES (PATH, START_LINE, END_LINE, START_COLUMN, END_COLUMN)
VALUES ('src/app.py', 1, 1, 0, 13);
INSERT INTO VIOLATIONS (MESSAGE, TYPE, RANGE, VIOLATION_ID)
VALUES ('Missing module docstring', 'convention', (SELECT ID FROM RANGES WHERE PATH='src/app.py' AND START_LINE=1 AND START_COLUMN=0), 'C0114');
...
...
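The subquery in the second statement only works if RANGES has an ID column that the database assigns on insert. Below is a minimal sketch of a schema that would accept these statements, using Python's built-in `sqlite3` module. The column types and the `INTEGER PRIMARY KEY` choice are assumptions, since the schema itself is not shown here.

```python
import sqlite3

# Assumed schema: SQLite assigns RANGES.ID automatically, so the subquery
# in the VIOLATIONS insert can look it up again.
SCHEMA = """
CREATE TABLE RANGES (
    ID INTEGER PRIMARY KEY,
    PATH TEXT, START_LINE INTEGER, END_LINE INTEGER,
    START_COLUMN INTEGER, END_COLUMN INTEGER
);
CREATE TABLE VIOLATIONS (
    MESSAGE TEXT, TYPE TEXT,
    "RANGE" INTEGER REFERENCES RANGES(ID),
    VIOLATION_ID TEXT
);
"""

# The two statements generated by the LLM, as shown above.
STATEMENTS = """
INSERT INTO RANGES (PATH, START_LINE, END_LINE, START_COLUMN, END_COLUMN)
VALUES ('src/app.py', 1, 1, 0, 13);
INSERT INTO VIOLATIONS (MESSAGE, TYPE, RANGE, VIOLATION_ID)
VALUES ('Missing module docstring', 'convention',
        (SELECT ID FROM RANGES WHERE PATH='src/app.py' AND START_LINE=1 AND START_COLUMN=0),
        'C0114');
"""


def load(statements):
    """Create an in-memory database and run the generated statements."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript(statements)
    return conn


conn = load(STATEMENTS)
rows = conn.execute(
    'SELECT v.VIOLATION_ID, r.PATH, r.START_LINE '
    'FROM VIOLATIONS v JOIN RANGES r ON v."RANGE" = r.ID'
).fetchall()
```

Joining the two tables back together is a quick way to check that every violation points at a real range.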
We’re starting to get accustomed to this kind of capability, so the result wasn’t too surprising. However, what about the context window here? It really doesn’t make a lot of sense to pass all of this data to the LLM, especially if this is a task that we’ll need to do continuously. Plus, this is really not how a programmer would have solved the problem.
Programmers would write programs.
So, instead of asking the LLM to do a thing, we should try asking it to write a program to do that same thing. Starting with the same prompt, let’s prefix it with “Please write a JavaScript program to …” In other words, let’s ask the LLM to describe how it would accomplish the task in JavaScript, and hopefully automate itself out of the loop. Here’s the adjusted prompt:
Write a javascript function that reads json from a file named `/thread/violations.json`.
For each element in the array, transform each element into two SQL INSERT statements.
* The first should insert columns PATH, START_LINE, END_LINE, START_COLUMN, END_COLUMN into a table named RANGES.
* The second should insert columns MESSAGE, TYPE, RANGE, and VIOLATION_ID into a table named VIOLATIONS.
and write them to the file /thread/insert.sql
Now execute the javascript code in a container.
Tool code
Current LLMs can do tasks of this kind. Here’s a sample of the output generated by GPT-3.5. A quick scan will convince many of you that this is probably going to do the trick. However, if our prompt engine is already running in Docker, we might as well just run it:
const fs = require('fs');
const crypto = require('crypto');

function getRandomId() {
  return crypto.randomBytes(16).toString('hex');
}

function escapeString(str) {
  return str.replace(/'/g, "''");
}

let data = JSON.parse(fs.readFileSync('/thread/violations.json', 'utf8'));
let sqlStatements = '';
data.forEach((violation) => {
  let id = getRandomId();
  sqlStatements += `INSERT INTO RANGES(ID, PATH, START_LINE, END_LINE, START_COLUMN, END_COLUMN) VALUES('${id}', '${escapeString(violation.path)}', ${violation.line}, ${violation.endLine}, ${violation.column}, ${violation.endColumn});\n`;
  sqlStatements += `INSERT INTO VIOLATIONS(MESSAGE, TYPE, RANGE) VALUES('${escapeString(violation.message)}', '${escapeString(violation.type)}', '${id}');\n`;
});
fs.writeFileSync('/thread/insert.sql', sqlStatements);
Running the tool
You may not have noticed, but at the end of our prompt, we added a final instruction: “Now execute the javascript code in a container.” This is a nice addition to our session, because it means we get to see the results.
This is also where tool calling comes back into the picture. To give our AI the capacity to try running the program that it has just written, we have defined a new function to create an isolated runtime sandbox for trying out our new tool.
Here’s the agent’s new tool definition:
tools:
  - name: run-javascript-sandbox
    description: execute javascript code in a container
    parameters:
      type: object
      properties:
        javascript:
          type: string
          description: the javascript code to run
    container:
      image: vonwig/javascript-runner
      command:
        - "{{javascript|safe}}"
We’ve asked the AI assistant to generate a tool from a description of that tool. As long as the description of the tools doesn’t change, the workflow won’t have to go back to the AI to ask it to build a new tool version.
The role of Docker in this pattern is to create the sandbox for this code to run. This function really doesn’t need much of a runtime, so we give it a pretty small sandbox.
No access to a network.
No access to the host file system (does have access to isolated volumes for sharing data between tools).
No access to GPU.
Almost no access to software besides the Node.js runtime (no shell for example).
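As a rough illustration, the restrictions above map onto ordinary `docker run` flags. The sketch below only builds the argument list; the volume name `thread` is a hypothetical choice (the `/thread` mount path matches the file paths used in the prompt), and the real prompt runner may configure its containers differently.

```python
from typing import List


def sandbox_argv(image: str, code: str, volume: str = "thread") -> List[str]:
    """Build a `docker run` command line for a locked-down, throwaway container."""
    return [
        "docker", "run",
        "--rm",                      # discard the container when it exits
        "--network", "none",         # no network access
        "-v", f"{volume}:/thread",   # named volume only; no host paths are mounted
        # no --gpus flag, so no GPU devices are exposed
        image,
        code,                        # passed as the container command, as in the tool definition
    ]
```

For example, `sandbox_argv("vonwig/javascript-runner", "console.log(1)")` yields an argument list that could be handed to `subprocess.run`.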
The ability for one tool to create another tool is not just a trick. It has very practical implications for the kinds of workflows that we can build up because it gives us a way for us to control the volume of data sent to LLMs, and it gives the assistant a way to “automate” itself out of the loop.
Next steps
This example was a bit abstract, but in our next post, we will describe the practical scenarios that have driven us to look at this idea of prompts generating new tools. Most of the workflows we’re exploring are still just off-the-shelf tools like Pylint, SQLite, and Tree-sitter (which we embed using Docker, of course!). For example:
Use pylint to extract violations from my codebase.
Transform the violations into SQL and then send that to a new SQLite.
Find the most common violations of type error and show me the top level code blocks containing them.
However, you’ll also see that part of being able to author workflows of this kind is being able to recognize when you just need to add a custom tool to the mix.
Using new tools on the command line can be frustrating. Even if we are confident that we’ve found the right tool, we might not know how to use it.
Telling an agent to RT(F)M
A typical workflow might look something like the following.
Install tool.
Read the documentation.
Run the command.
Repeat.
Can we improve this flow using LLMs?
Install tool
Docker provides us with isolated environments to run tools. Instead of requiring that commands be installed, we have created minimal Docker images for each tool so that using the tool does not impact the host system. Leave no trace, so to speak.
Read the documentation
Man pages are one of the ways that authors of tools ship content about how to use that tool. This content also comes with standard retrieval mechanisms (the man tool). A tool might also support a command-line option like --help. Let’s start with the idealistic notion that we should be able to retrieve usage information from the tool itself.
In this experiment, we’ve created two entry points for each tool. The first entry point is the obvious one. It is a set of arguments passed directly to a command-line program. The OpenAI-compatible description that we generate for this entry point is shown below. We are using the same interface for every tool.
{"name": "run_my_tool",
"description": "Run the my_tool command.",
"parameters":
{"type": "object",
"properties":
{"args":
{"type": "string",
"description": "The arguments to pass to my_tool"}}},
"container": {"image": "namespace/my_tool:latest"}}
The second entry point gives the agent the ability to read the man page and, hopefully, improve its ability to run the first entry point. The second entry point is simpler, because it only does one thing (asks a tool how to use it).
{"name": "my_tool_manual",
"description": "Read the man page for my_tool",
"container": {"image": "namespace/my_tool:latest", "command": ["man"]}}
Run the command
Let’s start with a simple example. We want to use a tool called qrencode to generate a QR code for a link. We have used our image generation pipeline to package this tool into a minimal image for qrencode. We will now pass this prompt to a few different LLMs; we are using LLMs that have been trained for tool calling (e.g., GPT-4, Llama 3.1, and Mistral). Here’s the prompt that we are testing:
Generate a QR code for the content https://github.com/docker/labs-ai-tools-for-devs/blob/main/prompts/qrencode/README.md. Save the generated image to qrcode.png.
If the command fails, read the man page and try again.
Note the optimism in this prompt. Because it’s hard to predict what different LLMs have already seen in their training sets, and many command-line tools use common names for arguments, it’s interesting to see what each LLM will infer before the man page is added to the context.
The output of the prompt is shown below. Grab your phone and check it out.
Repeat
When an LLM generates a description of how to run something, it will usually format that output in such a way that it will be easy for a user to cut and paste the response into a terminal:
qrencode -o qrcode.png 'my content'
However, if the LLM is generating tool calls, we’ll see output that is instead formatted to be easier to run:
We respond to this by spinning up a Docker container.
Running the tool as part of the conversation loop is useful even when the command fails. In Unix, there are standard ways to communicate failures. For example, we have exit codes, and stderr streams. This is how tools create feedback loops and correct our behavior while we’re iterating at the terminal. This same mechanism can be used in a conversation loop involving an LLM.
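The feedback mechanism can be sketched as follows: run the command, then hand either stdout or the exit code plus stderr back to the conversation. This is an illustration, not our runner's code. The returned dictionary is a simplified message shape (real tool messages carry additional fields, such as a tool call ID), and the Python interpreter stands in for a containerized tool.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command and return a message the LLM can read, success or not."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode == 0:
        content = proc.stdout
    else:
        # Keep both signals: the exit code and whatever the tool wrote to stderr.
        content = (
            f"call exited with non-zero code ({proc.returncode}): "
            f"{proc.stderr.strip()}"
        )
    return {"role": "tool", "content": content}


# Demonstration: a stand-in command that fails the way a tool might.
failing = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('could not load font helvetica'); sys.exit(1)",
]
message = run_and_report(failing)
```

A failed run is returned as a normal message rather than raised as an exception, so the model gets the chance to react to it.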
To illustrate, here is another simple example. We’ll try running a tool that writes ASCII art with the following prompt.
Write ascii art with the message "Docker" using the font "helvetica". Read the man page for figlet if you detect any errors and then try again.
In our test, this did fail. However, it also described the apparent issue on the stderr output stream.
Error: error: could not load font helvetica
By including this message in the conversation loop, the assistant can suggest different courses of action. Different LLMs produced different results here. For example, Llama 3.1 gives instructions for how to install the missing font. On the other hand, GPT 4 re-ran the command, but only after having made the “executive” decision to try a different font.
I'm sorry, it seems the font Helvetica is not available. Let's try it with the default font.
Here is the output of the figlet command with the default font (term):
 ____             _
|  _ \  ___   ___| | _____ _ __
| | | |/ _ \ / __| |/ / _ \ '__|
| |_| | (_) | (__|   <  __/ |
|____/ \___/ \___|_|\_\___|_|
We are very early in understanding how to take advantage of this apparent capacity to try different approaches. But this is another reason why quarantining these tools in Docker containers is useful. It limits their blast radius while we encourage experimentation.
Results
We started by creating a pipeline to produce minimal Docker images for each tool. The set of tools was selected based on whether they have outputs useful for developer-facing workflows. We continue to add new tools as we think of new use cases. The initial set is listed below.
There was a set of initial problems with context extraction.
Missing manual pages
Only about 60% of the tools we selected have man pages. However, even in those cases, there are usually other ways to get help content. The following steps show the final procedure we used:
1. Try to read the tool’s man page.
2. Try to run the tool with the argument --help.
3. Try to run the tool with the argument -h.
4. Try to run the tool with --broken args and then read stderr.
Using this procedure, every tool in the list above eventually produced documentation.
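The fallback procedure could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the pipeline's actual code: the function names are hypothetical, and the first three attempts only accept output on stdout with a zero exit code, even though some real tools print help to stderr.

```python
import subprocess

def run_command(argv):
    """Run a command and return (exit_code, stdout, stderr)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=20)
    except (OSError, subprocess.TimeoutExpired):
        return 127, "", ""
    return proc.returncode, proc.stdout, proc.stderr

def find_help(tool, runner=run_command):
    """Try each documentation source in order and return the first that works."""
    attempts = [
        (["man", tool], "stdout"),
        ([tool, "--help"], "stdout"),
        ([tool, "-h"], "stdout"),
        ([tool, "--broken"], "stderr"),  # deliberately invalid flag; read the complaint
    ]
    for argv, stream in attempts:
        code, out, err = runner(argv)
        text = out if stream == "stdout" else err
        # For the last attempt a non-zero exit code is expected, so only
        # require that the stream contains something.
        if text.strip() and (code == 0 or stream == "stderr"):
            return " ".join(argv), text
    return None, ""
```

The `runner` parameter makes it possible to swap in a function that executes the command inside a container instead of on the host.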
Long manual pages
Limited context lengths impacted some of the longer manual pages, so it was still necessary to employ standard RAG techniques to summarize verbose man pages. Our tactic was to focus on descriptions of command-line arguments and sections that had sample usage. These had the largest impact on the quality of the agent’s output. The structure of Unix man pages helped with the chunking, because we were able to rely on standard sections to chunk the content.
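One simple way to chunk on standard sections is sketched below. It is a plain section selector rather than a full retrieval pipeline, and it assumes section headings are unindented, all-uppercase lines, which is a common man page convention but not a guarantee. The function names and the default section list are choices made for this example.

```python
import re

# Assumed convention: section headings are unindented, all-uppercase lines.
SECTION = re.compile(r"^[A-Z][A-Z ]+$")

def split_sections(man_text):
    """Split rendered man page text into a {heading: body} dictionary."""
    sections = {}
    current = "PREAMBLE"
    for line in man_text.splitlines():
        if not line.startswith(" ") and SECTION.match(line.strip()):
            current = line.strip()
            sections[current] = []
        else:
            sections.setdefault(current, []).append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}

def select_for_prompt(man_text, wanted=("SYNOPSIS", "OPTIONS", "EXAMPLES")):
    """Keep only the sections that describe arguments and sample usage."""
    sections = split_sections(man_text)
    return "\n\n".join(
        f"{name}\n{sections[name]}" for name in wanted if name in sections
    )

sample = (
    "NAME\n    demo - a tool\n"
    "SYNOPSIS\n    demo [-o FILE]\n"
    "DESCRIPTION\n    Long text.\n"
    "OPTIONS\n    -o FILE  write output\n"
)
picked = select_for_prompt(sample)
print(picked)
```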
Subcommands
For a small set of tools, it was necessary to traverse a tree of help menus. However, these were all relatively popular tools, and the LLMs we deployed already knew about this command structure. It’s easy to check this out for yourself. Ask an LLM, for example: “What are the subcommands of Git?” or “What are the subcommands of Docker?” Maybe only popular tools get big enough that they start to be broken up into subcommands.
Summary
We should consider the active role that agents can play when determining how to use a tool. The Unix model has given us standards such as man pages, stderr streams, and exit codes, and we can take advantage of these conventions when asking an assistant to learn a tool. Beyond distribution, Docker also provides us with process isolation, which is useful when creating environments for safe exploration.
Whether or not an AI can successfully generate tool calls may also become a metric for whether or not a tool has been well documented.
As adoption of artificial intelligence (AI) has surged, Secure by Design for AI has emerged as a critical paradigm. AI is being integrated into every aspect of our lives, from healthcare and finance to developer tools, autonomous vehicles, and smart cities, and its integration into critical infrastructure requires that we move quickly to understand and combat threats.
Necessity of Secure by Design for AI
AI’s rapid integration into critical infrastructure has accelerated the need to understand and combat potential threats. Security measures must be embedded into AI products from the beginning and evolve as the model evolves. This proactive approach ensures that AI systems are resilient against emerging threats and can adapt to new challenges as they arise. In this article, we will explore two contrasting examples: the developer industry and the healthcare industry.
Complexities of threat modeling in AI
AI introduces new challenges for building an accurate threat model. Until the data reaches a state where simple edit and validation checks can be programmed systematically, AI validation checks need to learn alongside the system and focus on data manipulation, corruption, and extraction.
Data poisoning: Data poisoning is a significant risk in AI, where the integrity of the data used by the system can be compromised. This can happen intentionally or unintentionally and can lead to severe consequences. For example, bias and discrimination in AI systems have already led to issues, such as the wrongful arrest of a man in Detroit due to a false facial recognition match. Such incidents highlight the importance of unbiased models and diverse data sets. Testing for bias and involving a diverse workforce in the development process are critical steps in mitigating these risks.
In healthcare, for example, bias may be simpler to detect. You can examine data fields based on areas such as gender, race, etc.
In development tools, bias is less clear-cut. Bias could result from the underrepresentation of certain development languages, such as Clojure. Bias may even result from code samples based on regional differences in coding preferences and teachings. In developer tools, you likely won’t have the information available to detect this bias. IP addresses may give you information about where a person is living currently, but not about where they grew up or learned to code. Therefore, detecting bias will be more difficult.
Data manipulation: Attackers can manipulate data sets with malicious intent, altering how AI systems behave.
Privacy violations: Without proper data controls, personal or sensitive information could unintentionally be introduced into the system, potentially leading to privacy violations. Establishing strong data management practices to prevent such scenarios is crucial.
Evasion and abuse: Malicious actors may attempt to alter inputs to manipulate how an AI system responds, thereby compromising its integrity. There’s also the potential for AI systems to be abused in ways developers did not anticipate. For example, AI-driven impersonation scams have led to significant financial losses, such as the case where an employee transferred $26 million to scammers impersonating the company’s CFO.
These examples underscore the need for controls at various points in the AI data lifecycle to identify and mitigate “bad data” and ensure the security and reliability of AI systems.
Key areas for implementing Secure by Design in AI
To effectively secure AI systems, implementing controls in three major areas is essential (Figure 1):
1. Data management
The key to data management is to understand what data needs to be collected to train the model, to identify the sensitive data fields, and to prevent the collection of unnecessary data. Data management also involves ensuring you have the correct checks and balances to prevent the collection of unneeded data or bad data.
In healthcare, sensitive data fields are easy to identify. Doctors’ offices often collect national identifiers, such as driver’s licenses, passports, and social security numbers. They also collect date of birth, race, and many other sensitive data fields. If the tool is aimed at helping doctors identify potential conditions faster based on symptoms, you would need anonymized data but would still need to collect certain factors such as age and race. You would not need to collect national identifiers.
In developer tools, sensitive data may not be as clearly defined. For example, an environment variable may be used to pass secrets or pass confidential information, such as the image name from the developer to the AI tool. There may be secrets in fields you would not suspect. Data management in this scenario involves blocking the collection of fields where sensitive data could exist and/or ensuring there are mechanisms to scrub sensitive data built into the tool so that data does not make it to the model.
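As a small illustration of a scrubbing mechanism, the sketch below redacts environment variables whose names look sensitive before they are passed along. The pattern list and the function name are assumptions made for this example. A name-based check like this will miss secrets stored in fields you would not suspect, so it should be seen as one layer rather than a complete control.

```python
import re

# Hypothetical list of name fragments that suggest a secret.
SUSPECT_NAME = re.compile(
    r"(SECRET|TOKEN|PASSWORD|PASSWD|API_?KEY|CREDENTIAL)", re.IGNORECASE
)

def scrub_env(env):
    """Return a copy of env with values of suspicious variables redacted."""
    return {
        name: ("[REDACTED]" if SUSPECT_NAME.search(name) else value)
        for name, value in env.items()
    }

cleaned = scrub_env({"GITHUB_TOKEN": "abc123", "PATH": "/usr/bin"})
print(cleaned)
```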
Data management should include the following:
Implementing checks for unexpected data: In healthcare, this process may involve “allow” lists for certain data fields to prevent collecting irrelevant or harmful information. In developer tools, it’s about ensuring the model isn’t trained on malicious code, such as unsanitized inputs that could introduce vulnerabilities.
Evaluating the legitimacy of users and their activities: In healthcare tools, this step could mean verifying that users are licensed professionals, while in developer tools, it might involve detecting and mitigating the impact of bot accounts or spam users.
Continuous data auditing: This process ensures that unexpected data is not collected and that the data checks are updated as needed.
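The “allow” list check described above might look something like the following sketch. The field names are hypothetical and chosen to match the healthcare example; a real list would come from your own data model.

```python
# Hypothetical allow list for the healthcare example.
ALLOWED_FIELDS = {"age", "race", "symptoms"}

def filter_record(record, allowed=ALLOWED_FIELDS):
    """Keep only allow-listed fields and report what was dropped."""
    kept = {name: value for name, value in record.items() if name in allowed}
    dropped = sorted(set(record) - set(allowed))
    return kept, dropped

kept, dropped = filter_record(
    {"age": 54, "symptoms": ["fatigue"], "passport_number": "X123"}
)
print(kept, dropped)
```

Returning the dropped field names makes it easier to audit what the check is filtering out and to decide whether the list needs updating.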
2. Alerting and monitoring
With AI, alerting and monitoring are imperative to ensuring the health of the data model. Controls must be both adaptive and configurable to detect anomalous and malicious activities. As AI systems grow and adapt, so too must the controls. Establish thresholds for data, automate adjustments where possible, and conduct manual reviews where necessary.
In a healthcare AI tool, you might set a threshold that new data must meet before it is surfaced, to help ensure its accuracy. For example, if patients begin reporting a new symptom that is believed to be associated with diabetes, you may not report this to doctors until it is reported by a certain percentage (for example, 15%) of total patients.
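A threshold check of this kind can be very small. The sketch below uses the 15% figure from the example as a default; the function name is hypothetical and the threshold would need to be configurable in practice.

```python
def should_surface(reporting_patients, total_patients, threshold=0.15):
    """Return True once enough patients report the symptom to surface it."""
    if total_patients <= 0:
        return False
    return reporting_patients / total_patients >= threshold

print(should_surface(10, 100))  # below the threshold
print(should_surface(20, 100))  # above the threshold
```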
In a developer tool, this might involve determining when new code should be incorporated into the model as a prompt for other users. The model would need to be able to log and analyze user queries and feedback, track unhandled or poorly handled requests, and detect new patterns in usage. Data should be analyzed for high frequencies of unhandled prompts, and alerts should be generated to ensure that additional data sets are reviewed and added to the model.
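One way to look for high frequencies of unhandled prompts is sketched below. The event format (`topic` and `handled` keys) and the `min_count` cutoff are assumptions made for this example, not part of any particular logging system.

```python
from collections import Counter

def unhandled_alerts(events, min_count=3):
    """List topics whose unhandled prompts reach min_count, most frequent first."""
    counts = Counter(e["topic"] for e in events if not e["handled"])
    return [(topic, n) for topic, n in counts.most_common() if n >= min_count]

events = [
    {"topic": "zig", "handled": False},
    {"topic": "zig", "handled": False},
    {"topic": "zig", "handled": False},
    {"topic": "python", "handled": True},
    {"topic": "clojure", "handled": False},
]
print(unhandled_alerts(events))
```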
3. Model tuning and maintenance
Producers of AI tools should regularly review and adjust AI models to ensure they remain secure. This includes monitoring for unexpected data, adjusting algorithms as needed, and ensuring that sensitive data is scrubbed or redacted appropriately.
For healthcare, model tuning may be more intensive. Results may be compared to published medical studies to ensure that patient conditions are in line with other baselines established across the world. Audits should also be conducted so that data from doctors with reported malpractice claims, or doctors whose medical licenses have been revoked, is scrubbed from the system and potentially compromised data sets do not influence the model.
In a developer tool, model tuning will look very different. You may look at hyperparameter optimization using techniques such as grid search, random search, and Bayesian optimization. You may study subsets of data; for example, you may perform regular reviews of the most recent data looking for new programming languages, frameworks, or coding practices.
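To make the first two techniques concrete, here is a minimal sketch of grid search and random search over a small parameter grid. The `toy_score` function and the parameter names are placeholders; in practice the score would come from evaluating a trained model on held-out data.

```python
import itertools
import random

def grid_search(score, grid):
    """Evaluate every combination in the grid and return (best_score, params)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if best is None or s > best[0]:
            best = (s, params)
    return best

def random_search(score, grid, trials, seed=0):
    """Evaluate a fixed number of randomly sampled combinations."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {name: rng.choice(values) for name, values in grid.items()}
        s = score(params)
        if best is None or s > best[0]:
            best = (s, params)
    return best

# Placeholder objective that peaks at learning_rate=0.1, depth=4.
def toy_score(params):
    return -((params["learning_rate"] - 0.1) ** 2) - (params["depth"] - 4) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_score, best_params = grid_search(toy_score, grid)
print(best_params)
```

Grid search is exhaustive and becomes expensive as the grid grows; random search trades completeness for a fixed evaluation budget.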
Model tuning and maintenance should include the following:
Perform data audits to ensure data integrity and that unnecessary data is not inadvertently being collected.
Review whether “allow” lists and “deny” lists need to be updated.
Regularly audit and monitor alerts for algorithms to determine if adjustments need to be made; consider the population of your user base and how the model is being trained when adjusting these parameters.
Ensure you have the controls in place to isolate data sets for removal if a source has become compromised; consider unique identifiers that allow you to identify a source without providing unnecessary sensitive data.
Regularly back up data models so you can return to a previous version without heavy loss of data if the source becomes compromised.
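The idea of isolating a compromised source by a unique identifier could be sketched as follows. The salted hash, the record layout, and the function names are assumptions made for this example; the point is only that records carry an opaque source identifier rather than sensitive details about the source.

```python
import hashlib

def source_id(source_name, salt="example-salt"):
    """Derive an opaque identifier so records need not store the source name."""
    digest = hashlib.sha256((salt + source_name).encode("utf-8")).hexdigest()
    return digest[:12]

def remove_source(records, compromised_id):
    """Drop every record from a compromised source and count what was removed."""
    kept = [r for r in records if r["source_id"] != compromised_id]
    return kept, len(records) - len(kept)

records = [
    {"source_id": source_id("clinic-a"), "value": 1},
    {"source_id": source_id("clinic-b"), "value": 2},
    {"source_id": source_id("clinic-a"), "value": 3},
]
kept, removed = remove_source(records, source_id("clinic-a"))
print(kept, removed)
```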
AI security begins with design
Security must be a foundational aspect of AI development, not an afterthought. By identifying data fields upfront, conducting thorough AI threat modeling, implementing robust data management controls, and continuously tuning and maintaining models, organizations can build AI systems that are secure by design.
This approach protects against potential threats and ensures that AI systems remain reliable, trustworthy, and compliant with regulatory requirements as they evolve alongside their user base.