Reading view

There are new articles available, click to refresh the page.

WatchThis: A Wearable Point-and-Ask Interface Powered by Vision-Language Models and XIAO ESP32S3 Sense

MIT Media Lab researchers Cathy Mengying Fang, Patrick Chwalek, Quincy Kuang, and Pattie Maes have developed WatchThis, a groundbreaking wearable device that enables natural language interactions with real-world objects through simple pointing gestures. Cathy conceived the idea for WatchThis during a one-day hackathon in Shenzhen, organized as part of MIT Media Lab’s “Research at Scale” initiative. Organized by Cedric Honnet and hosted by Southern University of Science and Technology and Seeed Studio, the hackathon provided the perfect setting to prototype this innovative device using components from the Seeed Studio XIAO ESP32S3 suite. By integrating Vision-Language Models (VLMs) with a compact wrist-worn device, WatchThis allows users to ask questions about their surroundings in real-time, making contextual queries as intuitive as pointing and asking.

Credit: Cathy Fang

Hardwares

The WatchThis project utilizes the following hardware components:

Credit: Cathy Fang

How the Project Works

WatchThis is designed to seamlessly integrate natural, gesture-based interaction into daily life. The wearable device consists of a watch with a rotating, flip-up camera attached to the back of a display. When the user points at an object of interest, the camera captures the area, and the device processes contextual queries based on the user’s gesture.

The interaction begins when the user flips up the watch body to reveal the camera, which then captures the area where the finger points at. The watch’s display shows a live feed from the camera, allowing precise aiming. When the user touches the screen, the device captures the image and pauses the camera feed. The captured RGB image is then compressed into JPG format and converted to base64, after which an API request is made to query the image.

The device uses these API calls to interact with OpenAI’s GPT-4o model, which accepts both text and image inputs. This allows the user to ask questions such as “What is this?” or “Translate this,” and receive immediate responses. The text response is displayed on the screen, overlaid on the captured image. After the response is shown for 3 seconds, the screen returns to streaming the camera feed, ready for the next command.

The software driving WatchThis is written in Arduino-compatible C++ and runs directly on the device. It is optimized for quick and efficient performance, with an end-to-end response time of around 3 seconds. Instead of relying on voice recognition or text-to-speech—which can be error-prone and resource-intensive—the system uses direct text input for queries. Users can further personalize their interactions by modifying the default query prompt through an accompanying WebApp served on the device, allowing tailored actions such as identifying objects, translating text, or requesting instructions.

Credit: Cathy Fang

Applications

Imagine strolling through a city and pointing at a building to learn its history, or identifying an exotic plant in a botanical garden with a mere gesture.

The device goes beyond simple identification, offering practical applications like real-time translation of, for example, menu items, which is a game-changer for travelers and language learners alike.

The research team has discussed even more exciting potential applications:

    • A “Remember this” function could serve as a visual reminder system, potentially aiding those who need to take medication regularly.
    • For urban explorers, a “How do I get there” feature could provide intuitive, spatially-aware navigation by allowing users to point at distant landmarks.
    • A “Zoom in on that” capability could offer a closer look at far-off objects without disrupting the user’s activities.
    • Perhaps most intriguingly, a “Turn that off” function could allow users to control smart home devices with a combination of voice commands and gestures, seamlessly integrating with IoT ecosystems.

While some of these features are still in conceptual stages, they paint a picture of a future where our interactions with the world around us are more intuitive, informative, and effortless than ever before.

Credit: Cathy Fang

Build Your Own WatchThis

Interested in building your own WatchThis wearable? Explore the open-source hardware and software components on GitHub and start creating today! Check out their paper below for full details.

End Note

Hey community, we’re curating a monthly newsletter centering around the beloved Seeed Studio XIAO. If you want to stay up-to-date with:

🤖 Cool Projects from the Community to get inspiration and tutorials
📰 Product Updates: firmware update, new product spoiler
📖 Wiki Updates: new wikis + wiki contribution
📣 News: events, contests, and other community stuff

Please click the image below👇 to subscribe now!

The post WatchThis: A Wearable Point-and-Ask Interface Powered by Vision-Language Models and XIAO ESP32S3 Sense appeared first on Latest Open Tech From Seeed.

Revolutionize Your Email Workflow with AI

We are happy to announce the release of Stalwart Mail Server v0.10.3, which introduces support for AI models —a powerful new feature now available to Enterprise Edition users as well as our GitHub and OpenCollective sponsors. With this feature, Stalwart Mail Server can be integrated with both self-hosted and cloud-based Large Language Models (LLMs), bringing advanced email processing capabilities like never before.

This integration allows you to use AI models for a variety of tasks, including enhanced spam filtering, threat detection, and intelligent email classification. Whether you choose to host your own models with LocalAI or leverage cloud-based services like OpenAI or Anthropic, this release provides the flexibility to incorporate cutting-edge AI into your email infrastructure.

Unlocking the Power of AI

With the introduction of AI model integration, Stalwart Mail Server can now analyze email content more deeply than traditional filters ever could. For instance, in the realm of spam filtering and threat detection, AI models are highly effective at identifying patterns and detecting malicious or unsolicited content. The system works by analyzing both the subject and body of incoming emails through the lens of an LLM, providing more accurate detection and filtering.

In addition to bolstering security, AI integration enhances email classification. By configuring customized prompts, administrators can instruct AI models to categorize emails based on their content, leading to more precise filtering and organization. This is particularly useful for enterprises managing a high volume of messages that span various topics and departments, as AI-driven filters can quickly and intelligently sort messages into categories like marketing, personal correspondence, or work-related discussions.

The flexibility of using either self-hosted or cloud-based AI models means that Stalwart can be tailored to your infrastructure and performance needs. Self-hosting AI models ensures full control over data and privacy, while cloud-based models offer ease of setup and access to highly optimized, continuously updated language models.

LLMs in Sieve Scripts

One of the most exciting features of this release is the ability for users and administrators to access AI models directly from Sieve scripts. Stalwart extends the Sieve scripting language by introducing the llm_prompt function, which allows users to send prompts and email content to the AI model for advanced processing.

For example, the following Sieve script demonstrates how an AI model can be used to classify emails into specific folders based on the content:

require ["fileinto", "vnd.stalwart.expressions"];

# Base prompt for email classification
let "prompt" '''You are an AI assistant tasked with classifying personal emails into specific folders.
Your job is to analyze the email's subject and body, then determine the most appropriate folder for filing.
Use only the folder names provided in your response.
If the category is not clear, respond with "Inbox".

Classification Rules:
- Family:
* File here if the message is signed by a Doe family member
* The recipient's name is John Doe
- Cycling:
* File here if the message is related to cycling
* File here if the message mentions the term "MAMIL"
- Work:
* File here if the message mentions "Dunder Mifflin Paper Company, Inc." or any part of this name
* File here if the message is related to paper supplies
* Only classify as Work if it seems to be part of an existing sales thread or directly related to the company's operations
- Junk Mail:
* File here if the message is trying to sell something and is not work-related
* Remember that John lives a minimalistic lifestyle and is not interested in purchasing items
- Inbox:
* Use this if the message doesn't clearly fit into any of the above categories

Analyze the following email and respond with only one of these folder names: Family, Cycling, Work, Junk Mail, or Inbox.
''';

# Prepare the base Subject and Body
let "subject" "thread_name(header.subject)";
let "body" "body.to_text";

# Send the prompt, subject, and body to the AI model
let "llm_response" "llm_prompt('gpt-4', prompt + '\n\nSubject: ' + subject + '\n\n' + body, 0.6)";

# Set the folder name
if eval "contains(['Family', 'Cycling', 'Work', 'Junk Mail'], llm_response)" {
fileinto "llm_response";
}

This example demonstrates how the llm_prompt function can be used to classify emails into different categories such as Family, Cycling, Work, or Junk Mail based on the content. The AI model analyzes the message’s subject and body according to the classification rules defined in the prompt and returns the most appropriate folder name. The email is then automatically filed into the correct folder, making it easier to organize incoming messages based on their content.

Self-Hosted or Cloud-Based

With this new feature, Stalwart Mail Server allows for seamless integration with both self-hosted and cloud-based AI models. If you prefer full control over your infrastructure, you can opt to deploy models on your own hardware using solutions like LocalAI. Self-hosting gives you complete ownership over your data and ensures compliance with privacy policies, but it may require significant computational resources, such as GPU acceleration, to maintain high performance.

Alternatively, you can integrate with cloud-based AI providers like OpenAI or Anthropic, which offer access to powerful, pretrained models with minimal setup. Cloud-based models provide cutting-edge language processing capabilities, but you should be aware of potential costs, as these providers typically charge based on the number of tokens processed. Whether you choose self-hosted or cloud-based models, Stalwart gives you the flexibility to tailor the AI integration to your specific needs.

Available for Enterprise Users and Sponsors

This exciting AI integration feature is exclusively available for Enterprise Edition users as well as GitHub and OpenCollective monthly sponsors. If you want to harness the full potential of AI-powered email processing in Stalwart Mail Server, upgrading to the Enterprise Edition or becoming a sponsor is a great way to access this feature and other advanced capabilities.

Try It Out Today!

The release of Stalwart Mail Server v0.10.3 marks a major milestone in our journey toward building intelligent, highly customizable email management solutions. By combining traditional email filtering with the power of LLMs, Stalwart gives you the tools to take your email infrastructure to the next level, enhancing security, organization, and automation in ways that were previously impossible. We’re excited to see how you’ll use this new feature to optimize your email workflows!

❌