Tag: human-robot interaction

  • MIT Develops Long-Term Memory System That Lets Robots Answer Where You Left Your Keys

    Imagine asking a robot, “Where did I leave my keys?” and getting an accurate, real-time answer. MIT researchers have created a new spatial memory framework called DAAAM (Describe Anything, Anywhere, Anytime, at Any Moment) that gives robots the ability to form and recall detailed mental models of large-scale environments. This breakthrough could transform how robots assist humans in factories, homes, and beyond.

    DAAAM combines advanced map representations with rich, language-based descriptions of the objects a robot encounters as it explores. The system runs fast enough for mobile robots to use in real time, answering complex queries in plain English with 21% to 53% higher accuracy than existing methods.

    “If we want robots to work side-by-side with humans and interact better with humans, they must speak the same language,” says Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics and lead researcher on the project. “The robot must be able to reason about time and space the same way humans do.”

    The framework bridges computer vision and robotic mapping. As a robot moves through an environment, DAAAM attaches detailed descriptions to objects—like noting that a red bicycle with a flat tire is in the bike rack outside the Stata Center. It stores this information in a 3D map-based representation arranged spatially, grouping objects into regions for efficient retrieval.
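
    As a rough illustration of the idea described above (language descriptions attached to objects, grouped by region for retrieval), here is a minimal Python sketch. It is not DAAAM's actual data structure: the names `MapObject` and `SpatialMemory`, the coordinates, and the "office" entry are all hypothetical, and plain substring matching stands in for whatever retrieval the real system uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MapObject:
    """One mapped object (hypothetical structure, for illustration only)."""
    description: str                       # free-text description of the object
    position: tuple[float, float, float]   # made-up 3D coordinates in the map frame
    region: str                            # name of the region the object belongs to


class SpatialMemory:
    """Toy store that groups objects by region so lookups can stay local."""

    def __init__(self):
        self._by_region = defaultdict(list)

    def add(self, obj):
        self._by_region[obj.region].append(obj)

    def in_region(self, region):
        # .get avoids creating an empty entry for unknown regions
        return list(self._by_region.get(region, []))

    def find(self, keyword):
        # Substring match is a stand-in assumption, not a real semantic search
        keyword = keyword.lower()
        return [
            obj
            for objs in self._by_region.values()
            for obj in objs
            if keyword in obj.description.lower()
        ]


memory = SpatialMemory()
memory.add(MapObject("red bicycle with a flat tire", (12.0, 3.5, 0.0),
                     "bike rack outside the Stata Center"))
memory.add(MapObject("set of keys on a desk", (2.0, 1.0, 0.8), "office"))

for obj in memory.find("bicycle"):
    print(f"{obj.description} -> {obj.region}")
```

    Grouping by region means a query about one place only has to scan that region's objects, which is one plausible reading of why the article calls the grouping efficient.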

    To overcome the speed limitations of existing annotation techniques, DAAAM aggregates nearby objects and uses an optimization method to select key frames—images with the clearest view of multiple objects—allowing the system to describe several items in parallel. This speeds up computation tenfold, making real-time performance possible.
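
    The article does not say which optimization method DAAAM uses to pick key frames. As a stand-in, the sketch below uses a simple greedy set-cover heuristic: repeatedly pick the frame that shows the most objects not yet covered. The function name `select_keyframes` and the frame and object labels are hypothetical.

```python
def select_keyframes(visible):
    """Greedy set cover (an assumed stand-in, not the published method).

    visible maps a frame id to the set of object ids clearly seen in it.
    Returns frame ids so that every object appears in at least one chosen frame.
    """
    uncovered = set().union(*visible.values())
    chosen = []
    while uncovered:
        # Sort frame ids so ties are broken deterministically
        best = max(sorted(visible), key=lambda f: len(visible[f] & uncovered))
        chosen.append(best)
        uncovered -= visible[best]
    return chosen


frames = {
    "f1": {"bike", "rack"},
    "f2": {"rack", "door", "sign"},
    "f3": {"sign"},
    "f4": {"bike"},
}
keyframes = select_keyframes(frames)
print(keyframes)
```

    In this toy input, two frames cover all four objects, so only two frames would need to be sent to a captioning step instead of four.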

    “We annotate every object only once, so our framework can run in very large-scale environments in real time,” explains lead author Nicolas Gorlo, an MIT graduate student. “And by clustering objects into regions, it can answer a wide range of queries about objects and locations.”

    The researchers used a large language model (LLM) that calls on various tools to retrieve specific information quickly, reducing hallucinations. For example, if asked about a sculpture near an MIT campus building, DAAAM can use a semantic search tool to retrieve information based on the word “sculpture” or a location-based tool to find the building.
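
    The tool-calling pattern can be sketched as a small registry of retrieval functions plus a dispatcher. Everything here is hypothetical: the tool names, the `{"tool": ..., "argument": ...}` call format, and the sample objects are assumptions for illustration, and keyword matching stands in for a real semantic search. In a real system the LLM would emit the tool call; here it is written by hand.

```python
# Made-up map contents, for illustration only
OBJECTS = [
    {"description": "bronze sculpture of a figure", "region": "lawn by Building A"},
    {"description": "red bicycle with a flat tire", "region": "bike rack"},
]


def semantic_search(keyword):
    """Stand-in for semantic search: case-insensitive substring match."""
    return [o for o in OBJECTS if keyword.lower() in o["description"].lower()]


def objects_in_region(region):
    """Stand-in for a location-based tool: match on the region name."""
    return [o for o in OBJECTS if region.lower() in o["region"].lower()]


# Registry of tools the language model is allowed to call
TOOLS = {
    "semantic_search": semantic_search,
    "objects_in_region": objects_in_region,
}


def run_tool_call(call):
    """Execute one structured tool call of the assumed form shown below."""
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](call["argument"])


hits = run_tool_call({"tool": "semantic_search", "argument": "sculpture"})
print(hits)
```

    Because the answer is assembled from retrieved map entries rather than generated freely, the model has less room to invent objects that were never observed.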

    Future work aims to expand DAAAM to capture significant events and incorporate confidence levels into responses. “Ultimately, we want to have robots that can help with any sort of tasks,” Gorlo says. “With this framework, we are trying to create the foundations to enable a generalist agent that can do anything you ask.”

    The research was presented at the Conference on Computer Vision and Pattern Recognition (CVPR) and funded by the U.S. Army Research Laboratory and the Office of Naval Research.

  • Two LLMs Team Up to Help Robots Interpret Vague Instructions and Prioritize What Matters

    Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a new method that uses two large language models (LLMs) to help robots understand ambiguous instructions and focus on key details. The approach, called Masked Inverse Reinforcement Learning (Masked IRL), cuts the amount of demonstration data needed to teach a robot nearly fivefold, while improving the robot’s ability to infer unspoken user preferences.

    Traditional robot training often requires either extensive physical demonstrations or detailed written instructions. Masked IRL automates the process: first, one LLM clarifies ambiguous prompts (e.g., turning “stay close” into “stay close to the surface of the table”) by comparing a user’s demonstration trajectory to the shortest possible path. Then a second LLM evaluates the environment and “masks” irrelevant details – such as a person leaning on a table – while highlighting critical ones like obstacles to avoid. The robot then uses these prioritized details to generate a safe motion plan.
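
    To make the masking step concrete, the sketch below scores a robot state using only the features a mask marks as relevant. This is an illustration under stated assumptions, not the published method: the article does not give the reward's form, so a linear reward is assumed, and the feature names, weights, and hard-coded mask are hypothetical (in Masked IRL the mask would come from the second LLM).

```python
def masked_reward(features, weights, mask):
    """Weighted sum over features whose mask entry is True (assumed linear form)."""
    return sum(
        weights[name] * value
        for name, value in features.items()
        if mask.get(name, False)
    )


# Hypothetical state features for a "stay close to the table" instruction
features = {
    "distance_to_table": 0.05,
    "distance_to_laptop": 0.40,
    "person_leaning_on_table": 1.0,
}
# Made-up weights: penalize distance from the table, reward clearance from the laptop
weights = {
    "distance_to_table": -2.0,
    "distance_to_laptop": 1.0,
    "person_leaning_on_table": 5.0,
}
# The irrelevant detail is masked out, so it cannot influence the score
mask = {
    "distance_to_table": True,
    "distance_to_laptop": True,
    "person_leaning_on_table": False,
}

reward = masked_reward(features, weights, mask)
print(round(reward, 3))
```

    Changing the value of a masked feature leaves the score unchanged, which is the point of the mask: the learner cannot latch onto details that do not matter for the instruction.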

    In experiments, the system correctly identified unstated user preferences up to 15 percent more often than comparable baselines. Real-world tests showed a robotic arm successfully moving a coffee mug around a laptop, wiping a table while “staying close” to it, and handing a user a bag of chips while “staying away” from both the person and the table – all after fewer than 50 kinesthetic demonstrations.

    The team plans to enhance Masked IRL with camera input, allowing robots to visually focus on relevant objects in dynamic environments. The work was supported by the Tata Group via the MIT Generative AI Impact Consortium Award and the Department of Defense, and will be presented at the 2026 IEEE International Conference on Robotics and Automation.