Community Debates Meta's MILS: Can LLMs Really "See and Hear" Without Training?

BigGo Editorial Team

Meta's FAIR (Facebook AI Research) recently published a paper titled "LLMs can see and hear without any training", which has sparked significant debate within the AI community. The paper introduces MILS, a method that enables language models to perform multimodal tasks like image, audio, and video captioning without being trained specifically for those modalities. However, the community's reaction suggests the title may be more provocative than the technical achievement warrants.

The Actor-Critic Architecture by Another Name

At its core, MILS uses what many in the community immediately recognized as an Actor-Critic setup, though interestingly, this terminology is absent from the paper itself. The system employs a Generator (the LLM) and a Scorer (like CLIP) in an iterative process where the LLM generates captions and receives feedback from pre-trained scoring models.

Yes, apparently they've developed new names: Generator and Scorer. This feels a bit like Tai's Model

This approach has drawn comparisons to the Tai's Model phenomenon, in which an established concept is rebranded under new terminology (Tai's 1994 paper famously presented the trapezoidal rule as a novel method for calculating areas under metabolic curves). The community points out that while the method is clever, the paper's framing suggests more novelty than may be warranted.
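
To make the loop concrete, here is a minimal Python sketch of a Generator-Scorer iteration of this kind. It is not Meta's implementation: the CLIP scoring uses the standard Hugging Face transformers API, while generate_candidates() is a hypothetical placeholder for whichever text-only LLM produces new caption guesses each round.

```python
# Minimal sketch of a Generator-Scorer (actor-critic style) captioning loop.
# Not the official MILS code: generate_candidates() is a hypothetical
# stand-in for a text-only LLM; only the CLIP scoring uses a real API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_captions(image: Image.Image, captions: list[str]) -> list[float]:
    """Scorer: rate each candidate caption by CLIP image-text similarity."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # shape: (1, num_captions)
    return logits.squeeze(0).tolist()

def generate_candidates(history: list[tuple[str, float]]) -> list[str]:
    """Generator: ask a text-only LLM for new captions, conditioned on the
    best-scoring attempts so far. The LLM never sees the image itself."""
    raise NotImplementedError("call your LLM of choice here")

def caption_image(path: str, rounds: int = 10, keep: int = 5) -> str:
    image = Image.open(path)
    history: list[tuple[str, float]] = []
    for _ in range(rounds):
        candidates = generate_candidates(history)
        scores = score_captions(image, candidates)
        # Keep only the highest-scoring captions as feedback for the next round.
        history = sorted(zip(candidates, scores),
                         key=lambda pair: pair[1], reverse=True)[:keep]
    return history[0][0]  # best caption after the final round
```

The division of labor in the sketch is precisely what critics latch onto: the LLM only ever manipulates text, while all of the actual "seeing" happens inside the pre-trained CLIP scorer.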

Title vs. Reality: Understanding the Claims

Many commenters took issue with the paper's title, suggesting it misrepresents what's actually happening. The system doesn't truly enable LLMs to "see and hear" in the way the title implies. Rather, it creates a feedback loop where the LLM iteratively improves its outputs based on scores from models that have been trained on visual or audio data.

The approach is somewhat analogous to a blind person playing Marco Polo, where they navigate toward a goal based on warmer or colder feedback. The LLM isn't directly processing visual or audio inputs but is instead using textual feedback about its guesses to converge on appropriate descriptions.
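
In code terms, the warmer-or-colder signal is nothing more exotic than a prompt that lists previous guesses next to their scores. The helper below is purely illustrative; the exact prompt wording used in MILS is not reproduced here.

```python
def build_generator_prompt(history: list[tuple[str, float]], n_new: int = 10) -> str:
    """Turn the Scorer's numeric feedback into plain text for the LLM.
    The model is told which guesses were "warmer" and asked to get closer."""
    lines = [
        f"You are captioning an image you cannot see. Propose {n_new} new captions.",
        "Previous attempts, scored by an image-text matching model",
        "(higher score = closer to the image):",
    ]
    for caption, score in sorted(history, key=lambda p: p[1], reverse=True):
        lines.append(f"  {score:.2f}  {caption}")
    lines.append("Write captions likely to score higher than the ones above.")
    return "\n".join(lines)
```

Everything the language model learns about the image arrives through text like this, which is why many commenters see the title's claim as a stretch.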

Emergent Capabilities or Clever Engineering?

Some defenders of the paper highlight that the approach demonstrates emergent capabilities of LLMs. Since the language model wasn't explicitly trained to interpret feedback from visual models and adjust accordingly, its ability to do so could be considered an emergent property. The LLM is effectively finding its way toward correct descriptions without having examples of this specific task in its training data.

However, critics point out that the system still relies heavily on pre-trained multimodal models like CLIP, which have indeed been trained on vast amounts of visual data. The debate centers on whether "without any training" is an accurate characterization when the system depends on other trained components.

Anthropomorphizing AI Capabilities

A recurring theme in the comments is concern about the anthropomorphizing language used to describe AI systems. Some commenters drew satirical parallels to simple devices like photoresistors and thermostats, which can "see" darkness or "feel" temperature without any training or code.

While these analogies are clearly hyperbolic, they highlight a legitimate concern about how AI research is communicated. The use of human-like terms such as "seeing" and "hearing" may create misconceptions about what these systems are actually doing and how they work.

The community's reaction to this paper reflects broader tensions in AI research communication, where the pressure to produce attention-grabbing headlines sometimes conflicts with precise technical descriptions. As large research labs compete for attention and funding, there's a growing concern about unnecessary boosterism in how AI capabilities are framed.

Despite these criticisms, the technical approach described in the paper does represent an interesting method for leveraging LLMs in multimodal tasks without task-specific fine-tuning, even if the "without any training" claim requires significant qualification.

Reference: "LLMs can see and hear without any training"

The GitHub repository for Meta's MILS project, illustrating the technical foundation behind the controversial claims made about LLM capabilities