The Voynich Manuscript, a mysterious 15th-century document filled with undecipherable text and strange illustrations, continues to baffle researchers centuries after its creation. A recent computational analysis using modern Natural Language Processing (NLP) techniques has revealed fascinating insights into the manuscript's structure, suggesting it contains patterns consistent with an actual language rather than random gibberish.
Structured Analysis Reveals Language-Like Patterns
The analysis employed several NLP techniques including clustering of stripped root words using multilingual SBERT (Sentence-BERT), identification of function-word-like versus content-word-like clusters, and Markov-style transition modeling. By stripping recurring suffix-like endings from words (such as aiin, dy, and chy), the researcher was able to isolate what appeared to be root forms that repeated with variation. This preprocessing decision significantly improved clustering behavior, with similar stems grouping more tightly and the transition matrix showing cleaner structural patterns.
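The suffix-stripping step described above can be sketched in a few lines. The suffix list (aiin, dy, chy) comes from the article; the function name, the longest-first matching order, and the sample tokens are illustrative assumptions, not the researcher's actual code:

```python
# Illustrative sketch of the suffix-stripping preprocessing step.
# Suffixes aiin/dy/chy are named in the article; matching order and
# the guard against stripping a whole word are assumptions.

SUFFIXES = ("aiin", "chy", "dy")  # longest first so "aiin" wins over "dy"-style endings

def strip_suffix(word: str) -> str:
    """Return a candidate root form by removing one recurring suffix-like ending."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

tokens = ["qokaiin", "qokedy", "chedy", "shol"]
roots = [strip_suffix(t) for t in tokens]
# "qokaiin" -> "qok", "qokedy" -> "qoke", "chedy" -> "che", "shol" unchanged
```

Grouping the resulting roots (rather than the full tokens) is what reportedly tightened the clusters, since inflected variants of one stem collapse onto a single point.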
The findings revealed that certain clusters exhibit characteristics typical of natural languages. Cluster 8, for instance, shows high frequency, low diversity, and frequently appears at the beginning of lines—behavior consistent with function words in known languages. Meanwhile, Cluster 3 demonstrates high diversity and flexible positioning, suggesting it may represent content words. Perhaps most tellingly, the transition matrix shows strong internal structure that appears far from random, and cluster usage patterns differ noticeably between manuscript sections (like Biological versus Botanical sections).
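Markov-style transition modeling of the kind described is conceptually simple: count how often each cluster label follows each other label, then normalize rows into probabilities. The sketch below uses an invented label sequence purely for illustration; the article's actual cluster IDs come from its own clustering run:

```python
# Minimal sketch of Markov-style transition modeling over cluster labels.
# The label sequence is invented for demonstration only.
from collections import Counter, defaultdict

def transition_probs(sequence):
    """Estimate P(next cluster | current cluster) from an observed label sequence."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(nbrs.values()) for nxt, n in nbrs.items()}
        for cur, nbrs in counts.items()
    }

labels = [8, 3, 3, 8, 3, 8, 8, 3]
probs = transition_probs(labels)
# probs[8][3] is the estimated probability of moving from cluster 8 to cluster 3
```

A transition matrix built this way from random text would be close to uniform in each row; the strong off-diagonal structure reported for the manuscript is what makes the "far from random" claim testable.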
Figure: Heatmap of cluster transition probabilities, showing the linguistic patterns identified in the Voynich Manuscript.
Community Suggests Alternative Dimensional Reduction Techniques
While the original analysis used Principal Component Analysis (PCA) for dimensionality reduction, community members suggested more advanced alternatives that might reveal deeper structure. Several commenters recommended newer algorithms such as UMAP (Uniform Manifold Approximation and Projection), t-SNE, PaCMAP, and LocalMAP as potentially more effective tools for this type of data.
One commenter defended the simpler baseline: "When I get nice separation with PCA, I personally tend to eschew UMAP, since the relative distance of all the points to one another is easier to interpret. I avoid t-SNE at all costs, because distances in those plots are pretty much meaningless."
This discussion highlights an important methodological consideration in embedding visualization: while newer techniques might reveal more complex patterns, they sometimes sacrifice the interpretability of relative distances between points. The choice of dimensionality reduction technique can significantly affect which patterns researchers observe and how they interpret them.
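The tradeoff comes from the fact that PCA is a linear projection, so distances in the plot relate directly to distances in the embedding space. A minimal sketch using plain NumPy (the random vectors stand in for real SBERT embeddings; 384 dimensions matches MiniLM-L12-v2's output size):

```python
# Sketch of PCA via SVD with plain NumPy. The random embeddings are
# placeholders for the real SBERT vectors discussed in the article.
import numpy as np

def pca_project(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)            # PCA requires centered data
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T    # coordinates in PC space

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))       # 384-dim, like MiniLM-L12-v2
coords = pca_project(embeddings)
print(coords.shape)  # (100, 2)
```

UMAP and t-SNE would instead optimize a nonlinear layout of the same vectors, which can expose cluster structure PCA misses but makes inter-cluster distances unreliable.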
Outdated Embedding Models and Preprocessing Concerns
Another significant point raised by the community was that the embedding model used in the analysis—paraphrase-multilingual-MiniLM-L12-v2—is approximately four years old, which in the rapidly evolving field of NLP is considered outdated. Commenters suggested that modern text embedding models, even those not explicitly trained for multilingual support, might perform better on unknown languages like that of the Voynich Manuscript.
Additionally, some questioned whether traditional NLP techniques like suffix stripping might actually harm embedding quality by removing relevant contextual data. The original researcher acknowledged this limitation, noting that the suffix stripping was an aggressive preprocessing decision that may have removed genuine morphological information or disguised meaningful inflectional variants.
The Hoax vs. Language Debate Continues
The community remains divided on whether the Voynich Manuscript represents an actual language or an elaborate hoax. While some believe the manuscript is undecipherable gibberish, the statistical analyses consistently find patterns that would be unlikely to emerge from random text. As one commenter noted, to create such patterns, someone would have had to go a significant way toward building a full constructed language—an impressive feat in itself.
Others pointed out that humans are notoriously bad at generating true randomness, and someone attempting to create a fake language in the 15th century might unintentionally produce text with language-like statistical properties. The debate continues, with some researchers suggesting the manuscript might encode a structured constructed or mnemonic language using syllabic padding and positional repetition.
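A toy experiment shows why such statistical tests can separate structured text from random strings at all: rule-governed, repetitive text concentrates its character bigrams on a few patterns and therefore has lower bigram entropy than uniformly random text. Both "texts" below are synthetic; nothing here is drawn from the manuscript itself:

```python
# Toy illustration: repetitive, rule-governed text has lower character-bigram
# entropy than uniformly random text over the same alphabet size.
import math
import random
from collections import Counter

def bigram_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character-bigram distribution."""
    bigrams = [text[i : i + 2] for i in range(len(text) - 1)]
    counts = Counter(bigrams)
    total = len(bigrams)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

random.seed(0)
random_text = "".join(random.choice("abcdefgh") for _ in range(4000))
structured_text = "qokaiinchedy" * 250  # heavy positional repetition

print(bigram_entropy(random_text) > bigram_entropy(structured_text))  # True
```

The catch, as the commenters note, is that this cuts both ways: a human inventing gibberish also fails to be uniformly random, so low entropy alone cannot distinguish a language from a disciplined hoax.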
The application of modern computational techniques to this centuries-old mystery demonstrates how technology can shed new light on historical puzzles. While we may not have cracked the code of the Voynich Manuscript yet, these analyses are helping us understand its structure and narrow down the possibilities of what it might represent.
Reference: Voynich Manuscript Structural Analysis