Bluesky Dictionary Project Reveals Platform Has Only Captured 35% of English Words Despite Processing 4 Million Posts

BigGo Community Team

A fascinating experiment is tracking whether Bluesky users can collectively say every word in the English language. The Bluesky Dictionary project, created by developer Avi Bagla, monitors the platform's real-time posts to see how much of the English dictionary appears in everyday conversations. After processing over 4 million posts, the results show surprising gaps in our digital vocabulary.

Limited Coverage Despite Massive Data Processing

The project has analyzed 51.7 million words from 4.2 million Bluesky posts, yet covers only 35.57% of a standard English dictionary containing 274,937 words. This means nearly two-thirds of English words have never appeared in the analyzed posts. Community members expressed surprise at how common some of the missing words are, noting that everyday terms like congregant, definer, and stereoscope haven't been spotted yet.

However, the scope limitation is significant. With Bluesky hosting approximately 1.7 billion total posts according to community data, this project has only examined 0.28% of all messages on the platform. This small sample size might explain why many ordinary words remain uncaptured.

Current Statistics:

  • Dictionary Coverage: 35.57% (97,796 out of 274,937 words)
  • Total Words Processed: 51.7 million
  • Posts Analyzed: 4.2 million
  • Database Size: 58 MB
  • Data Processing Rate: ~900 kbps

Technical Implementation and Real-World Challenges

The backend takes a straightforward approach: SQLite tables track word statistics and usage patterns. The creator processes Bluesky's data stream at roughly 900 kilobits per second, storing each unique word with its count and first/last-use timestamps. At 58 megabytes, the resulting database shows how compactly this kind of text analysis can be stored.
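
The exact schema isn't published, but a minimal sketch of the two-table layout described above could look like the following Python, where the table and column names (words, sightings, first_seen, and so on) are illustrative assumptions rather than the project's actual design:

```python
import sqlite3
import time

# Hypothetical schema approximating the two-table layout described above:
# one table for per-word statistics, one for post references.
conn = sqlite3.connect("dictionary.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS words (
    word       TEXT PRIMARY KEY,   -- lowercase dictionary word
    count      INTEGER NOT NULL,   -- how many times it has been seen
    first_seen INTEGER NOT NULL,   -- unix timestamp of first sighting
    last_seen  INTEGER NOT NULL    -- unix timestamp of latest sighting
);
CREATE TABLE IF NOT EXISTS sightings (
    word     TEXT NOT NULL REFERENCES words(word),
    post_uri TEXT NOT NULL          -- reference to the post that used the word
);
""")

def record_word(word: str, post_uri: str) -> None:
    """Upsert a word's count and timestamps, keeping a reference to the post."""
    now = int(time.time())
    conn.execute(
        """INSERT INTO words (word, count, first_seen, last_seen)
           VALUES (?, 1, ?, ?)
           ON CONFLICT(word) DO UPDATE SET
               count = count + 1,
               last_seen = excluded.last_seen""",
        (word, now, now),
    )
    conn.execute("INSERT INTO sightings (word, post_uri) VALUES (?, ?)",
                 (word, post_uri))
    conn.commit()
```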

Several users reported technical difficulties accessing the site, encountering SSL errors and CORS issues. The reliance on client-side JavaScript for displaying results created barriers for users with strict browser security settings or corporate firewalls.

Technical Architecture:

  • Backend: SQLite database with two main tables
  • Data Source: Bluesky Jetstream API (compressed firehose)
  • Word Dictionary: GitHub's "an-array-of-english-words" (274,937 words)
  • Processing: Real-time word tokenization and dictionary lookup (see the sketch after this list)
  • Storage: Word count, first use, last use, and post reference
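
The project's own pipeline isn't documented in detail, but a rough sketch of a Jetstream consumer that tokenizes post text and checks it against the word list might look like this. The endpoint URL, event shape, and the words.txt file (the an-array-of-english-words list exported one word per line) are assumptions drawn from Bluesky's public Jetstream documentation, not confirmed details of this project:

```python
import asyncio
import json
import re

import websockets  # pip install websockets

# A public Jetstream endpoint filtered to post records (illustrative; the
# project's actual endpoint and options are not published).
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)

# The dictionary word list, assumed here to be exported one word per line.
with open("words.txt") as f:
    DICTIONARY = set(line.strip().lower() for line in f)

WORD_RE = re.compile(r"[A-Za-z']+")

def tokenize(text: str) -> list[str]:
    """Split post text into lowercase word-like tokens."""
    return [t.lower() for t in WORD_RE.findall(text)]

async def consume() -> None:
    async with websockets.connect(JETSTREAM_URL) as ws:
        async for message in ws:
            event = json.loads(message)
            commit = event.get("commit") or {}
            if commit.get("operation") != "create":
                continue
            record = commit.get("record") or {}
            hits = [w for w in tokenize(record.get("text", "")) if w in DICTIONARY]
            # record_word(...) from the storage sketch above would be called here.
            print(hits)

asyncio.run(consume())
```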

Unexpected Discoveries and Gaming the System

The project has captured some truly obscure terms like stigmatophilia, algolagnia, and pyrosomes while missing common words. Some users have begun deliberately posting rare dictionary words to boost the coverage percentage. One user managed a double-combo by using both wheal and sluices in a single post about a Cornwall museum visit.

The system also faces accuracy challenges, such as indexing eluvium when users were discussing the band name rather than the geological term. Language detection issues arise when French posts containing English-looking words get processed, though Bluesky does include language metadata that could help filter results.
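
Post records do carry a self-reported langs field (an array of language codes on app.bsky.feed.post), so one way to reduce that noise would be to skip posts that don't declare English before tokenizing. A minimal filter along those lines, keeping in mind that the field is optional and user-supplied, might be:

```python
def is_probably_english(record: dict) -> bool:
    """Skip posts whose self-declared languages don't include English.

    The langs field is optional and user-supplied, so posts without it
    are kept and left to the dictionary lookup to sort out.
    """
    langs = record.get("langs")
    if not langs:
        return True
    return any(lang.split("-")[0].lower() == "en" for lang in langs)

# Inside the consumer loop from the previous sketch:
#     if not is_probably_english(record):
#         continue
```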

This experiment reveals how digital conversations, despite their massive scale, represent only a fraction of human language. Even with millions of posts, our online vocabulary remains surprisingly limited compared to the full richness of English.

Reference: The Bluesky Dictionary