Current Transliteration Performance & Strategic Roadmap for Optimization
I. Executive Summary
The eimiAI_Paucinhau model is currently transitioning from a phonetic transliteration engine to a context-aware linguistic engine. Based on a detailed analysis of Genesis chapters 1–10 in the Tedim Bible, the model demonstrates an overall match rate of 45%–55% against human-corrected benchmarks.
Current Achievement
Successfully handles basic root vocabulary
Remaining Challenges
Significant error margins due to complex grammatical tone rules (Sandhi) and lack of standardized spelling data for proper nouns
Strategic Value
High error rate is a critical asset for "Data Collection Phase"
Because every error is corrected by a human reviewer, this phase allows us to build a robust "Golden Dataset" for Reinforcement Learning from Human Feedback (RLHF).
II. Performance Metrics
Overall Accuracy
45%–55%
Status: Developing
Common Narrative (Gen 6–9)
60% – 70%
Status: Promising
Genealogy/Proper Nouns (Gen 10)
< 15%
Status: Critical Attention Needed
III. Detailed Error Analysis
The discrepancy between AI output and human correction allows us to isolate three primary failure modes:
01
The "Word-Isolation" Problem (Context Blindness)
Issue: The AI currently treats words as isolated units rather than parts of a sentence structure.
Impact (20% of errors): It fails to apply Sandhi tones. For example, the Tedim subject marker "in" changes tone depending on whether it marks an actor or an instrument. Without syntactic awareness, the AI defaults to a generic glyph and misses the required tonal flow (e.g., Mihing vs. Mihing-sandhi).
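The fix described above can be sketched as a lookup keyed on both the word and its syntactic role, rather than the word alone. The rule table and surface forms below are placeholders, not real Tedim tone data:

```python
# Minimal sketch of syntax-aware sandhi selection. The rule table and the
# "-HIGH"/"-LOW" surface forms are illustrative placeholders, not actual
# Tedim tone rules.

# (citation form, syntactic role) -> context-adjusted surface form
SANDHI_RULES = {
    ("mihing", "agent"): "mihing-HIGH",      # actor reading (illustrative)
    ("mihing", "instrument"): "mihing-LOW",  # instrument reading (illustrative)
}

def apply_sandhi(word: str, role: str) -> str:
    """Return the context-adjusted form, falling back to the citation form."""
    return SANDHI_RULES.get((word, role), word)
```

A word-isolation engine only ever sees the first argument; adding the role as part of the key is what makes the tonal flow recoverable.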
02
The "Dictionary Deficit" (Proper Nouns)
Issue: The AI lacks a dedicated lexicon for proper nouns and attempts to "sound out" names letter-by-letter.
Impact (Genealogy Failure): Biblical names in Pau Cin Hau often use fixed, logographic spellings (e.g., Noah = 𑫐𑫘𑫥𑫕𑫧) that do not follow standard phonetic rules. This results in "hallucinated" spellings for names like Arpakshad or Joktan.
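The remedy is a lexicon-first lookup: consult a fixed proper-noun dictionary before falling back to the letter-by-letter phonetic engine. The sketch below uses the Noah spelling given in this report; `transliterate_phonetically` is a stand-in for the current sound-it-out engine, not its real interface:

```python
# Sketch: lexicon-first transliteration with phonetic fallback.

PROPER_NOUN_LEXICON = {
    "Noah": "\U00011AD0\U00011AD8\U00011AE5\U00011AD5\U00011AE7",  # fixed spelling from the report
}

def transliterate_phonetically(word: str) -> str:
    # Stand-in for the current letter-by-letter engine.
    return f"<phonetic:{word}>"

def transliterate(word: str) -> str:
    """Use the fixed logographic spelling when one exists."""
    fixed = PROPER_NOUN_LEXICON.get(word)
    return fixed if fixed is not None else transliterate_phonetically(word)
```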
03
Consonant Class Disambiguation
Issue: Romanized Tedim uses characters such as 'B', 'P', 'D', and 'T', which map to specific consonant "Classes" in Pau Cin Hau (e.g., the 'Ka' class vs. the 'Pa' class).
Impact (15% of errors): The AI frequently selects the wrong consonant class (e.g., spelling bawl with a KA glyph instead of BA), altering the word's meaning entirely.
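One way to eliminate this class of error is a deterministic table from the Romanized initial to its glyph class, so the engine can never "guess" across classes. The class names follow the report ('KA', 'BA'); the table itself is an illustrative sketch, not verified Pau Cin Hau data:

```python
# Sketch: deterministic consonant-class table for Romanized initials.
# Class labels follow the report; the mapping is illustrative only.

CONSONANT_CLASSES = {
    "b": "BA",
    "p": "PA",
    "d": "DA",
    "t": "TA",
    "k": "KA",
}

def initial_class(word: str) -> str:
    """Return the glyph class for a word's initial consonant."""
    return CONSONANT_CLASSES.get(word[0].lower(), "UNKNOWN")
```

Under this scheme, "bawl" can only resolve to the BA class; selecting KA (the reported failure) becomes unrepresentable.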
IV. Strategic Next Steps
To evolve the model from a 50% phonetic engine to a 95%+ linguistic expert, we will execute the following protocol:
Immediate Action: Data Collection & Calibration
Process Matthew 1–10
We must process these chapters to capture more narrative flow. This is essential for training the AI on sentence-level tone rules (Sandhi) that cannot be learned from lists of names.
Build the "Golden Dataset"
Every correction made during this phase is being logged into a structured JSON file. This file will serve as the "Ground Truth" for the model.
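A record in that structured JSON file might look like the sketch below. The field names (`ref`, `ai_output`, `gold`, `match`) are our proposed schema, not a fixed format, and the JSON-lines layout is an assumption:

```python
# Sketch of one "Golden Dataset" record and an append-only JSON-lines logger.
# The field names are a proposed schema, not a fixed format.
import json

def make_record(verse_ref: str, source: str, ai_output: str, gold: str) -> dict:
    """Build one correction record from a (source, AI output, human gold) triple."""
    return {
        "ref": verse_ref,       # e.g. "Gen 1:1"
        "source": source,       # Romanized Tedim input
        "ai_output": ai_output, # model transliteration
        "gold": gold,           # human-corrected transliteration
        "match": ai_output == gold,
    }

def log_correction(path: str, record: dict) -> None:
    """Append one record as a JSON line (ensure_ascii=False keeps the script glyphs readable)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The `match` flag makes it trivial to recompute the per-section accuracy figures in Section II directly from the log.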
Required Developer Actions (High Priority)
To achieve the best outcomes, the following inputs are required from the developer:
1
Upload the "Cipher Key"
A screenshot of the spelling drill tables (e.g., Page 16 or 46 from BAOL 2) is needed. This will allow us to decode the font encoding in the provided textbooks and instantly ingest thousands of validated words into the dictionary.
2
Proper Noun Injection
Provide a CSV or text list mapping Romanized Biblical names to their standardized Pau Cin Hau script (e.g., Noah | 𑫐𑫘𑫥𑫕𑫧). This will immediately resolve the <15% accuracy rate in genealogy sections.
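Once delivered, that list can be ingested in a few lines. The sketch below assumes the pipe-separated two-column format shown above ("Romanized | script"); if the developer supplies comma-separated data instead, only the delimiter changes:

```python
# Sketch: load a pipe-separated "Romanized | script" name list into a dict.
# Assumes the two-column format shown in the report.
import csv
import io

def load_name_map(text: str) -> dict:
    """Parse pipe-separated lines into a Romanized-name -> script mapping."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return {roman.strip(): script.strip() for roman, script in reader if roman.strip()}
```

The resulting dict plugs directly into a lexicon-first lookup, which is what resolves the <15% genealogy accuracy.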
3
Style Guide Decision
A decision is needed on whether modern loanwords (e.g., "Facebook", "YouTube") should be transliterated phonetically or kept in Latin script, as seen in the PAWL 3 textbook.
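Whichever option is chosen, it can be exposed as a single configuration flag so the pipeline never mixes policies mid-document. The flag name and values below are our proposal, and the phonetic branch is a stand-in, not the real engine:

```python
# Sketch: loanword policy as a single config flag (names are a proposal).
LOANWORD_POLICY = "keep_latin"  # alternative: "transliterate_phonetic"

def render_loanword(word: str, policy: str = LOANWORD_POLICY) -> str:
    """Render a modern loanword according to the chosen style-guide policy."""
    if policy == "keep_latin":
        return word  # leave "Facebook", "YouTube" etc. in Latin script
    return f"<phonetic:{word}>"  # stand-in for phonetic transliteration
```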
V. Conclusion
The current error rate is expected and necessary. By identifying these specific failure points now, we are creating the precise training data needed to fine-tune the model.
Current State
45%–55% overall accuracy
Three identified failure modes
Active data collection phase
Next Iteration
Integration of "Golden Dataset"
Decoding of textbooks
Anticipated significant accuracy jump
With the integration of the "Golden Dataset" and the decoding of the textbooks, we anticipate a significant jump in accuracy in the next iteration.