Data Resources

  • Today, NLP work relies mainly on Machine Learning and Deep Learning, which require data that is large, broad in coverage, and accurate.
  • A study of the Shan text that is available online (in digital format) led to these observations:
    • Shan text that has been published, proofread, and can be trusted is very scarce, and there is less of it every day.
    • Published text does not follow a common standard, for example in spelling conventions and in the use of ယၵ်းၶိုၼ်ႈ.
  • Data that can be used right away amounts to less than 500MB, and the topics it covers are not yet broad enough.

Language Data Resource

Shan Digital Datasets:

  • Shan Wikipedia ~11MB
  • shannews.org ~66MB
  • taifreedom.com ~46MB
  • ssppssa.org ~2.8MB
  • shan.shanhumanrights.org ~2.6MB
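
As a quick sanity check on the figures above, the sketch below adds up the five listed sizes and shows each source's share. The sizes are the approximate values from the list; the names `SOURCES_MB`, `total_mb`, and `shares` are illustrative only.

```python
# Approximate sizes in MB, copied from the list above.
SOURCES_MB = {
    "Shan Wikipedia": 11.0,
    "shannews.org": 66.0,
    "taifreedom.com": 46.0,
    "ssppssa.org": 2.8,
    "shan.shanhumanrights.org": 2.6,
}


def total_mb(sources):
    """Sum of all source sizes in MB."""
    return sum(sources.values())


def shares(sources):
    """Fraction of the total contributed by each source."""
    total = total_mb(sources)
    return {name: size / total for name, size in sources.items()}


print(f"total: about {total_mb(SOURCES_MB):.1f} MB")
for name, share in sorted(shares(SOURCES_MB).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```

The listed sources add up to roughly 128MB, which is consistent with the "less than 500MB" estimate, and two news sites account for most of it.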

Thematic Word Lists

  • Everyday vocabulary, in both spelling and wording, does not yet cover enough ground in areas such as daily life, nature, and technology.
  • For example, how should "Technology" be spelled? Both ထဵၵ်ႇၶ်ၼူဝ်ႇလူဝ်ႇၸီႇ and ထဵၶ်ၼေႃႇလူဝ်ႇယီႇ are in use, and writers may spell it in many more ways of their own.
  • Translating datasets from other languages into Shan is hard because of these spelling difficulties and because suitable words often cannot be found.
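
One way a thematic word list could be put to work is as a variant table that maps every observed spelling to a single agreed form. The sketch below is only an illustration: the two spellings come from the example above, but which one counts as canonical is an arbitrary assumption here, and `VARIANTS` and `normalize_spelling` are hypothetical names.

```python
# Assumption for illustration only: the first spelling is treated as canonical.
CANONICAL = "ထဵၵ်ႇၶ်ၼူဝ်ႇလူဝ်ႇၸီႇ"

# Hypothetical variant table: observed spelling -> canonical spelling.
VARIANTS = {
    "ထဵၶ်ၼေႃႇလူဝ်ႇယီႇ": CANONICAL,
}


def normalize_spelling(text, variants=VARIANTS):
    """Replace every known variant spelling with its canonical form."""
    # Longer variants are replaced first so that a short variant never
    # rewrites part of a longer one.
    for variant in sorted(variants, key=len, reverse=True):
        text = text.replace(variant, variants[variant])
    return text


sample = "Technology = " + next(iter(VARIANTS))
print(normalize_spelling(sample))
```

A real table would need one entry per attested variant, agreed on by reviewers, which is exactly the kind of resource the list above says is still missing.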

How Much Data Are We Talking About?

  • Example: the Common Crawl portion of GPT-3's training data was about 45TB of compressed plain text before filtering and about 570GB after filtering.
  • How much data Shan needs, and what kind of data it must be, depends on the NLP task, as explained below.

Language       Number of Documents   Percentage of Total Documents
en (English)   235987420             93.68882%
de (German)    3014597               1.19682%
fr (French)    2568341               1.01965%
th (Thai)      41301                 0.01640%
my (Burmese)   2147                  0.00085%

Source: GPT-3 Dataset Statistics
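
To make the scale of the gap concrete, the sketch below recomputes two things from the document counts and percentages quoted above: the total number of documents implied by each row, and how many English documents exist per Thai or Burmese document. Only the figures come from the statistics; the names `DOCS`, `PCT`, `implied_total`, and `ratio` are illustrative.

```python
# Document counts and percentages as quoted in the statistics above.
DOCS = {"en": 235_987_420, "de": 3_014_597, "fr": 2_568_341, "th": 41_301, "my": 2_147}
PCT = {"en": 93.68882, "de": 1.19682, "fr": 1.01965, "th": 0.01640, "my": 0.00085}


def implied_total(lang):
    """Total document count implied by one row (count / percentage)."""
    return DOCS[lang] / (PCT[lang] / 100.0)


def ratio(lang_a, lang_b):
    """How many documents of lang_a exist per document of lang_b."""
    return DOCS[lang_a] / DOCS[lang_b]


for lang in DOCS:
    print(f"{lang}: implied total of about {implied_total(lang) / 1e6:.1f} million documents")
print(f"English documents per Thai document: {ratio('en', 'th'):.0f}")
print(f"English documents per Burmese document: {ratio('en', 'my'):.0f}")
```

Every row implies a total of roughly 252 million documents, so the rows are mutually consistent, and English outnumbers Thai by a factor of several thousand.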

NLP (Natural Language Processing) Tasks

  • Tokenization

  • Part-of-Speech (POS) Tagging

  • Named Entity Recognition (NER)

  • Sentiment Analysis

  • Text Classification

  • Dependency Parsing

  • Coreference Resolution

  • Machine Translation

  • Text Summarization

  • Question Answering

  • Text Generation

  • Speech Recognition

  • Text Clustering

  • Text Similarity

  • Topic Modeling

  • Semantic Role Labeling (SRL)

  • Constituency Parsing

  • Lexical Semantics

  • Word Sense Disambiguation (WSD)

  • Language Modeling (LM)

Dataset Development to Feed NLP Tasks

  • Tokenization
    • Dictionary-based or rule-based: a trusted dictionary that covers all Shan words (known problem: out-of-vocabulary words).
    • ML- or DL-based: any text data that covers all Shan words (known problem: large datasets are needed).
  • Machine Translation
    • A large parallel (language-pair) dataset in English-Shan, Myanmar-Shan, or Thai-Shan.
  • Speech Recognition
    • A dataset of clear voice recordings with transcripts.
  • Language Modeling
    • Generative use: a large collection of texts and stories covering a broad range of topics, including daily life, nature, technology, etc.
    • Question answering (prompt-based, like ChatGPT): a wide-ranging QA dataset.
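
The dictionary-based tokenization option and its out-of-vocabulary problem can be sketched with greedy longest matching. This is a minimal illustration, not a proposed Shan tokenizer: longest matching is one common dictionary technique chosen here as an assumption, `TOY_DICT` is a four-word toy dictionary built from words that appear in this document, and `tokenize` is a hypothetical helper.

```python
# Toy dictionary for illustration; a real one would need to cover all Shan words.
TOY_DICT = {"လိၵ်ႈ", "တႆး", "လိၵ်ႈတႆး", "ၶေႃႈမုၼ်း"}

SAMPLE = "ၶေႃႈမုၼ်းလိၵ်ႈတႆး"


def tokenize(text, dictionary):
    """Greedy longest-match segmentation.

    Returns (token, in_dictionary) pairs. Text that matches no dictionary
    entry is emitted one code point at a time and flagged as
    out-of-vocabulary (in_dictionary is False).
    """
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            tokens.append((text[i], False))  # out-of-vocabulary fallback
            i += 1
        else:
            tokens.append((match, True))
            i += len(match)
    return tokens


for token, known in tokenize(SAMPLE, TOY_DICT):
    print(token, "" if known else "(out of vocabulary)")
```

Any word missing from the dictionary falls apart into single code points, which is why the dictionary has to be both trusted and complete for this approach to work.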

Technical Needs in Development

  • Dataset Development
    • Tokenization: collect and validate dictionaries and text data.
    • Machine Translation: collect and validate language-pair data.
    • Speech Recognition: find speakers, then collect, transcribe, and validate recordings.
    • Language Modeling (LLM): collect texts and stories.
  • Tech-related Development
    • Database and data collector (e.g. a web-based tool for crowdsourced data collection)
    • Data preprocessing
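
A data preprocessing step for collected text might look like the sketch below. The steps it applies are assumptions for illustration: Unicode NFC normalization, whitespace cleanup, exact-duplicate removal, and a filter that keeps lines written mostly in the Unicode Myanmar block (U+1000 to U+109F), which is where Shan letters are encoded. The 0.5 threshold and all function names are hypothetical.

```python
import re
import unicodedata


def clean_line(line):
    """Normalize to NFC and collapse runs of whitespace into single spaces."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()


def script_ratio(line):
    """Fraction of non-space characters that fall in the Myanmar block."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    inside = sum(1 for ch in chars if "\u1000" <= ch <= "\u109f")
    return inside / len(chars)


def preprocess(lines, min_ratio=0.5):
    """Clean lines, drop empties and exact duplicates, keep Myanmar-block text."""
    seen = set()
    kept = []
    for raw in lines:
        line = clean_line(raw)
        if not line or line in seen:
            continue
        if script_ratio(line) < min_ratio:
            continue
        seen.add(line)
        kept.append(line)
    return kept


raw_lines = ["  လိၵ်ႈတႆး   ", "လိၵ်ႈတႆး", "Read more", ""]
print(preprocess(raw_lines))
```

The script filter cannot tell Shan from Burmese, since both share the Myanmar block, so real language identification would need something more than this check.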