TaigiLex

Objectives overview

  1. A basic toolbox for online text editing in Taiwanese, to enable (2) (more below)
  2. An editable version of 萌典, to enable (3)
  3. The crowd compilation of a Taiwanese dictionary in Taiwanese, post-reviewed by a small team
  4. A description of the various lexical relations between the entries in the lexicon, to enable more intelligent text processors

(The English and Mandarin versions may not be fully in sync.)

(目前英文及華文版說明尚未同步)

目標簡介

  1. 發展可在線上編輯多重書寫系統台語文的書寫工具。
  2. 發展可以協同編輯的萌典。
  3. 發展群眾編輯、專家審議的開放台語詞典。
  4. 增修台語詞典詞條與詞條之間關係的描述,以幫助開發更智慧的台語文書寫工具。

Detailed description

 The Toolbox

 

 What should it include:

This shall be done as independent libraries that will be released as open source.

A platform for the crowd compilation of a dictionary.

Those libraries will be used by an interface that lets netizens edit the dictionary. (Probably integrated or directly linked with 萌典; still to be discussed, as forking or adding modules to Etherpad might be better.)

Supervision of the compilation process

A small team shall follow the editing of the dictionary (for selection and correction).

The dictionary will be released under a Creative Commons licence and made available through 萌典 (and downloadable as EPUB?).

Towards a lexical graph

Time should be spent describing the relations between the entries of the dictionary (synonyms, antonyms, metonyms, ...) to build a machine-readable lexicon that supports and promotes Taiwanese text processing.
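As a rough illustration of what such a machine-readable lexicon could look like, here is a minimal sketch of a typed relation graph between entries. The class name, method names, and the choice to treat synonym and antonym links as symmetric are assumptions made for this example, not decisions taken by the project; the sample pair tōa (big) / sè (small) is only there to exercise the code.

```python
from collections import defaultdict

# Assumption: synonym and antonym links hold in both directions;
# other relation types (e.g. metonym) are stored one-way.
SYMMETRIC_RELATIONS = {"synonym", "antonym"}


class LexicalGraph:
    """Hypothetical in-memory store of typed relations between entries."""

    def __init__(self):
        # entry -> relation type -> set of related entries
        self._edges = defaultdict(lambda: defaultdict(set))

    def add_relation(self, source, relation, target):
        """Record `source --relation--> target` (and the reverse if symmetric)."""
        self._edges[source][relation].add(target)
        if relation in SYMMETRIC_RELATIONS:
            self._edges[target][relation].add(source)

    def related(self, entry, relation):
        """Return the entries linked to `entry` by `relation`, sorted."""
        by_relation = self._edges.get(entry, {})
        return sorted(by_relation.get(relation, set()))

    def to_triples(self):
        """Flatten the graph into (source, relation, target) triples for export."""
        return sorted(
            (source, relation, target)
            for source, by_relation in self._edges.items()
            for relation, targets in by_relation.items()
            for target in targets
        )


graph = LexicalGraph()
graph.add_relation("tōa", "antonym", "sè")
print(graph.related("sè", "antonym"))
print(graph.to_triples())
```

A flat list of triples like the one returned by to_triples() is easy to serialize (CSV, JSON, RDF) and could be attached to whichever dictionary format is finally chosen.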

詳細說明

台語文書寫工具

協同編輯詞典的平台

語意網(?)

What we need

 

A small team of 2 to 4 people for 6 to 12 months?

I think it’s reasonable for the coding part (1 person).

One problem is that the lexicographic work can only start once the code is working.

More precisely: 

Conclusion: ask for a budget of at least 3 person-months for the tools and base dataset, then another 2 to 3 months to set up and start crowdsourcing, plus any amount of funding for Taiwanese lexicographers; the more the better.

Discussion

User scenario

Related resources and problems

Online dictionaries

  • 教育部
  • 信望愛

  • 中研院
  • 楊允言
  • Wikipedia
  • 李勤岸

    Input Methods

    Goals/Dreams (To be sorted into short/mid/long term goals later)

    Current status

    Case 0: Taiwanese Orthography

    Case 1: From FHL-IME to FHL-Dictionary

    From pektiong=pcchen: Currently, the 信望愛台語料庫 uses LIFT as its data format and WeSay as the client to edit the data and sync it with the hg server (Bitbucket). lift-convert is then used to convert the entries (in the format "lomaji hanji") to CSV, which is then converted into a database for the FHL-IME.

    In principle, multiple people can use WeSay to edit the same data; when WeSay syncs with the hg server, it performs an auto-merge. Unfortunately, most of the time I am the only user. One can also edit the LIFT XML directly, then use WeSay to sync it to the hg server.

    WeSay is slow and currently has a stable version only on Windows. WeSay also comes with some tools to convert LIFT into HTML/PDF/Word, but they don’t work well with large XML files (such as 信望愛台語料庫). LIFT is designed for storing lexical information, so in principle it is a good choice of data format for the Taiwanese dictionary. For a long time I could not find good dictionary software, so I didn’t try to use WeSay/LIFT to build a Taiwanese dictionary. Currently it only contains many entries in a format like "jiû-hî 鰇魚", to be used for the IME.
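The conversion step described above (LIFT entries holding "lomaji hanji" text, exported to CSV) can be sketched roughly as follows. This is not the actual lift-convert tool, only an illustration using Python's standard library. The sample XML is a hand-written fragment following the usual LIFT layout (entry, lexical-unit, form, text); the entry id, the language code "nan", the function names, and the CSV column names are assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hand-written sample; the id and lang values are illustrative assumptions.
SAMPLE_LIFT = """<lift version="0.13">
  <entry id="jiu-hi">
    <lexical-unit>
      <form lang="nan"><text>jiû-hî 鰇魚</text></form>
    </lexical-unit>
  </entry>
</lift>
"""


def lift_to_rows(lift_xml):
    """Extract (lomaji, hanji) pairs from the lexical unit of each entry."""
    root = ET.fromstring(lift_xml)
    rows = []
    for entry in root.iter("entry"):
        text = entry.findtext("lexical-unit/form/text")
        if not text:
            continue
        # Assumption: the hanji part contains no spaces, so splitting on the
        # last run of whitespace keeps multi-word lomaji together.
        parts = text.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip entries that are not in "lomaji hanji" form
        rows.append((parts[0], parts[1]))
    return rows


def rows_to_csv(rows):
    """Serialize the pairs as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["lomaji", "hanji"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(lift_to_rows(SAMPLE_LIFT)))
```

Entries that do not split into exactly two parts are silently skipped here; a real converter would more likely log them for a lexicographer to check.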

    We can also consider LMF (http://www.lexicalmarkupframework.org/, http://en.wikipedia.org/wiki/Lexical_Markup_Framework).

    With the birth of 萌典, it has become more feasible to build a Taiwanese dictionary by (1) creating or aggregating dictionary data and (2) converting the data to the 萌典 format. This is something I am testing now. The idea is to

    Links

    Case 2: Wikipedia and Digital Archive