
[project] Localized dictionary


  1. #1
    Valued Member
    Join Date
    Aug 2008
    Location
    Belgium
    Posts
    2,434
    Thanks
    127
    Thanked 189 Times in 130 Posts

    [project] Localized dictionary

    Hi,

    Since not all of us speak English as our main language,
    it would be a good idea to be able to import a localized dictionary for easier text input.

    In the firmware 0.9.3.3 topic, nice2know_u mentions that there are 2 important files:
    en_1 -> Personal & custom dictionary
    en1 -> Default English dictionary

    In the same topic, AlexNesterov tells us that editing the files should not be much work:
    -> add the extension "txt" to the file name
    -> open the file with a text editor and modify it to your needs
    -> save the changes and remove the "txt" extension

    To import an (almost) fully localized dictionary, you first need to find a word list, so that you can copy and paste the words instead of typing them yourself.
    For almost any language there are open-source dictionaries such as:
    Ispell, Aspell, MySpell, ...

    If you look into the packages for the languages you want, you will find the community that maintains those files.
    They mostly store the words in a hyphenated format such as "aal·moe·ze·niers·ka·mers",
    so you will need to edit the list a bit.

    Example with the Dutch dictionary:
    http://www.ntg.nl/spelling/latin1/woorden.max
    This page contains 222,872 Dutch words that are also found in the OOo MySpell dictionary.
    -> open an editor (for example Notepad++)
    -> paste the whole word list into it
    -> go to find and replace
    -> under "find", enter the separator used inside the words: in the example it is "·"
    -> under "replace", enter nothing, so the parts of each word get joined
    -> select "replace all"
    Be patient, it can take some time. When it is done, continue:
    -> copy all the words, which are now in the correct format
    -> open Notepad (it performs much better!)
    -> paste the whole list of words into it
    -> save the file as xxx.txt somewhere so you have a source/backup file

    For our Dutch example this gives the file dutch_dictionary.txt
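
    The find-and-replace steps above can also be scripted. This is only a minimal Python sketch, assuming the word list is available as lines of text in which "·" marks the breaks inside each word; the function name join_syllables is made up for this illustration.

```python
def join_syllables(lines, separator="\u00b7"):
    """Remove the break mark from every entry and drop empty lines."""
    words = []
    for line in lines:
        # Strip the line ending, then delete every separator character.
        word = line.strip().replace(separator, "")
        if word:
            words.append(word)
    return words


sample = ["aal\u00b7moe\u00b7ze\u00b7niers\u00b7ka\u00b7mers", "", "woor\u00b7den\u00b7boek"]
print(join_syllables(sample))
```

    Since the URL of the Dutch list contains "latin1", the file probably has to be opened with encoding="latin-1" before its lines are passed to this function; that is a guess based on the URL only.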


    Next, you need to format the file correctly so the M8 can use it:
    (tx2 Skoddi)
    1. Open the txt file with a hex editor (XVI32).
    2. Check which hex codes sit between the words; in my text file it was 0D 0A.
    3. Search for 0D 0A and replace it with
    81 00 00 00 [... how many 00 depends on the length of the longest word] 00 82
    (I chose these values because they did not occur in the file before.)
    You need this padding between the words because you will use wildcards, and it guarantees that a search will not match two words as one.
    4. Open XVIscript and write

    ADR 0
    JOKERON FF
    FIND 82 FF FF 81
    REPLACE 82 BY 02

    First line: makes the script start at the top of the file
    Second line: defines the wildcard
    Third line: searches for words with 2 letters (each FF is one letter; for longer words you have to insert more FF)
    Last line: replaces the hex value before the word with the correct prefix


    After that, copy the lines
    FIND 82 FF FF 81
    REPLACE 82 BY 02

    and paste them repeatedly with CTRL+V (you can hold the keys down).

    It should look like this:
    ADR 0
    JOKERON FF
    FIND 82 FF FF 81
    REPLACE 82 BY 02
    FIND 82 FF FF 81
    REPLACE 82 BY 02
    [1000 and more times]
    FIND 82 FF FF 81
    REPLACE 82 BY 02


    Then close the script (when it asks whether to save, say no).

    Then press F9. When it finishes without an error message, press it again, until you get the error message that the script could not find the 82 FF FF 81 line.

    Save the file.

    Exit the hex editor.
    Start the hex editor again and reopen the file.
    Then go back to the script and search for words with three letters, and so on:


    ADR 0
    JOKERON FF
    FIND 82 FF FF FF 81
    REPLACE 82 BY 03
    [...]
    -------

    word length 1-9 = prefix 01 - 09
    word length 10-15 = prefix 0A - 0F
    word length 16-25 = prefix 10 - 19
    word length 26-31 = prefix 1A - 1F
    -----------------------------------
    This can be automated with a Pascal script:
    (tx2 Crimson05)
    Code:
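    program dict;
    
    { Converts a plain word list (one word per line) into the M8 format: }
    { each word is written as one length byte followed by its letters.   }
    
    var
      workfile: Textfile;
      resultfile: Textfile;
      filename: string;
      tmp: string;
    
    begin
      writeln('Please enter File-Path:');
      readln(filename);
      assign(workfile, filename);
      assign(resultfile, 'C:\Temp\en2');
      reset(workfile);
      rewrite(resultfile);
      while not(eof(workfile)) do
      begin
        readln(workfile, tmp);
        tmp := lowercase(tmp);
        { skip blank lines so that no zero-length entry is written }
        if length(tmp) > 0 then
        begin
          writeln(tmp);
          { one length byte, directly followed by the word }
          write(resultfile, chr(length(tmp)) + tmp);
        end;
      end;
      close(workfile);
      close(resultfile);
      writeln('All done!');
      readln();
    end.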
    The script with instructions:
    http://rapidshare.com/files/281899260/dic.zip
    Again: there is no exception handling! If you do something wrong, the program will close with a runtime error.
    How to use:
    Enter the full path (e.g. C:\Temp\german.txt) to a text file with the format:
    word1
    word2
    word3

    en2 will be created in the program's folder (the listing above writes it to C:\Temp\en2)
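
    To check a generated en2 file before copying it to the phone, the records can be read back. A minimal Python sketch, assuming the file contains nothing but length-prefixed words (as written by the Pascal script) and that the letters are single-byte Latin-1; decode_dictionary is a made-up name. It is not meant for en1 or en_1, which are reported in this thread to start with additional data.

```python
def decode_dictionary(data, encoding="latin-1"):
    """Read length-prefixed records back into a list of words."""
    words = []
    pos = 0
    while pos < len(data):
        length = data[pos]                       # prefix byte
        chunk = data[pos + 1:pos + 1 + length]   # the word's letters
        if length == 0 or len(chunk) != length:
            raise ValueError("bad record at offset %d" % pos)
        words.append(chunk.decode(encoding))
        pos += 1 + length
    return words


print(decode_dictionary(b"\x05hello\x02ja"))
```

    If the word list that comes out matches the list that went in, the prefixes are consistent.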
    Dutch Language:
    Homepage wordlist: NTG Werkgroep Spelling
    Source file cleaned words : dutch_dictionary.txt
    Meizu M8 dictionary file : RapidShare: 1-CLICK Web hosting - Easy Filehosting
    (tx2 Crimson05)

    German Language:
    Source file cleaned words 1 : RapidShare Webhosting + Webspace
    Source file cleaned words 2 : RapidShare Webhosting + Webspace
    (tx2 Skoddi)
    Meizu M8 dictionary file : http://rapidshare.com/files/281847441/MeizuMe.com_Input_Dic_German.cab
    (tx2 Crimson05)

    French Language:
    Meizu M8 dictionary file : http://rapidshare.com/files/28173828...Dic_French.cab
    (tx2 Crimson05)

    Hungarian Language:
    Meizu M8 dictionary file : RapidShare: 1-CLICK Web hosting - Easy Filehosting
    (tx2 Crimson05)
    Last edited by evow04; 09-20-2009 at 06:15 PM.



  3. #2
    Senior Member
    Join Date
    Jan 2009
    Posts
    524
    Thanks
    52
    Thanked 131 Times in 46 Posts
    Great idea!! Can't wait for the German one...

    But I'm not really sure whether I'm going to use the default one or TouchPal. TouchPal actually seems to be the better one, and no one knows what will come with 1.0 and the new UI.

    TouchPal's T9 is really powerful!!

  4. #3
    Member
    Join Date
    Apr 2009
    Posts
    230
    Thanks
    18
    Thanked 26 Times in 19 Posts
    I did as described and added the .txt extension to the en1 data file,
    then opened it with a text editor on my Windows PC.

    But from the layout of the words and some ASCII symbols it is noticeable that plain text is not the right format for this file.

    So I think that, to change the dictionary to different languages, we first have to work out the right format so we can edit the file correctly.

  5. #4
    Valued Member
    Join Date
    Aug 2008
    Location
    Belgium
    Posts
    2,434
    Thanks
    127
    Thanked 189 Times in 130 Posts
    Lilm8:
    True, that was the part I still needed to test, because I am not sure whether I can just add words to the list.
    Question:
    If you open the en_1 file, does it have the same format/layout as the en1 file?

    In the meantime, we could start by collecting all the dictionaries we want?

  6. #5
    Freshman
    Join Date
    Dec 2007
    Posts
    32
    Thanks
    3
    Thanked 1 Time in 1 Post
    All words are alphabetized, so it's not as easy as just appending to it. The other thing is that they seem to use some kind of system to decide when words are displayed. It might be an ordering system, because when I open the file in Notepad++ I get a lot of special characters between the words. I don't know programming at this level, but someone else might know what these mean:
    Code:
    
    They are displayed as black squares with the labels ACK, BS, SS, FF, ENQ etc.
    So the problem seems to be how to incorporate this system into our own dictionaries.

  7. #6
    Member
    Join Date
    Apr 2009
    Posts
    230
    Thanks
    18
    Thanked 26 Times in 19 Posts
    Yes! The en_1 file has the same format... it also looks encrypted.

    I tried to find the dictionary file of the TouchPal keyboard, since that one is very good and covers a broad range of words, but my guess is that it is hidden in a *.iso file,
    and I couldn't open it...

  8. #7
    Freshman
    Join Date
    Dec 2007
    Posts
    32
    Thanks
    3
    Thanked 1 Time in 1 Post
    In my dictionary files, en1 and en_1 are exactly the same, while en2 has all my added words. None of my added words are in either en_1 or en1. This means it should be quite easy to append after all, as the en2 file is empty to begin with. Now the only problem is working out the codes; I will have a quick look at it.



    I did some testing. I grabbed a Norwegian dictionary from OpenOffice. The words were separated by \'s, so I replaced them all with one of the codes from the Meizu dictionary (BEL, shown as a black dot). Then I removed all the MySpell codes, as there was no documentation (that I could find) on what they mean and how to use them. Luckily they were all in caps, so I removed all capital letters and was left with a long list of words separated by black dots. I saved it and copied it to the Meizu.

    Results: works partially.
    1. Some words are recognized (I believe; they could be English equivalents though), others appear as groups of words in the
    word suggester.
    2. There is a lag of about 1 second each time I press a letter.

    Now this is of course because I only used one of the codes. If we could find the documentation for both MySpell and Meizu, or find an English dictionary in MySpell format, we could compare the Meizu layout with the MySpell one, and we might be able to replace the coding in both and get a real dictionary. Another solution is to find some other source than MySpell, but either way we need to understand the coding in the files.
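
    Deleting every capital letter also damages words that are capitalized themselves. A gentler way to drop the MySpell codes is sketched below in Python, under the assumption that the file follows the usual .dic layout: an optional word count on the first line, then one entry per line with the flags after a slash. The function name words_from_dic is hypothetical, and the Norwegian file used here may be laid out differently.

```python
def words_from_dic(lines):
    """Return the bare words from MySpell-style .dic lines."""
    words = []
    for index, line in enumerate(lines):
        entry = line.strip()
        if not entry:
            continue
        if index == 0 and entry.isdigit():
            continue  # the first line usually holds the number of entries
        word = entry.split("/", 1)[0]  # cut off the flags after the slash
        if word:
            words.append(word)
    return words


print(words_from_dic(["3", "hus/AB", "bil", "Oslo/C"]))
```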

    EDIT 2: It seems that no words are recognized. I might try with a smaller word list and different codes. I will try with a scrambled list as well.

    SOH: new character

    EDIT 3: So this is interesting:
    http://www.cs.tut.fi/~jkorpela/chars/c0.html
    I had a vague idea that this was some programming thing. But how does it fit in with the dictionary usage?
    Last edited by stejni; 09-15-2009 at 12:51 PM.

  9. #8
    Member
    Join Date
    Jul 2009
    Location
    Germany
    Posts
    144
    Thanks
    23
    Thanked 9 Times in 7 Posts
    I've created two German dictionaries. I hope it will be possible to import them later.

    http://rapidshare.de/files/48348971/...y_big.txt.html
    The big one has 197,885 words.

    http://rapidshare.de/files/48348972/...small.txt.html
    The small one has 70,699 words.


    Sorry I can't contribute more.
    Last edited by Skoddi; 09-15-2009 at 01:33 PM.

  10. #9
    Valued Member
    Join Date
    Aug 2008
    Location
    Belgium
    Posts
    2,434
    Thanks
    127
    Thanked 189 Times in 130 Posts
    Great!
    I am happy that you all like the idea of localized dictionaries!

    stejni,
    you are on the right track, I guess; the characters must have a purpose, like a delimiter pattern or something similar.
    I am not familiar with these things, so I do not have a clue at the moment, but it seems obvious that we will need to find the logic behind the codes.

    lag of +- 1 second each time i press a letter
    Maybe (just guessing) your new dictionary is bigger than the old one?

    other type than MySpell
    Browse the web a bit; I am sure there is a community for your preferred language that maintains a word list that is used for import into Aspell/MySpell/...
    If not, you could have a look here: The Typethinker: Fun with aspell word lists
    I did not read it all, but it seems to be a tutorial on how to dump Aspell dictionaries (sorry if I am wrong).

    smaller wordlist
    Maybe you could empty the current files and then start with a few words (3-5); maybe that way you can break the code.

    Hmm, ASCII... it is ages since I used it and I cannot recall the basics anymore, but maybe this can put you on track:
    ASCII - Wikipedia, the free encyclopedia


    lilM8,
    you can (almost) always extract .iso files:
    search Google for "iso extractor".

    Skoddi, great!! I am adding it to the first post!

  11. #10
    Member
    Join Date
    Jul 2009
    Location
    Germany
    Posts
    144
    Thanks
    23
    Thanked 9 Times in 7 Posts
    I got the code! I think XD I will test it after writing this :D


    It is very basic. You have to count the letters of the word!

    Have a look at the table at
    ASCII - Wikipedia, the free encyclopedia

    First count the letters, then look up that number in the DEC column, and then look at the Abbr column. That is the character you have to write before the word in the dictionary! There are two exceptions (as far as I know), for words with 9 and 10 letters: for these words you have to insert 2 spaces.


    OK, sorry for my bad English, but I hope you understand what I am trying to say :P


    Now I will try it, but I am confident.


    Edit:

    It works, but the thing with the spaces doesn't work XD because that is not an ASCII code XD I will look into that.


    ---------------------------

    OK, everything works.

    Instead of using Notepad++ and ASCII codes, I now use a hex editor (XVI32, freeware) and write the matching hex code (01, 02, 03, 04, 05, 06, 07, 08, 09, 0A etc.) before the word. The table contains those too.
    Last edited by Skoddi; 09-15-2009 at 11:08 PM.



  13. #11
    Valued Member
    Join Date
    Aug 2008
    Location
    Belgium
    Posts
    2,434
    Thanks
    127
    Thanked 189 Times in 130 Posts
    Skoddi, you are the man

    So if I understand you correctly:
    example word "hello"
    text editor
    5 letters in the word => DEC = 5 => Abbr = ENQ
    this makes: "enq hello" as input
    hex editor
    5 letters in the word => DEC = 5 => HEX = 05
    this makes: "05 hello" as input

    The next issue will be applying this to a huge word list.
    For just a few words it is no big deal, but for a dictionary like the one I found, with 222,872 Dutch words, there is no way to do this one by one.

    Hmm, I am thinking of making a script that counts the letters of each word in a file and then places the hex code before it.
    But I am not sure if that would work in a .bat file, and using a PHP script requires a web server.

    Edit:
    If you manage to find a workaround for the 9-10 letter words, I will add the how-to to the first post ;-)
    Last edited by evow04; 09-16-2009 at 08:10 AM.

  14. #12
    Member
    Join Date
    Jul 2009
    Location
    Germany
    Posts
    144
    Thanks
    23
    Thanked 9 Times in 7 Posts
    Quote Originally Posted by evow04 View Post
    Skoddi, you are the man

    SO if i understand you correct :
    example word "hello"
    text editor
    5 digits in the word => DEC = 5 => ABR = ENQ
    this makes: "enq hello" as input
    hexeditor
    5 digits in the word => DEC = 5 => HEX = 05
    this makes: "05 hello" as input

    edit :
    if you managed to find the work around for the 9-10 letter words i will add the how to in the first post ;-)
    In a text editor (Notepad++) you have to use ASCII codes, so:

    5 letters in the word => DEC = 5 => (Abbr = ENQ) OCT = 005
    this makes: "(hold ALT and type 005 on the numeric keypad)hello" as input

    Hex editor

    You have to find the hex value (often 0A) in front of the hex code of "hello" and change it, in this case to 05.

    Example:
    before: 0A 68 65 6C 6C 6F
    change to: 05 68 65 6C 6C 6F
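
    The byte sequence in this example can be reproduced with a short Python sketch, assuming one length byte in front of the word and single-byte Latin-1 letters; prefixed_hex is a name made up for the illustration.

```python
def prefixed_hex(word, encoding="latin-1"):
    """Return the length-prefixed record for a word as a hex string."""
    payload = word.encode(encoding)
    record = bytes([len(payload)]) + payload  # length byte, then the letters
    return " ".join("%02X" % b for b in record)


print(prefixed_hex("hello"))  # 05 68 65 6C 6C 6F
```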
    -----------------------------------

    I found that there are more of these tricky higher numbers, so I would use the hex editor... I tried to make a script but failed XD I have a plan, but I don't know how to implement it with the script commands of the hex editor.
    Last edited by Skoddi; 09-16-2009 at 08:30 AM.

  15. #13
    Valued Member
    Join Date
    Mar 2009
    Location
    Vienna, Austria
    Posts
    1,706
    Thanks
    54
    Thanked 269 Times in 146 Posts
    Just tried this, but it failed...

    I exchanged HEX 61 (a) with E4 (ä).
    After a restart, "a" is gone from the dictionary...

    That's not the correct way, right?!

  16. #14
    Member
    Join Date
    Jul 2009
    Location
    Germany
    Posts
    144
    Thanks
    23
    Thanked 9 Times in 7 Posts
    Quote Originally Posted by rori View Post
    just tried this, but failed...

    exchanged HEX 61 (a) with E4 (ä)
    after restart, "a" is gone in the dictionary...

    thats not the correct way,right?!

    Since you are from Austria:

    You have to enter the matching code BEFORE the letter you want to have in the dictionary. For a single letter it is 01, so the hex editor should show 01E4. I have no idea whether it really works with single letters, though ^^" I am working on the German dictionary right now... it could take until tomorrow, and I hope it will work then XD
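
    The 01E4 record only comes out right if the letters are stored as single bytes. The Python sketch below shows the difference, assuming the dictionary uses Latin-1 (where ä is E4, as in the posts above) and not UTF-8; the function name record_hex is hypothetical.

```python
def record_hex(word, encoding):
    """Length-prefixed record for a word in the given encoding, as hex."""
    payload = word.encode(encoding)
    record = bytes([len(payload)]) + payload
    return " ".join("%02X" % b for b in record)


print(record_hex("\u00e4", "latin-1"))  # 01 E4
print(record_hex("\u00e4", "utf-8"))    # 02 C3 A4
```

    A word list saved as UTF-8 would therefore produce different prefixes and bytes for every word that contains ä, ö, ü or ß.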

    PLEASE KEEP THE MAIN FORUM IN ENGLISH! READ THE RULES!
    - nice2know_u

  17. #15
    Valued Member
    Join Date
    Mar 2009
    Location
    Vienna, Austria
    Posts
    1,706
    Thanks
    54
    Thanked 269 Times in 146 Posts
    Quote Originally Posted by Skoddi View Post
    Since you are from Austria:

    You have to enter the matching code BEFORE the letter you want to have in the dictionary. For a single letter it is 01, so the hex editor should show 01E4. I have no idea whether it really works with single letters, though ^^" I am working on the German dictionary right now... it could take until tomorrow, and I hope it will work then XD
    We still have to post here in English.
    I mean this:


    I just exchanged the hex value... but then the "a" is missing from the dictionary on the device!

  18. #16
    Member
    Join Date
    Jul 2009
    Location
    Germany
    Posts
    144
    Thanks
    23
    Thanked 9 Times in 7 Posts
    Yeah :D

    I think it's because ä isn't an a, and like I said, I don't know if it works with a single letter. Wait for the keyboard from crimson.

  19. #17
    Member
    Join Date
    Jul 2009
    Location
    Germany
    Posts
    144
    Thanks
    23
    Thanked 9 Times in 7 Posts
    I've finished the German dictionary, but it is too big for the Meizu; it is extremely slow.


    Edit: and most letters don't work. I think it's because they have too many words ^^"


    Edit:
    I took a smaller one XD

    And here it is, the German dictionary:

    http://rapidshare.de/files/48360947/en2.html
    Last edited by Skoddi; 09-17-2009 at 02:17 AM.

  20. #18
    Valued Member
    Join Date
    Aug 2008
    Location
    Belgium
    Posts
    2,434
    Thanks
    127
    Thanked 189 Times in 130 Posts
    Skoddi,
    how large was the dictionary you had?
    Maybe it would be an idea to split it in 2:
    => one part in the default dictionary
    => one part in the custom dictionary?

    Did you manage to find a workaround for the 9-10 letter words?

    most letters dont work
    Could you explain some more? I think I did not understand that one.

    Could you tell me how you added the prefix code for every word in the file?
    Did you do this manually or...?

    exchanged HEX 61 (a) with E4 (ä)
    This could be an issue like the one mentioned in the "Modding PLUM Keyboard"
    topic: the new test firmware could be missing (support for) some charsets.

  21. #19
    Member
    Join Date
    Jul 2009
    Location
    Germany
    Posts
    144
    Thanks
    23
    Thanked 9 Times in 7 Posts
    Quote Originally Posted by evow04 View Post
    Skoddi,
    how large was the dictionary you had?
    Maybe it could be an idea to split it in 2
    => one part in default dictionary
    => one part in custom dictionary?

    Did you manage to find a work around for the 9-10 letter words?


    could you explain some more, i think i did not understand that one.

    Could you inform me how you added the sign key for every word in the file?
    did you do this manually or...?


    could be an issue like mentioned in the ""Modding PLUM Keyboard""
    topic : the new test firmware could be missing some (support for) charsets.
    Nearly 900 KB, and the phone was very slow, but it could be that there is an error in the file... I will look at it again when I find time.

    @split
    In theory yes, but there is one problem: even the new small dictionary doesn't work as en1 or en_1. These files have some strange code at the beginning that I don't understand yet.

    @9/10 etc. words

    Like I said, when you work with a hex editor this is no problem.


    @howto

    Half manually, half automated:

    1. Open the txt file with a hex editor (XVI32).
    2. Check which hex codes sit between the words; in my text file it was 0D 0A.
    3. Search for 0D 0A and replace it with
    81 00 00 00 [... how many 00 depends on the length of the longest word] 00 82
    (I chose these values because they did not occur in the file before.)
    You need this padding between the words because you will use wildcards, and it guarantees that a search will not match two words as one.
    4. Open XVIscript and write

    ADR 0
    JOKERON FF
    FIND 82 FF FF 81
    REPLACE 82 BY 02

    First line: makes the script start at the top of the file
    Second line: defines the wildcard
    Third line: searches for words with 2 letters (each FF is one letter; for longer words you have to insert more FF)
    Last line: replaces the hex value before the word with the correct prefix


    After that, copy the lines
    FIND 82 FF FF 81
    REPLACE 82 BY 02

    and paste them repeatedly with CTRL+V (you can hold the keys down).

    It should look like this:
    ADR 0
    JOKERON FF
    FIND 82 FF FF 81
    REPLACE 82 BY 02
    FIND 82 FF FF 81
    REPLACE 82 BY 02
    [1000 and more times]
    FIND 82 FF FF 81
    REPLACE 82 BY 02


    Then close the script (when it asks whether to save, say no).

    Then press F9. When it finishes without an error message, press it again, until you get the error message that the script could not find the 82 FF FF 81 line.

    Save the file.

    Exit the hex editor.
    Start the hex editor again and reopen the file.
    Then go back to the script and search for words with three letters, and so on:


    ADR 0
    JOKERON FF
    FIND 82 FF FF FF 81
    REPLACE 82 BY 03
    [...]
    -------

    word length 1-9 = prefix 01 - 09
    word length 10-15 = prefix 0A - 0F
    word length 16-25 = prefix 10 - 19
    word length 26-31 = prefix 1A - 1F
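
    Pasting the FIND/REPLACE pair by hand for every word length is tedious, so the script text itself can be generated. This is a rough Python sketch that only uses the commands shown above (ADR, JOKERON, FIND, REPLACE ... BY); the function name xvi_script, the repeat count and the limit of 31 letters are assumptions taken from this how-to, not from the XVI32 documentation.

```python
def xvi_script(length, repeats=1000):
    """Build the script text that prefixes all words of one length."""
    if not 1 <= length <= 31:
        raise ValueError("the how-to only lists prefixes for lengths 1-31")
    # One FF wildcard per letter, framed by the 82 and 81 marker bytes.
    find = "FIND 82 " + " ".join(["FF"] * length) + " 81"
    replace = "REPLACE 82 BY %02X" % length
    lines = ["ADR 0", "JOKERON FF"]
    for _ in range(repeats):
        lines.append(find)
        lines.append(replace)
    return "\n".join(lines) + "\n"


print(xvi_script(3, repeats=2))
```

    The output for each length can be pasted into the XVIscript window in place of the hand-copied lines.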

    -----------------------------------

    @ the a/ä problem: it has nothing to do with a missing charset, because words containing ä don't have problems... It's because a single ö, ä or ü is handled like a word, and without these characters on the keyboard you cannot use them as a starting letter.
    Last edited by Skoddi; 09-17-2009 at 11:05 AM.

  22. #20
    Moderator
    Join Date
    Mar 2008
    Location
    Germany
    Posts
    1,680
    Thanks
    89
    Thanked 356 Times in 139 Posts
    I wrote a short script to create the dictionaries. Here are my results:

    MeizuMe.com_Input_Dic_Dutch.cab

    MeizuMe.com_Input_Dic_French.cab

    MeizuMe.com_Input_Dic_German.cab

    MeizuMe.com_Input_Dic_Hungarian.cab

    They come with English included.

    All have 10,000 words, except Hungarian.

    The words include üöäß etc., but you cannot enter a new word with these special characters.
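
    Since a dictionary that is too large makes the phone slow, the word list has to be cut down before it is converted. The post does not say how the 10,000 words were chosen, so the Python sketch below is only one possible approach: it lowercases the words, drops duplicates, keeps the first entries of the input and sorts them. The function name prepare_words and the default limit are assumptions.

```python
def prepare_words(words, limit=10000):
    """Lowercase, de-duplicate and cap a word list, then sort it."""
    seen = set()
    unique = []
    for word in words:
        word = word.strip().lower()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    # Keep the first `limit` words in input order, then sort alphabetically.
    return sorted(unique[:limit])


print(prepare_words(["Haus", "haus", "Baum", "", "Auto"], limit=2))
```

    If the source list is ordered by word frequency, keeping the first entries keeps the most common words; with an alphabetical source list a different selection would be needed.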


    Here is my Pascal script (without exception handling etc.):

    Code:
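    program dict;
    
    { Converts a plain word list (one word per line) into the M8 format: }
    { each word is written as one length byte followed by its letters.   }
    
    var
      workfile: Textfile;
      resultfile: Textfile;
      filename: string;
      tmp: string;
    
    begin
      writeln('Please enter File-Path:');
      readln(filename);
      assign(workfile, filename);
      assign(resultfile, 'C:\Temp\en2');
      reset(workfile);
      rewrite(resultfile);
      while not(eof(workfile)) do
      begin
        readln(workfile, tmp);
        tmp := lowercase(tmp);
        { skip blank lines so that no zero-length entry is written }
        if length(tmp) > 0 then
        begin
          writeln(tmp);
          { one length byte, directly followed by the word }
          write(resultfile, chr(length(tmp)) + tmp);
        end;
      end;
      close(workfile);
      close(resultfile);
      writeln('All done!');
      readln();
    end.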
    Last edited by crimson05; 09-18-2009 at 04:52 PM.



 
