[project] Localized dictionary
-
09-15-2009 #1
Hi,
Since not all of us speak English as our main language,
it would be a good idea to be able to import a localized dictionary for easier text input.
In the firmware 0.9.3.3 topic, nice2know_u mentions that there are 2 important files:
en_1 -> Personal & custom dictionary
en1 -> Default English dictionary
In the same topic, AlexNesterov tells us that editing the files should not be much work:
-> add the extension "txt" to the file name(s)
-> open with a text editor and modify to your needs
-> save the changes and remove the extension "txt"
Now, to import an (almost) fully localized dictionary, you first need to find a word list, so that you can copy and paste the words instead of typing them yourself.
For almost any language there are open-source dictionaries such as:
Ispell, Aspell, MySpell, ...
If you look up the packages for the languages you want, you will find the community that maintains those files.
They mostly store the words in a hyphenated format like "aal·moe·ze·niers·ka·mers",
so you will need to edit the list a bit.
Example with the Dutch dictionary:
http://www.ntg.nl/spelling/latin1/woorden.max
This page contains 222,872 Dutch words that are also found in the OOo (OpenOffice.org) MySpell dictionary.
-> open an editor (for example Notepad++)
-> paste the whole word list into it
-> go to find and replace
-> under find, enter the break sign between the syllables: in the example it is "·"
-> under replace, enter nothing, so the parts of each word get joined
-> select replace all
Be a bit patient, it can take some time. When it is done, continue:
-> copy all the words, which are now in the correct format
-> open Notepad (it performs much better!!)
-> paste the whole list of words into it
-> save the file as xxx.txt somewhere, so you have a source/backup file
For our Dutch example this gives the file dutch_dictionary.txt
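The find-and-replace steps above can also be scripted. Here is a minimal Python sketch, assuming the word list is a Latin-1 text file with one word per line (the URL hints at latin1, but check your own file). The function names and file paths are made up for illustration.

```python
SEPARATOR = "\u00b7"  # the middle dot used as break sign in the Dutch example


def clean_words(lines, separator=SEPARATOR):
    """Remove the break sign from every line and drop empty lines."""
    cleaned = []
    for line in lines:
        word = line.strip().replace(separator, "")
        if word:
            cleaned.append(word)
    return cleaned


def clean_file(src_path, dst_path, encoding="latin-1"):
    """Read a hyphenated word list and write the joined words, one per line."""
    with open(src_path, encoding=encoding) as src:
        words = clean_words(src)
    # newline="\r\n" writes Windows line ends (0D 0A) between the words.
    with open(dst_path, "w", encoding=encoding, newline="\r\n") as dst:
        dst.write("\n".join(words) + "\n")
    return len(words)


if __name__ == "__main__":
    print(clean_words(["aal\u00b7moe\u00b7ze\u00b7niers\u00b7ka\u00b7mers\n", "\n"]))
```

Calling clean_file("woorden.max", "dutch_dictionary.txt") would then produce the backup file in one go; both file names are only examples.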
Now you will need to format the file correctly so that it is usable on the M8:
(thanks to Skoddi)
This can be done more automatically with a Pascal script; the manual method is:
1. First open the txt file with a hex editor (XVI32).
2. Check which hex codes sit between the words; in my txt file it was 0D 0A.
3. Search for 0D 0A and replace it with
81 00 00 00 [...how many 00 depends on how long the longest word is] 00 82
(I chose these values because they did not occur in the file before.)
You need this padding between the words because you will use wildcards, and it guarantees that a match never takes two words for one.
4. Open XVIscript and write
ADR 0
JOKERON FF
FIND 82 FF FF 81
REPLACE 82 BY 02
First line: makes the script start at the top of the file
Second line: defines the wildcard
Third line: searches for words with 2 letters (each FF stands for one letter; for longer words you have to insert more FF)
Last line: replaces the hex value before the word with the correct prefix
After that, copy the lines
FIND 82 FF FF 81
REPLACE 82 BY 02
and paste them with CTRL+V for a while (you can hold the keys down).
It should look like this:
ADR 0
JOKERON FF
FIND 82 FF FF 81
REPLACE 82 BY 02
FIND 82 FF FF 81
REPLACE 82 BY 02
[1000 and more times]
FIND 82 FF FF 81
REPLACE 82 BY 02
Then close the script editor (when it asks whether to save, say no).
Then hit F9; when it finishes without an error message, hit it again until you get the error message that the script couldn't find the 82 FF FF 81 line.
Save the file.
Exit the hex editor.
Start the hex editor and reopen the file.
Then go to the script again and now search for words with three letters, etc.
ADR 0
JOKERON FF
FIND 82 FF FF FF 81
REPLACE 82 BY 03
[...]
-------
word length 1-9 = prefix 01 - 09
word length 10-15 = prefix 0A - 0F
word length 16-25 = prefix 10 - 19
word length 26-31 = prefix 1A - 1F
-----------------------------------
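The whole hex-editor procedure boils down to putting one length byte in front of every word. Here is a minimal Python sketch of that idea, assuming one byte per character (Latin-1 is an assumption, based on ä showing up as E4 elsewhere in this thread) and a plain en2-style layout with no header. encode_word and build_dictionary are illustrative names, not part of any tool mentioned here.

```python
def encode_word(word, encoding="latin-1"):
    """Return one record: a single length byte followed by the word bytes."""
    data = word.strip().lower().encode(encoding)
    if not 1 <= len(data) <= 255:
        raise ValueError(f"cannot store {word!r}: length {len(data)} does not fit in one byte")
    return bytes([len(data)]) + data


def build_dictionary(words, encoding="latin-1"):
    """Concatenate the records of all non-empty words, with no separators."""
    return b"".join(encode_word(w, encoding) for w in words if w.strip())


if __name__ == "__main__":
    blob = build_dictionary(["hello", "at"])
    # Hex dump in the same style as a hex editor shows it.
    print(" ".join(f"{b:02X}" for b in blob))  # 05 68 65 6C 6C 6F 02 61 74
```

The sketch accepts lengths up to 255 because that is what fits in one byte; the table above only lists prefixes up to 1F, so whether the phone accepts longer words is untested.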
(thanks to Crimson05)
The script with instructions:
Code:
program dict;

var
  workfile: Textfile;
  resultfile: Textfile;
  filename: string;
  tmp: string;

begin
  writeln('Please enter File-Path:');
  readln(filename);
  assign(workfile, filename);
  assign(resultfile, 'C:\Temp\en2');
  reset(workfile);
  rewrite(resultfile);
  while not(eof(workfile)) do
  begin
    readln(workfile, tmp);
    tmp := lowercase(tmp);
    writeln(tmp);
    { write one length byte, then the word itself, without any separator }
    write(resultfile, char(length(tmp)) + tmp);
  end;
  close(workfile);
  close(resultfile);
  writeln('All done!');
  readln();
end.
Dutch Language: http://rapidshare.com/files/281899260/dic.zip
Again: no exception handling! If you do something wrong, the program will close with a runtime error.
How to use:
Enter the full path (e.g. C:\Temp\german.txt) to a text file with the format:
word1
word2
word3
en2 will be created in the program's folder (the script as posted above writes it to C:\Temp\en2)
Homepage wordlist: NTG Werkgroep Spelling
Source file cleaned words: dutch_dictionary.txt
Meizu M8 dictionary file: RapidShare: 1-CLICK Web hosting - Easy Filehosting
(thanks to Crimson05)
German Language:
Source file cleaned words 1: RapidShare Webhosting + Webspace
Source file cleaned words 2: RapidShare Webhosting + Webspace
(thanks to Skoddi)
Meizu M8 dictionary file: http://rapidshare.com/files/281847441/MeizuMe.com_Input_Dic_German.cab
(thanks to Crimson05)
French Language:
Meizu M8 dictionary file: http://rapidshare.com/files/28173828...Dic_French.cab
(thanks to Crimson05)
Hungarian Language:
Meizu M8 dictionary file: RapidShare: 1-CLICK Web hosting - Easy Filehosting
(thanks to Crimson05)
Last edited by evow04; 09-20-2009 at 06:15 PM.
-
-
09-15-2009 #2
Great idea!! Can't wait for the German one...
But I'm not really sure whether I'm going to use the default one or TouchPal. TouchPal actually seems to be the better one, and no one knows what will come with 1.0 and the new UI.
TouchPal's T9 is really powerful!!
-
09-15-2009 #3
I did as described and added the .txt extension to the en1 data file,
then opened it with a text editor on Windows.
But from the layout of the words and some ASCII control symbols, it is noticeable that this is not a plain text format.
So I think that to change the dictionary to other languages properly, we first have to work out the file format so we can edit the file correctly.
-
09-15-2009 #4
Lilm8:
True, that was the part I still needed to test, because I am not sure whether I can just add words to the list.
Question:
If you open the en_1 file, does it have the same format/layout as the en1 file?
In the meantime we could start by collecting all the wanted dictionaries.
-
09-15-2009 #5
All words are alphabetized, so it's not as easy to append to it. The other thing is that they seem to use some kind of system to identify when words are displayed. It might be the ordering system, because when I open the file in Notepad++ I get a lot of special characters between the words. I don't know programming at this level, but someone else might know what these mean.
They are displayed as black squares with labels such as ACK, BS, SS, FF, ENQ, etc.
So the problem seems to be how to incorporate this system into our own dictionaries.
-
09-15-2009 #6
Yes! The en_1 file has the same format... it also looks encrypted.
I tried to find the dictionary file of the TouchPal keyboard, since that one is very good and covers a broad range of words, but my guess is that it is hidden in a *.iso file,
and I couldn't open it...
-
09-15-2009 #7
In my dictionary files, en1 and en_1 are exactly the same, while en2 has all my added words. None of my added words are in either en_1 or en1. This means it should be quite easy to append after all, as the file (en2) is empty to begin with. Now the only problem is solving the codes; I will have a quick look at it.
I did some testing. I grabbed a Norwegian dictionary from OpenOffice. The words were separated by \'s, so I just replaced them all with one of the codes from the Meizu dictionary (BEL, or black dot). Then I removed all the MySpell codes, as there was no documentation (that I could find) on what they mean and how to use them. Luckily they were all in caps, so I removed all capital letters and was left with a looong list of words separated by black dots. Saved and copied to the Meizu.
Results: works partially.
Nr. 1: Some words are recognized (I believe; they could be English equivalents though), others appear as groups of words in the
word suggester.
Nr. 2: There is a lag of +- 1 second each time I press a letter.
Now this is of course because I only used one of the codes. If we could find the documentation for both MySpell and the Meizu format, or find an English dictionary in MySpell format, we could compare the Meizu layout with the MySpell one, and we might be able to replace the coding in both and have a real dictionary. Another solution is to find some type other than MySpell, but either way we need to know the coding in the files.
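Rather than deleting every capital letter, the MySpell flags can be cut off directly. Here is a sketch in Python, assuming the usual MySpell .dic layout: an optional first line holding the entry count, then one entry per line written as word or word/FLAGS. If your file really uses a backslash, pass that as the separator. words_from_myspell is just an illustrative name.

```python
def words_from_myspell(lines, flag_separator="/"):
    """Return the bare words from MySpell-style .dic lines."""
    words = []
    for index, line in enumerate(lines):
        entry = line.strip()
        if not entry:
            continue
        if index == 0 and entry.isdigit():
            continue  # first line of a .dic file is usually the entry count
        word = entry.split(flag_separator, 1)[0]  # drop everything after the separator
        if word:
            words.append(word)
    return words


if __name__ == "__main__":
    print(words_from_myspell(["3\n", "hus/AB\n", "bil\n", "b\u00e5t/C\n"]))
```

Unlike removing all capitals, this keeps words that legitimately contain upper-case letters, such as names.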
EDIT 2: Seems like no words are recognized. I might try with a smaller word list and different codes. Will try with a scrambled list as well.
SOH: new character
EDIT 3: So this is interesting:
http://www.cs.tut.fi/~jkorpela/chars/c0.html
I had a slight idea that this was some programming stuff. But how does this fit in with the dictionary usage?
Last edited by stejni; 09-15-2009 at 12:51 PM.
-
09-15-2009 #8
I've created two German dictionaries. I hope it will be possible to import them later.
http://rapidshare.de/files/48348971/...y_big.txt.html
The big one has 197885 words
http://rapidshare.de/files/48348972/...small.txt.html
The small one has 70699 words
Sorry I can't contribute more.
Last edited by Skoddi; 09-15-2009 at 01:33 PM.
-
09-15-2009 #9
Great!
I am happy that you all like the idea of localized dictionaries!
stejni,
you are on the right track, I guess; the characters must have a purpose, like a breakdown pattern or something similar.
I am not familiar with these things, so I do not have a clue at the moment, but it seems obvious that we will need to find the logic behind the codes.
Quote: "lag of +- 1 second each time i press a letter"
Maybe (just guessing) it could be that your new dictionary is bigger than the old one?
Quote: "other type than MySpell"
Browse the web a bit; I am sure there is a community for your preferred language that maintains a word index that is used to build the Aspell/MySpell/... dictionaries.
If not, you could have a look here: The Typethinker: Fun with aspell word lists
I did not read it all, but it seems to be a tutorial on how to dump aspell dictionaries (sorry if I am wrong).
Quote: "smaller wordlist"
Maybe you could empty the current files, then start with a few words (3-5); maybe that way you can break the code.
Quote: "So this is interesting; http://www.cs.tut.fi/~jkorpela/chars/c0.html"
Hmm, ASCII, it is ages since I used it and I can't recall the basics anymore, but maybe this could set you on track:
ASCII - Wikipedia, the free encyclopedia
lilM8,
you can (almost) always extract .iso files:
search Google for "iso extractor"
Skoddi, great!! I am adding it to the first post!
-
09-15-2009 #10
I got the code! I think XD I will test it after writing this :D
It is very basic. You have to count the letters of the word!
Have a look at the table at
ASCII - Wikipedia, the free encyclopedia
First count the letters, then find that number in the DEC column and look at the Abbr column. This is what you have to write before the word in the dictionary! There are two exceptions, for words with 9 and 10 letters (as far as I know): for these words you have to enter 2 spaces!
OK, sorry for my bad English, but I hope you understand what I am trying to say :P
Now I will try it, but I am confident.
Edit:
It works, but the thing with the spaces doesn't work XD because that is not an ASCII code XD I'll check on that.
---------------------------
OK, all works.
Instead of using Notepad++ and ASCII codes, I now use a hex editor (XVI32, freeware) and write the matching hex code (01, 02, 03, 04, 05, 06, 07, 08, 09, 0A etc.) before the word. The table contains these too.
Last edited by Skoddi; 09-15-2009 at 11:08 PM.
-
-
09-16-2009 #11
Skoddi, you are the man
So if I understand you correctly:
example word "hello"
text editor
5 letters in the word => DEC = 5 => Abbr = ENQ
this makes: "enq hello" as input
hex editor
5 letters in the word => DEC = 5 => HEX = 05
this makes: "05 hello" as input
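The lookup in the ASCII table can be sketched in a few lines of Python. The list below holds the standard abbreviations of the 32 ASCII control characters; prefix_info is an illustrative name. Note that 9 and 10 map to HT (tab) and LF (line feed), which a text editor shows as whitespace; that is probably why those two lengths looked like exceptions.

```python
# Standard abbreviations of the ASCII control characters 0..31.
C0_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def prefix_info(word):
    """Return (DEC, HEX, Abbr) of the prefix for a word, as read from the ASCII table."""
    length = len(word)
    abbr = C0_NAMES[length] if length < len(C0_NAMES) else None
    return length, f"{length:02X}", abbr


if __name__ == "__main__":
    for word in ("hello", "wonderful", "dictionary"):
        print(word, prefix_info(word))
```

For "hello" this gives (5, '05', 'ENQ'), matching the example above.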
The next issue will be applying this to a huge word list.
For just a few words it is no big deal, but for a dictionary like the one I found, with 222,872 Dutch words, there is no way to do this one by one.
Hmm, I am thinking of making a script that counts the letters of each word in a file and then places the hex code before it.
But I am not sure whether that would work in a .bat file, and using a PHP script requires a web server.
Edit:
If you manage to find the workaround for the 9-10 letter words, I will add the how-to to the first post ;-)
Last edited by evow04; 09-16-2009 at 08:10 AM.
-
09-16-2009 #12
In a text editor (Notepad++) you have to use ASCII codes, so:
5 letters in the word => DEC = 5 => (Abbr = ENQ) OCT = 005
this makes: "(hold ALT and type 005 on the numeric keypad)hello" as input
Hex editor:
You have to find the hex value (often 0A) in front of the hex code of hello and change it, in this case to 05
ex:
before: 0A 68 65 6C 6C 6F
change to: 05 68 65 6C 6C 6F
-----------------------------------
I found that there are more of these tricky high numbers, so I would use the hex editor... I tried to make a script but failed XD I have a plan, but I don't know how to do it with the script commands of the hex editor.
Last edited by Skoddi; 09-16-2009 at 08:30 AM.
-
09-16-2009 #13
Just tried this, but failed...
Exchanged HEX 61 (a) with E4 (ä).
After a restart, "a" is gone from the dictionary...
That's not the correct way, right?!
-
09-16-2009 #14
[Translated from German:]
Since you are from Austria:
You have to enter the corresponding code BEFORE the letter you want in the dictionary. For a single letter it is 01, so the hex editor should show 01E4.
I just have no idea whether it really works with single letters ^^" I'm working on the German dictionary right now... it could take until tomorrow, and then I hope it works too XD
PLEASE KEEP THE MAIN FORUM IN ENGLISH! READ THE RULES!
- nice2know_u
-
09-16-2009 #15
-
09-16-2009 #16
Yeah :D
I think it's because ä isn't an a, and like I said, I don't know if it works with a single letter. Wait for the keyboard from crimson.
-
09-16-2009 #17
I've finished the German dictionary, but it is too big for the Meizu;
it's extremely slow.
Edit: and most letters don't work. I think it's because these have too many words ^^"
Edit:
I took a smaller one XD
And here it is, the German dictionary:
http://rapidshare.de/files/48360947/en2.html
Last edited by Skoddi; 09-17-2009 at 02:17 AM.
-
09-17-2009 #18
Skoddi,
How large was the dictionary you had?
Maybe it would be an idea to split it in two:
=> one part in the default dictionary
=> one part in the custom dictionary?
Did you manage to find a workaround for the 9-10 letter words?
Quote: "most letters dont work"
Could you explain some more? I think I did not understand that one.
Could you tell me how you added the prefix code for every word in the file?
Did you do this manually or...?
Quote: "exchanged HEX 61 (a) with E4 (ä)"
This could be an issue like the one mentioned in the "Modding PLUM Keyboard" topic: the new test firmware could be missing (support for) some charsets.
-
09-17-2009 #19
Nearly 900 KB, and the phone was very slow, but it could be that there is an error in the file... I will look at it again when I find time.
@split
In theory yes, but there is one problem: even the new small dictionary doesn't work as en1 or en_1. These files have some strange code at the beginning that I don't understand yet.
@9/10 letter words
Like I said, when you work with a hex editor this is no problem.
@howto
Half manual, half automated:
1. First open the txt file with a hex editor (XVI32).
2. Check which hex codes sit between the words; in my txt file it was 0D 0A.
3. Search for 0D 0A and replace it with
81 00 00 00 [...how many 00 depends on how long the longest word is] 00 82
(I chose these values because they did not occur in the file before.)
You need this padding between the words because you will use wildcards, and it guarantees that a match never takes two words for one.
4. Open XVIscript and write
ADR 0
JOKERON FF
FIND 82 FF FF 81
REPLACE 82 BY 02
First line: makes the script start at the top of the file
Second line: defines the wildcard
Third line: searches for words with 2 letters (each FF stands for one letter; for longer words you have to insert more FF)
Last line: replaces the hex value before the word with the correct prefix
After that, copy the lines
FIND 82 FF FF 81
REPLACE 82 BY 02
and paste them with CTRL+V for a while (you can hold the keys down).
It should look like this:
ADR 0
JOKERON FF
FIND 82 FF FF 81
REPLACE 82 BY 02
FIND 82 FF FF 81
REPLACE 82 BY 02
[1000 and more times]
FIND 82 FF FF 81
REPLACE 82 BY 02
Then close the script editor (when it asks whether to save, say no).
Then hit F9; when it finishes without an error message, hit it again until you get the error message that the script couldn't find the 82 FF FF 81 line.
Save the file.
Exit the hex editor.
Start the hex editor and reopen the file.
Then go to the script again and now search for words with three letters, etc.
ADR 0
JOKERON FF
FIND 82 FF FF FF 81
REPLACE 82 BY 03
[...]
-------
word length 1-9 = prefix 01 - 09
word length 10-15 = prefix 0A - 0F
word length 16-25 = prefix 10 - 19
word length 26-31 = prefix 1A - 1F
-----------------------------------
@ the a/ä problem: it has nothing to do with a missing charset, because words that contain ä don't have problems. It's because a single ö, ä or ü is handled like a word, and without these characters on the keyboard you cannot use them as a start letter.
Last edited by Skoddi; 09-17-2009 at 11:05 AM.
-
09-18-2009 #20
I wrote a short script to create the dictionaries. Here are my results:
MeizuMe.com_Input_Dic_Dutch.cab
MeizuMe.com_Input_Dic_French.cab
MeizuMe.com_Input_Dic_German.cab
MeizuMe.com_Input_Dic_Hungarian.cab
It's with English.
All have 10000 words except Hungarian.
The words include ü, ö, ä, ß etc., but you cannot enter a new word with these special chars.
Here is my Pascal script (without exception handling etc.):
Code:
program dict;

var
  workfile: Textfile;
  resultfile: Textfile;
  filename: string;
  tmp: string;

begin
  writeln('Please enter File-Path:');
  readln(filename);
  assign(workfile, filename);
  assign(resultfile, 'C:\Temp\en2');
  reset(workfile);
  rewrite(resultfile);
  while not(eof(workfile)) do
  begin
    readln(workfile, tmp);
    tmp := lowercase(tmp);
    writeln(tmp);
    { write one length byte, then the word itself, without any separator }
    write(resultfile, char(length(tmp)) + tmp);
  end;
  close(workfile);
  close(resultfile);
  writeln('All done!');
  readln();
end.
Last edited by crimson05; 09-18-2009 at 04:52 PM.
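To check a generated file before copying it to the phone, the records can be read back. Here is a Python sketch, assuming the en2-style layout the script produces (a length byte plus the word, nothing else) and Latin-1 text. It is not meant for en1 or en_1, which according to post #19 start with some extra code. decode_dictionary is an illustrative name.

```python
def decode_dictionary(data, encoding="latin-1"):
    """Split length-prefixed records back into a list of words."""
    words = []
    pos = 0
    while pos < len(data):
        length = data[pos]          # the prefix byte is the word length
        start = pos + 1
        end = start + length
        if length == 0 or end > len(data):
            raise ValueError(f"bad record at offset {pos}")
        words.append(data[start:end].decode(encoding))
        pos = end
    return words


if __name__ == "__main__":
    print(decode_dictionary(b"\x05hello\x02at"))
```

If the list that comes back matches the input word list, the prefixes are at least consistent; whether the phone accepts the file still has to be tested on the device.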