Discussion Board

Results 1 to 14 of 14
  1. #1
    Registered User
    Join Date
    Feb 2004
    Posts
    29

    Reading a multilingual file - A DETAILED SOLUTION + a Hebrew example

    I've been looking around the archives and couldn't find a coherent solution for loading multilingual plain-text files (files that aren't in English). So I decided to look into it and write one.

    What we need to do is encode the file in UTF-8. Since writing the file in UTF-8 in the first place is much harder than writing it in your favorite text editor (mine is UEDIT, by the way) and encoding it afterwards, I will go over the second approach.

    After you're done writing the file, convert it to UTF-8 using the excellent encoding converter SimRedo (http://www4.vc-net.ne.jp/~klivo/sim/simeng.htm). How? While loading the file, choose the encoding it was written in and press "Convert" (mine was a Hebrew file written on Windows XP, so the encoding was WINDOWS-1255). This does not convert or change the file in any way; it just tells the application what the ORIGINAL encoding was. Now save the file, and while doing so choose UTF-8 and press "Convert" again. Great - now we have a plain-text file encoded in UTF-8.
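
    If you would rather script the conversion than click through a GUI tool, the same two steps (read the bytes with the original encoding, write the text back as UTF-8) can be sketched in desktop Java. This is only a sketch under assumptions: the class name Transcoder and the method transcode are made up for illustration, and it only works if your desktop JRE knows the source charset (for a Hebrew Windows file that would be "windows-1255").

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper, meant to run on a desktop JRE (not on the phone).
final class Transcoder {
    // Reads text encoded in sourceCharset from in and writes the same text to out as UTF-8.
    static void transcode(InputStream in, String sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer, 0, buffer.length)) > -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push any buffered characters through the encoder
    }
}
```

    A call could look like Transcoder.transcode(new FileInputStream("hebfile.txt"), "windows-1255", new FileOutputStream("hebfile-utf8.txt")); the file names are just examples.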

    In order to load it, we need to tell the device that we're dealing with a file encoded in UTF-8. The following source loads a UTF-8 file and also takes care of the different CR/LF conventions on different OSes (I'm not sure that's even an issue in the mobile world, but it's a good habit):

    // requires: import java.io.InputStream; import java.io.InputStreamReader;
    InputStream is = getClass().getResourceAsStream("/lib/hebfile.txt");
    InputStreamReader reader = new InputStreamReader(is, "UTF-8"); // notice the encoding
    StringBuffer sb = new StringBuffer();
    int c;
    while ((c = reader.read()) > -1) {
        // '\n' or '\r' marks the end of a line
        if (c == '\n' || c == '\r') {
            // skip the empty string between '\r' and '\n' (and blank lines)
            if (sb.length() > 0) {
                // do something with the decoded line
                System.out.println(sb.toString().trim());
            }
            sb = new StringBuffer();
        } else {
            sb.append((char) c);
        }
    }
    // get the last line (which might not end with '\n')
    if (sb.length() > 0) { System.out.println(sb.toString().trim()); }
    reader.close();

    Depending on your font settings, you might still get gibberish on your emulator. A telltale sign is lots of "XXxxXXx" or seeing nothing but exclamation marks.

    I hope this helps someone, and judging by some archived forum questions, I guess it will.

    Good luck

    dd+

  2. #2
    Registered User
    Join Date
    Mar 2005
    Posts
    1
    Thank you. It helped a lot.

  3. #3
    Registered User
    Join Date
    Nov 2005
    Posts
    1

    Question Re: Reading a multilingual file - A DETAILED SOLUTION + a Hebrew example

    I got Hebrew text loaded from a UTF-8 encoded file. While all consonants are displayed, the vowels just appear as small squares.
    Is it possible to display vowels on Series 40 (6230) mobile phones? Can/must I install a different font? On my J2ME development kit from Sun I can see the correct characters.

    Thanks in advance,
    Eicke

  4. #4
    Regular Contributor
    Join Date
    Dec 2006
    Posts
    95

    Re: Reading a multilingual file - A DETAILED SOLUTION + a Hebrew example

    Hi,
    I'm reading a stream from the network.
    I've used Hebrew characters in my stream.
    When I read it, it appears as:

    =?utf-8?q?=D7=98=D7=A1=D7=98_=D7=91=D7=A2=D7=99=D7=91=D7=A8=D7=99=D7=AA?=

    How can it be converted?
    Thanks.

  5. #5
    Registered User
    Join Date
    Mar 2003
    Posts
    4,105

  6. #6
    Regular Contributor
    Join Date
    Dec 2006
    Posts
    95

    Re: Reading a multilingual file - A DETAILED SOLUTION + a Hebrew example

    Hi,
    I've tried to convert the Quoted-Printable text to a byte array and then convert that back to a String.
    But I still get this unreadable format.
    How can I convert it into a readable string?
    Do I need to remove the '=' before converting?
    Thanks,
    Eyal.

  7. #7
    Registered User
    Join Date
    Mar 2003
    Posts
    4,105
    I think RFC 2047 is more helpful than my previous link.

  8. #8
    Regular Contributor
    Join Date
    Dec 2006
    Posts
    95

    Re: Reading a multilingual file - A DETAILED SOLUTION + a Hebrew example

    I still have a problem reading non-ASCII characters from the network.
    Has anybody here had and solved this problem?
    Thanks,
    Eyal.

  9. #9
    Registered User
    Join Date
    Mar 2003
    Posts
    4,105
    Your example transcoded: טסט בעיברית (I do not know what it means; the display direction could be wrong)
    Were you able to solve the issue with the help of the mentioned RFC 2047, or are you referring to a new problem? If you have problems translating that RFC into Java, could you please tell us how far you got? If you have a new issue: as there are zillions of transcodings into ASCII, please ask the person who supplies your data first. If that source is not available (an external source not under your control), you should give us an example and tell us in which context it is used. Your example above might come from email messages (just a guess); consequently, the MIME RFCs are the place to start looking. There is no single definite rule for every data format. Many data formats have their own tricks.

  10. #10
    Regular Contributor
    Join Date
    Dec 2006
    Posts
    95

    Re: Reading a multilingual file - A DETAILED SOLUTION + a Hebrew example

    Hi,
    I have a problem translating RFC 2047 into Java.
    How can it be done? Do you have a code sample that does so?
    Thanks,
    Eyal.

  11. #11
    Registered User
    Join Date
    Mar 2003
    Posts
    4,105
    I cannot share an example. Which statement of chapter 2 of that RFC creates problems for you?

  12. #12
    Regular Contributor
    Join Date
    Dec 2006
    Posts
    95

    Re: Reading a multilingual file - A DETAILED SOLUTION + a Hebrew example

    How do I convert the =98 to an ASCII code?
    Does it need to be converted to ASCII?
    I have tried to remove the "=" sign and convert it like this:

    int firstDigit = 9; // first hex digit of =98
    int secondDigit = 8; // second hex digit of =98
    byte b = (byte) (firstDigit * 16 + secondDigit);
    byteArray[i] = b;
    String str = new String(byteArray);

    But, I still have problems.
    Thanks,
    Eyal.

  13. #13
    Registered User
    Join Date
    Mar 2003
    Posts
    4,105
    =98 is not a char and not a String; it has to become a byte. Several bytes make up one UTF-8 char, and all chars together build the String.

    Where do you get that whole data from? I bet it was a byte array once. Otherwise use getBytes("US-ASCII") on that String and then work on bytes. Try to keep it as a byte array until you have found out its encoding, then do the Quoted-Printable translation†, and only then create a new String from this byte array with the now-known encoding:
    new String(de-encoded-text, charset)

    † As you are writing an email client, there should be such a translation function somewhere in your library already; otherwise borrow it from another library or implement it via (byte) Integer.parseInt("98", 16). Byte.parseByte("98", 16) does not work here, because 0x98 is larger than Byte.MAX_VALUE and raises a NumberFormatException.
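
    To make the order of the steps concrete (text to bytes first, bytes to String last), here is a minimal sketch. It assumes the part between "=?utf-8?q?" and "?=" has already been cut out of the header, and the class QDecoder with its methods decodeQ and decode are hypothetical names, not from any library. The hex pair is converted digit by digit with Character.digit, so values above 7F such as 98 or D7 are no problem.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper: decodes the text part of a "Q"-encoded word (RFC 2047).
final class QDecoder {
    // Turns e.g. "=D7=98_a" into raw bytes: "=XX" is one byte, "_" is a space.
    static byte[] decodeQ(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '=' && i + 2 < text.length()) {
                int hi = Character.digit(text.charAt(i + 1), 16);
                int lo = Character.digit(text.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write(hi * 16 + lo); // "=98" becomes the single byte 0x98
                    i += 3;
                    continue;
                }
            }
            // Everything else is plain ASCII; a malformed "=" is copied as-is.
            out.write(c == '_' ? ' ' : c);
            i++;
        }
        return out.toByteArray();
    }

    // Only now, with all bytes collected, is the character set applied.
    static String decode(String text, String charset) throws UnsupportedEncodingException {
        return new String(decodeQ(text), charset);
    }
}
```

    For the example from this thread, QDecoder.decode("=D7=98=D7=A1=D7=98_=D7=91=D7=A2=D7=99=D7=91=D7=A8=D7=99=D7=AA", "UTF-8") should give the eleven-character Hebrew string, provided the implementation knows UTF-8.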
    Last edited by traud; 2007-12-05 at 09:45.

  14. #14
    Registered User
    Join Date
    Mar 2003
    Posts
    4,105
    If you need some examples for inspiration on how to write a Quoted-Printable decoder:
    MujMail > Source > Decode.java > decodeHeaderField
    JavaMail
    That said, I do not like either of these examples. If I am not mistaken, in JavaMail the QDecoderStream should call its superclass QPDecoderStream after filtering the underscore, rather than doing the rest itself (the two classes contain different code for the same job). In MujMail the data is not handled as binary in a first pass with the character set applied in a second pass; instead the two steps are mixed for UTF-8.

    After decoding Quoted-Printable, the data is binary; only then apply its character encoding. Your example should look like:
    Code:
    String enc = "utf-8"; // should be created dynamically from between =? and ?
    byte[] bytes = { (byte) 0xD7, (byte) 0x98, (byte) 0xD7, (byte) 0xA1, (byte) 0xD7, (byte) 0x98,
        0x20 /* underscore gets a space */,
        (byte) 0xD7, (byte) 0x91, (byte) 0xD7, (byte) 0xA2, (byte) 0xD7, (byte) 0x99, (byte) 0xD7, (byte) 0x91,
        (byte) 0xD7, (byte) 0xA8, (byte) 0xD7, (byte) 0x99, (byte) 0xD7, (byte) 0xAA };
    String s = new String(bytes, enc);
    However, this might throw an UnsupportedEncodingException if the encoding is not known to your MIDP implementation. Consequently, you should add encoding mappings† for unknown (but important) encodings yourself. If everything fails, use the platform encoding or, better, a Windows-Latin-1 to UTF-16BE mapping: Windows-Latin-1 is a superset of ISO-8859-1, which in turn is a superset of US-ASCII, and because of their importance that makes it the best character set for a fallback, in my opinion. Furthermore, your device might not display all characters correctly because a glyph is missing. You cannot do much about that except draw your own font or use substitutes for special characters (see the Unicode tables for which ones to use). However, first make sure your implementation is correct, for example with the Euro sign, which is quite a good test of whether everything works.
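
    The fallback idea can be sketched roughly like this. Note the simplification: instead of a full Windows-Latin-1 table, this sketch falls back to plain ISO-8859-1, which needs no table at all because every byte maps to the char with the same value. The class name CharsetFallback and the method decode are made up for the example.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper: decode with the declared charset, else fall back to ISO-8859-1.
final class CharsetFallback {
    static String decode(byte[] data, String charset) {
        try {
            return new String(data, charset);
        } catch (UnsupportedEncodingException e) {
            // The implementation does not know this charset; use the fallback below.
        }
        // Assumption of this sketch: ISO-8859-1 as fallback, done by hand so it
        // works even where the platform ships very few encodings.
        StringBuffer sb = new StringBuffer(data.length);
        for (int i = 0; i < data.length; i++) {
            sb.append((char) (data[i] & 0xFF)); // mask to get the unsigned byte value
        }
        return sb.toString();
    }
}
```

    The Euro sign test mentioned above would be the three UTF-8 bytes E2 82 AC, which should come out as the single char U+20AC when the charset is "UTF-8".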

    Any further problems? Which problem exactly do you face then? These two RFCs look complicated at first glance. Once you understand the above, make sure to read and understand these RFCs completely, as the example I gave is not complete (there could be more than one Quoted-Printable stream per line, different character encodings, …; see the BNF grammars in those RFCs). Happy coding. That is a lot of work; try to partition your work and prioritize the separate parts.
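
    As an illustration of the "more than one per line" case, here is a rough sketch that only finds the encoded words in a header line and splits each into charset, encoding and text; decoding the text is a separate step. It follows the =?charset?encoding?text?= shape from RFC 2047 but ignores many details of the grammar, and EncodedWords and findAll are hypothetical names. It uses java.util.List for brevity; on CLDC you would need Vector instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: locates RFC 2047 encoded words in one header line.
final class EncodedWords {
    // Returns one {charset, encoding, text} triple per encoded word found.
    static List<String[]> findAll(String line) {
        List<String[]> words = new ArrayList<String[]>();
        int pos = 0;
        while (true) {
            int start = line.indexOf("=?", pos);
            if (start < 0) break;
            int q1 = line.indexOf('?', start + 2);                 // ends the charset
            int q2 = (q1 < 0) ? -1 : line.indexOf('?', q1 + 1);    // ends the encoding
            // Search from q2 + 1 so that "?=" formed by the separator and a
            // text starting with "=" (as in "?q?=D7") is not mistaken for the end.
            int end = (q2 < 0) ? -1 : line.indexOf("?=", q2 + 1);  // ends the text
            if (end < 0) break; // no complete encoded word left
            words.add(new String[] {
                line.substring(start + 2, q1),
                line.substring(q1 + 1, q2),
                line.substring(q2 + 1, end)
            });
            pos = end + 2;
        }
        return words;
    }
}
```

    Each triple can carry its own character set, so decode them one by one and concatenate the results; RFC 2047 also says that white space between two adjacent encoded words is dropped when displaying.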

    † These tables can easily be added by placing the Unicode.org mappings into your JAR; load them via Class.getResourceAsStream, parse them, and use them to convert to UTF-16BE. The encoding names and their aliases can be imported from the IANA list. For more details on character sets… Last but not least, do not use all that stuff for sending an email. I recommend limiting it to a few. Start with the platform encoding and change that to its preferred MIME name, then use UTF-8, then use, depending on content, either UTF-8, ISO-8859-1 or US-ASCII. I would not go for much more.
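
    A rough sketch of such a table parser follows. It assumes the three-column, tab-separated layout used by the single-byte Unicode.org mapping files (byte as 0xXX, Unicode as 0xXXXX, then a # comment), takes the file content as a String for simplicity, and uses the made-up names MappingTable, parse and decode. Bytes without a mapping become U+FFFD.

```java
// Hypothetical helper: builds a byte-to-char table from a single-byte mapping file.
final class MappingTable {
    static final char UNDEFINED = '\uFFFD'; // Unicode replacement character

    // text is the whole mapping file, already read into a String.
    static char[] parse(String text) {
        char[] table = new char[256];
        for (int i = 0; i < 256; i++) table[i] = UNDEFINED;
        int pos = 0;
        while (pos < text.length()) {
            int eol = text.indexOf('\n', pos);
            if (eol < 0) eol = text.length();
            parseLine(text.substring(pos, eol), table);
            pos = eol + 1;
        }
        return table;
    }

    private static void parseLine(String line, char[] table) {
        int hash = line.indexOf('#');
        if (hash >= 0) line = line.substring(0, hash); // drop the comment column
        line = line.trim();
        int tab = line.indexOf('\t');
        if (tab < 0) return; // blank line, comment-only line or undefined entry
        String code = line.substring(0, tab).trim();
        String uni = line.substring(tab + 1).trim();
        if (!code.startsWith("0x") || !uni.startsWith("0x")) return;
        int b = Integer.parseInt(code.substring(2), 16);
        int u = Integer.parseInt(uni.substring(2), 16);
        if (b >= 0 && b < 256) table[b] = (char) u;
    }

    // Java chars are UTF-16 code units, so the table lookup is the whole conversion.
    static String decode(byte[] data, char[] table) {
        StringBuffer sb = new StringBuffer(data.length);
        for (int i = 0; i < data.length; i++) sb.append(table[data[i] & 0xFF]);
        return sb.toString();
    }
}
```

    On the device, the String passed to parse would come from a resource in the JAR loaded via Class.getResourceAsStream, as described above.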
    Last edited by traud; 2007-12-07 at 13:22.
