Welcome to the Java Programming Forums


Thread: How to Properly Read UTF-8 From InputStream

  1. #1
    Member
    Join Date
    Apr 2014
    Posts
    93
    Thanks
    3
    Thanked 7 Times in 7 Posts

    How to Properly Read UTF-8 From InputStream

    Hi,

    Cross-posted because my other thread has no answers yet:

    https://stackoverflow.com/questions/...om-inputstream

    I keep getting escaped Unicode sequences (backslash + u) from the InputStream instead of the correct characters. Others with the same issue solved it by specifying UTF-8 in the InputStreamReader constructor:

    https://stackoverflow.com/questions/...tream-as-utf-8

    https://www.mkyong.com/java/how-to-r...m-a-file-java/

    This is not working for me and I don't know why. No matter what I try, I keep getting the escaped Unicode values (backslash-u plus four hexadecimal digits) instead of the actual characters. What am I doing wrong here? Thanks in advance!

    // The parameter "is" is a FileInputStream:
    public void load(InputStream is) throws Exception {
     
        BufferedReader br = null;
     
        try {
            // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me:
            br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
            String line = null;         
            while ((line = br.readLine()) != null) {
                // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
                System.out.println("got line: " + line);
            }
        } finally {
            if (br != null) {
                br.close();
            }
        }       
    }

    If I use a ResourceBundle against the same properties file, that gives me the correct Chinese characters as expected. So I know it's not a font issue, and I'm pretty sure it's not related to the encoding of the properties file. This line:

    new InputStreamReader(is, StandardCharsets.UTF_8)

    seems to fix it for everyone except me. I've also tried various other character sets ("GB2312", "GBK", "BIG5") with no change at all. I think I may be barking up the wrong tree at this point, but I just don't know where else to look for the cause. Thanks again!!
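
    One way to narrow this down is to check whether the decoded line contains a literal backslash. The sketch below is only an illustration: the class name EscapeCheck, the method firstLine, and the in-memory byte arrays are hypothetical stand-ins for the real file. It compares a line stored as ASCII escape text with the same line stored as real UTF-8 characters. If the backslash is present after decoding, the file holds escape text, and no choice of charset in InputStreamReader will change that.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class EscapeCheck {

    // Reads the first line from the given bytes, decoding them as UTF-8.
    public static String firstLine(byte[] data) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file content: the escapes are stored as plain ASCII text.
        // Each doubled backslash in this literal is one literal backslash at run time.
        byte[] escapedFile = "chinese = \\u4f60\\u597d".getBytes(StandardCharsets.US_ASCII);

        // Hypothetical file content: the two characters themselves, stored as UTF-8.
        // Here javac converts the escapes into the real characters at compile time.
        byte[] utf8File = "chinese = \u4f60\u597d".getBytes(StandardCharsets.UTF_8);

        String escapedLine = firstLine(escapedFile);
        String utf8Line = firstLine(utf8File);

        // A backslash surviving the decode means the file holds escape text.
        System.out.println("escaped file has backslash: " + (escapedLine.indexOf('\\') >= 0)); // true
        System.out.println("UTF-8 file has backslash: " + (utf8Line.indexOf('\\') >= 0));      // false
    }
}
```

    Properties files are often written with escapes like this (for example by tools such as native2ascii), which would explain why ResourceBundle shows the right characters while a plain reader does not.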

  2. #2
    Member
    Join Date
    Apr 2014
    Posts
    93
    Thanks
    3
    Thanked 7 Times in 7 Posts

    Re: How to Properly Read UTF-8 From InputStream

    In case anyone else runs into the same trouble, I found a solution. Since ResourceBundle was always doing the right thing for me, I dug into why and found that java.util.Properties does the decoding in its internal loadConvert() method. The file itself contains the escape sequences as plain text, so the charset passed to InputStreamReader has no effect on them. After the BufferedReader gives me a line of text from the file, I need to explicitly decode the Unicode escapes in that String, roughly like this:

     
    	public void load(InputStream is) throws Exception {
     
    		BufferedReader br = null;
     
    		try {
    			// Passing "UTF8" or "UTF-8" to this constructor makes no difference for me:
    			br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
    			String line = null;         
    			while ((line = br.readLine()) != null) {
    				// The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好"
    				System.out.println("got line: " + line);
    				line = decodeUni(line);
    				// The following prints "decoded line: chinese = 你好" exactly as it should!
    				System.out.println("decoded line: " + line);
    			}
    		} finally {
    			if (br != null) {
    				br.close();
    			}
    		}       
    	}
     
    	// Converts encoded "\\uxxxx" to unicode chars
    	private String decodeUni(String string) {
     
    		char[] charsIn = string.toCharArray();
    		int len = charsIn.length;
    		char[] charsOut = new char[len];
    		char ch;
    		int outLen = 0;
    		int off = 0;
    		int end = off + len;
     
    		while (off < end) {
    			ch = charsIn[off++];
    			// Is this the start of an escape sequence such as "\\u"?
    			if (ch == '\\') {
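    				if (off >= end) {
    					// A trailing backslash has nothing to escape; keep it and stop.
    					charsOut[outLen++] = ch;
    					break;
    				}
    				ch = charsIn[off++];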
    				if(ch == 'u') {
    					// Yep! Convert the hex part to the correct character.
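    					if (end - off < 4) {
    						// Fewer than four characters left, so the escape is truncated.
    						throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
    					}
    					int value = 0;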
    					for (int i = 0; i < 4; i++) {
    						ch = charsIn[off++];  
    						switch (ch) {
    							case '0': case '1': case '2': case '3': case '4':
    							case '5': case '6': case '7': case '8': case '9': {
    								value = (value << 4) + ch - '0';
    								break;
    							}
    							case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': {
    								value = (value << 4) + 10 + ch - 'a';
    								break;
    							}
    							case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': {
    								value = (value << 4) + 10 + ch - 'A';
    								break;
    							}
    							default: throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
    						}
    					}
    					charsOut[outLen++] = (char)value;
    				} else {
    					// Starts with a backslash but not "\\u", handle the other possible escaped characters.
    					switch (ch) {
    						case 't':
    							ch = '\t';
    							break;
    						case 'r':
    							ch = '\r';
    							break;
    						case 'n':
    							ch = '\n'; 
    							break;
    						case 'f':
    							ch = '\f';
    							break;
    						default:
    							break;
    					}
    					charsOut[outLen++] = ch;
    				}
    			} else {
    				// Not a backslash, leave as-is.
    				charsOut[outLen++] = ch;
    			}
    		}
    		return new String(charsOut, 0, outLen).trim();
    	}
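
    If the file really is a key/value properties file, a shorter route may be to let java.util.Properties do the decoding instead of reimplementing it. The following is a minimal sketch under that assumption, not a drop-in replacement for the code above: the class name PropertiesLoadSketch and the sample data are hypothetical. It relies on Properties.load(Reader), which reads characters from the reader and also converts Unicode escapes in keys and values.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesLoadSketch {

    // Loads key/value pairs; Properties.load(Reader) decodes backslash-u escapes itself.
    public static Properties load(InputStream is) throws IOException {
        Properties props = new Properties();
        // try-with-resources closes the reader (and the wrapped stream) automatically.
        try (Reader reader = new InputStreamReader(is, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file content with the escapes stored as plain text.
        // Each doubled backslash in this literal is one literal backslash at run time.
        byte[] data = "chinese = \\u4f60\\u597d\n".getBytes(StandardCharsets.UTF_8);

        Properties props = load(new ByteArrayInputStream(data));

        // The value now holds the two decoded characters rather than the escape text.
        System.out.println("chinese = " + props.getProperty("chinese"));
    }
}
```

    The trade-off is that Properties gives back a map of keys and values, so line order and comments are lost. If those matter, a line-by-line decoder like decodeUni() above is still needed.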

  3. #3
    Super Moderator Norm
    Join Date
    May 2010
    Location
    Eastern Florida
    Posts
    25,042
    Thanks
    63
    Thanked 2,708 Times in 2,658 Posts

    Re: How to Properly Read UTF-8 From InputStream

    Thanks for posting the solution.
    If you don't understand my answer, don't ignore it, ask a question.
