Updated frequently
Future Plans
- Sample application coming
- Localization
- Java 5 XML Properties, timezones, currency and text-orientation issues (requested by Christopher Brown)
If anyone has good material on the issues above that you'd like to share, please let me know and I'll add it to this document. Questions are welcome, too!
Introduction
This short guide tries to cover all the details required to write web applications that can handle the Unicode (UTF-8) character set at every step, back and forth. With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time. I won't go through the benefits thoroughly, since the Unicode Consortium has better answers than I could ever produce.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
You should know this is not a theoretical document. This is an "On your knees, codemonkey. Hands into the dirt! NOW!" type of document. You may also be interested in the official Java Internationalization FAQ. This document started as a technical point of view on internationalization (that is, ensuring one character set everywhere), but I think it has to say more about internationalization from the user's point of view, that is, localization (collation, timezones, currencies, ...).
Typical data flow in the web application:
Browser <=> Web Server <=> Application Server <=> Your Application <=> JDBC Driver <=> Database
If you look at the data flow, you'll soon realize that a single glitch anywhere in the chain ruins the whole process. This document aims to help YOU write (web) applications that work flawlessly when it comes to character encoding issues.
I am using the Resin application server by Caucho Technology in the tips, but I hope to support other application servers as well in the future. User feedback is more than welcome!
This material is copyrighted by the author and all the contributors. All credit goes to those who have contributed. Please do not copy this document; link to it instead. I'm aware that this document may have inaccuracies. If you take a copy of it, I will never be able to fix the copied document. I recommend you report all inaccuracies to me.
Editing Files
You are probably using some text editor to write your code, XML configuration files, etc. I suggest you find a decent text editor which supports UTF-8 and from then on write all files in that encoding. Accept no other encoding from anyone. Make sure your text editor also reads and writes UTF-8 files correctly. Check out the tools section.
Editing .properties Files
You shouldn't edit .properties files with a normal text editor. Why's that? Because Java traditionally reads .properties files as ISO-8859-1, so non-ASCII text is written as Unicode escapes (\uXXXX, one UTF-16 code unit in hex). Here is an example:
key=This is a sample key.\u00F6\u00E4\u00E5
If your text editor cannot write these files correctly, then I suggest you change your text editor. Check out the tools section.
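If you are stuck without such an editor, the escapes can also be produced programmatically. Here is a minimal sketch, assuming a helper class I'm calling PropertiesEscaper (a made-up name, not part of the JDK). It only deals with non-ASCII characters and does not handle the other special cases of the .properties format, such as backslashes or leading spaces.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper (not a JDK class): turns non-ASCII characters into
// the six-character Unicode escapes that .properties files expect.
class PropertiesEscaper {

    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c); // plain ASCII passes through unchanged
            } else {
                // One escape per UTF-16 code unit, four upper-case hex digits.
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String line = "key=" + escapeNonAscii("This is a sample key.\u00F6\u00E4\u00E5");
        System.out.println(line);

        // Round trip: java.util.Properties decodes the escapes again on load.
        Properties props = new Properties();
        props.load(new StringReader(line));
        System.out.println(props.getProperty("key"));
    }
}
```

The first line printed should match the sample entry above; the second shows that Properties turns the escapes back into the original characters.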
Default File Encoding
VERY IMPORTANT!
Remember that this will also affect System.out and System.err, so if you write any output through them you might get incorrect results.
When a Java application opens a file, it assumes the file is encoded in the default file encoding, and that depends on the platform underneath the JVM. Once you start editing files in UTF-8, you should also tell your application that all your files are UTF-8 encoded.
Here is how to do that:
java -Dfile.encoding=UTF-8 MyGreatApp
Since you are using an application server, I suggest you put that property setting somewhere in its startup script.
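The system property changes the default for the whole JVM. Where you control the code that opens a file, you can also name the encoding explicitly, so the result doesn't depend on how the JVM was started. A minimal sketch, assuming Java 8 or later (the class and method names are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingDemo {

    // Reads the whole file as UTF-8, regardless of the JVM default encoding.
    static String readUtf8(Path file) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Writes the text to a temporary file as UTF-8 and reads it back.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return readUtf8(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // What the JVM would assume if no encoding were named.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        // Note: System.out itself still encodes with the default charset.
        System.out.println(roundTrip("\u00F6\u00E4\u00E5"));
    }
}
```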
Compiling Source Code
You need to specify the encoding of your source files because, remember, you are using UTF-8 and not the platform's default encoding!
javac -encoding UTF-8 MyPreciousSource.java
Configuring Application Server
Let's take a look at this snippet from the resin.conf config file. The important lines are walked through in the analysis below.
1  <caucho.com>
2    <java compiler='C:\j2sdk1.4.2_03\bin\javac.exe' args="-g -encoding UTF-8"/>
3    <http-server app-dir='c:\workspace\myproject\web' class-update-interval='2'
4                 character-encoding='UTF-8'>
5      <classpath encoding='UTF-8'/>
6      <jsp static-encoding='false' fast-jstl='false'/>
7      <http port='80'/>
8      <servlet-mapping url-pattern='/servlet/*' servlet-name='invoker'/>
9      <servlet-mapping url-pattern='*.jsp' servlet-name='com.caucho.jsp.JspServlet'/>
10   </http-server>
11 </caucho.com>
Analysis
- 2nd line: specifies the character encoding used by the source files. Okay, this may be redundant since it's mentioned in the classpath config too, but I'm getting paranoid about this.
- 4th line: tells the application server to use UTF-8, for example when reading parameters from the HTTP request.
- 5th line: defines the encoding of the source code that gets compiled in the classpath directory (auto-compiling).
- 6th line: disables static encoding and fast-jstl. The fast-jstl implementation uses the iso-8859-1 character set, so naturally we wish to disable such behaviour. Disabling static encoding may or may not be useful; I'll have to verify this.
Configuring the Web Server
Check that your web server, for example Apache, does not add or override any headers that might conflict with your application.
For example, the AddDefaultCharset directive in the Apache configuration is really an override, not a default. If you leave it enabled, Apache will replace the charset in the Content-Type header with iso-8859-1, which isn't really what you want. Now, be a good codemonkey, edit your httpd.conf or site-specific config and make sure it has a line which says:
AddDefaultCharset Off
Servlets
You have to set the encoding for the request before reading request parameters or reading input using getReader(). Likewise you have to set the encoding for output before writing output using getWriter().
Here is an example:
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class MyServlet extends HttpServlet {
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, java.io.IOException {
        // set the encoding for the input parameters
        request.setCharacterEncoding("UTF-8");
        String test = request.getParameter("test");
        if (test == null) test = "input parameter 'test' was null";
        // set the content type AND encoding for the output
        response.setContentType("text/html;charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.print(test);
    }
}
Thanks to Vesa Hiltunen for bringing this up. I forgot this from the initial version.
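To see why the request encoding matters, here is a small stand-alone sketch (plain JDK, no servlet API; the class name is made up) of what happens when UTF-8 bytes sent by the browser are decoded as iso-8859-1, which is my assumption of what a container typically falls back to when no encoding has been set:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {

    // Encodes the text as UTF-8 (what the browser sends) and decodes the
    // bytes with the given charset (what the server assumes).
    static String sendAndDecode(String text, Charset serverCharset) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, serverCharset);
    }

    public static void main(String[] args) {
        String typed = "\u00F6\u00E4\u00E5"; // the Scandinavian letters from the examples

        // Correct assumption: the text survives unchanged.
        System.out.println(sendAndDecode(typed, StandardCharsets.UTF_8));

        // Wrong assumption: every two-byte character turns into two
        // unrelated Latin-1 characters.
        System.out.println(sendAndDecode(typed, StandardCharsets.ISO_8859_1));
    }
}
```

Three characters go in and six come out in the second case. That is the classic symptom to look for when a step in the chain uses the wrong encoding.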
JSP Files
VERY IMPORTANT!
Never use <%@ include file="..." %> or <jsp:include page=".."/> to include the same settings for every page, because they don't work the way you think they do. Always have these settings in every JSP file.
Let's take a look at the JSP file fragment.
1 <%@page
2     contentType="text/html; charset=UTF-8"
3     pageEncoding="UTF-8"
4 %>
Analysis
- Second line defines the content type and encoding for the output.
- Third line defines the encoding for the JSP file, which naturally should be UTF-8
HTML Pages
Let's look at the code once again.
1 <html>
2 <head>
3 <title>My Precious Form</title>
4 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
5 </head>
6 <body>
7 ...
8 </body>
9 </html>
Analysis
- Look at the 4th line. This is necessary to tell the browser that we are sending HTML in UTF-8 encoding (keep in mind you have to write the document in UTF-8 encoding, too!). This is probably unnecessary, as it is already done by the JSP, but I suggest you still put it there. A browser may fail to recognize the HTTP header.
JavaScript
So you wish to do some JavaScript output? Can do!
...and as always, caveats exist. You have to encode the string on the server before writing it into the page, and then decode it with the JavaScript function decodeURIComponent() before using it, unless you are going to use it in a URL string. You may not use the JavaScript function unescape(), since it does not decode UTF-8 sequences!
Example #1: in document.write()
<script language="JavaScript">
document.write( decodeURIComponent( "<%= JavaScriptUTF8Encoder.encode("some trashy scandinavian characters like öäå" ) %>" ) );
</script>
Example #2: in the URL strings
<script language="JavaScript">
location.replace('http://www.google.com/search?q=' + "<%= java.net.URLEncoder.encode("testöäå", "UTF-8") %>");
</script>
JavaScriptUTF8Encoder.java is really a slightly modified version of URLUTF8Encoder.java from W3.org.
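JavaScriptUTF8Encoder.java itself is not reproduced here. As a rough idea of what such an encoder has to produce, here is a sketch that uses only the JDK. It is my assumption of an equivalent, not the actual source, and the class name is made up. The one extra step over java.net.URLEncoder is turning "+" into "%20", because decodeURIComponent() does not treat "+" as a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical stand-in for an encoder whose output can be passed to
// the JavaScript function decodeURIComponent().
class JsUriComponentEncoder {

    static String encode(String text) {
        try {
            // URLEncoder percent-encodes the UTF-8 bytes but writes spaces
            // as '+'; a literal '+' in the input is already encoded as %2B,
            // so after encoding every '+' stands for a space.
            return URLEncoder.encode(text, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("scandinavian characters like \u00F6\u00E4\u00E5"));
    }
}
```

The result contains only letters, digits and the characters ".", "-", "*", "_" and "%", so it can sit inside a quoted JavaScript string without further escaping.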
HTTP GET
If you are using HTTP GET from a form, you can follow the instructions for HTTP POST, but be warned: GET can only carry a finite amount of data within the URL. Some recommendations have suggested 4 kilobytes as the limit. That limit is passed surprisingly quickly when dealing with Asian languages. I recommend you use HTTP POST and use HTTP GET only for links.
However, if you are going to send data from a link, you have to encode the data yourself. Take a look at URLUTF8Encoder.java from W3.org.
Example:
<a href="http://www.google.com/search?q=test%C3%A5%C3%B6%C3%A4">My Search</a>
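A link like the one above can be built with java.net.URLEncoder. The sketch below (class and method names are made up for illustration) also shows why the URL length limit is reached so quickly: each character in the Basic Multilingual Plane beyond U+07FF takes three bytes in UTF-8, and each byte becomes three characters in the URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryLinkDemo {

    // Percent-encodes the UTF-8 bytes of the value for use in a query string.
    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Builds a search link in the same form as the example link.
    static String searchLink(String query) {
        return "http://www.google.com/search?q=" + encodeParam(query);
    }

    public static void main(String[] args) {
        System.out.println(searchLink("test\u00E5\u00F6\u00E4"));

        // Three Japanese characters grow to 27 characters once encoded.
        String japanese = "\u65E5\u672C\u8A9E";
        System.out.println(japanese.length() + " -> " + encodeParam(japanese).length());
    }
}
```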
HTTP POST
Let's take a look at the form
1 <html>
2 <head><title>My Precious Form</title></head>
3 <body>
4 <form method="post" accept-charset="UTF-8" action="..."
5       enctype="multipart/form-data">
6 <input type="submit">
7 </form>
8 </body>
9 </html>
Analysis
- The 4th line tells the browser to send any input in the UTF-8 character set.
- enctype in the 5th line not only enables file uploads but also has better Unicode handling. Well, that's at least what people say on the net. I have not found this necessary. I'd like to know more about problems that have required this setting. Feedback, anyone?
Database Server
First rule: choose a database server that can handle the UTF-8 or Unicode character encoding. Once I had to take part in a project which had to use a MySQL version that had only the iso-8859-1 encoding available, and we had to support both iso-8859-1 and KOI8-R data. It was.. interesting. If I ever meet a project like that I'll shoot on sight.
I recommend using the PostgreSQL server. It's versatile, fast, free and has excellent Unicode support. However, you should know it lacks some of the collation features you might need in your multinational applications.
CREATE DATABASE myprecious WITH ENCODING = 'UNICODE';
JDBC Driver
In some database servers it's possible to have a different character encoding for each connection. I have not seen such JDBC drivers in a while, but they used to exist. Reports, anyone?
Collation
So, what is collation? Collation is the assembly of written information into a standard order. In other words: sorting. What about that standard order? Well, that depends. Here is an example taken from the default Unicode Collation Algorithm:
Swedish: z < ö
German: ö < z
As you can see, the ordering rules may be different in different locales (you do have users both in Sweden and Germany, don't you?), and so must be the order of the returned data. Your users may get frustrated when they don't find their data in the correct place in the user interface. How to cope with that? You should be able to tell the JDBC driver, on a per-connection basis, in which locale's order each query should return rows. How do you tell it? Unfortunately, for some databases you cannot. For example, in PostgreSQL changing the collation order requires repeating initdb. In MySQL (4.1 and above) you can use COLLATE within the SQL statements. For example:
SELECT name FROM customer ORDER BY name COLLATE utf8_german2_ci;
In Microsoft SQL Server (and probably Sybase, too?) you have to set the collation for each database field. I think this is not good behaviour, since it's the user who expects the data in his/her own locale's order, not in the order the developer has chosen. Can you set a connection-specific collation in SQL Server? Reports please.
CREATE TABLE [user].[table] (
    [field1] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
    [field2] [varchar] (35) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]
GO
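When the database cannot order rows per user locale, one possible fallback (my suggestion, not something the databases above do for you) is to sort the fetched values in the application with java.text.Collator. A minimal sketch, assuming the values fit in memory; the class and method names are made up:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class LocaleSortDemo {

    // Returns a copy of the values sorted by the collation rules of the locale.
    static String[] sortFor(Locale locale, String... values) {
        String[] sorted = values.clone();
        Arrays.sort(sorted, Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        String[] names = {"z", "\u00F6"}; // "z" and o-with-diaeresis

        // Swedish treats the letter as a separate one that comes after z.
        System.out.println(Arrays.toString(sortFor(Locale.forLanguageTag("sv-SE"), names)));

        // German treats it as a variant of o, so it comes before z.
        System.out.println(Arrays.toString(sortFor(Locale.GERMANY, names)));
    }
}
```

Sorting in the application only makes sense for result sets that are fully loaded; it does not help with paging or LIMIT-style queries, where the database decides which rows come first.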
Tools
Currently I'm using Eclipse and jEdit. Make sure you change the default encoding for newly created files!
Success Stories
Be the first one to report a success story! Did this work help you? Let us hear it!
