UTF-8 Support
Unicode/UTF8 support is not very mature in b2evolution yet, but it is essential for languages that use non-Latin charsets.
The solution seems simple: use UTF8 throughout the application!
UTF8 as default
UTF8 should be the default in the whole application. Luckily moving from iso-8859-1/latin1 to utf-8 is quite easy, because latin1 encoding is included in utf8 - so reading a latin1 encoded file is the same as reading it as if it were utf8 encoded (AFAIK).
fp: WRONG! That's only true for 7 bit ASCII. é (a common Latin1 character I use daily) is encoded differently in latin1 and utf8.
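To make fp's point concrete, here is a small PHP sketch that compares the bytes of "é" in both encodings. It assumes the iconv extension is enabled, and the variable names are only illustrative.

```php
<?php
// "é" as a single Latin-1 byte, written as an escape so that the encoding
// of this source file does not matter.
$latin1 = "\xE9";

// Convert Latin-1 to UTF-8 (assumes the iconv extension is enabled).
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), "\n"; // e9
echo bin2hex($utf8), "\n";   // c3a9

// With the /u modifier, preg_match() fails on subjects that are not valid UTF-8.
$latin1_is_valid_utf8 = (preg_match('//u', $latin1) === 1);
$ascii_is_valid_utf8  = (preg_match('//u', 'abc') === 1);

var_dump($latin1_is_valid_utf8); // bool(false)
var_dump($ascii_is_valid_utf8);  // bool(true)
```

So a Latin-1 file is only readable as UTF-8 as long as it sticks to 7-bit ASCII.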
What needs to be done?
* Send "Content-Type: text/html; charset=utf-8" header with every page we generate
** Backoffice: /blogs/inc/VIEW/_menutop.php
** Frontoffice: just before we display the skin.
** Skins/general: The [meta http-equiv="Content-Type" ...] lines should be changed to "charset=utf-8" also
fp: BAD IDEA: on some shared hosts you cannot override the charset which is forced by an HTTP header directly by Apache. Sometimes though that forcing *is* UTF-8 !!
** Abandon use of $locales[xx-XX]['charset'] / locale_charset(): it should only be used to define the character encoding of the messages file, but better to just expect utf8 there
fp: NOWAY: using a single byte charset can dramatically decrease the size of the pages in some languages.
* $EvoConfig->DB['connection_charset'] should default to "utf8", because this executes "SET NAMES utf8;" and defines the encoding of the data we get from DB. (Note: "Setting character_set_connection to x [what SET NAMES does] also sets collation_connection to the default collation for x." [http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html]; this is equivalent to having "default-character-set=utf8" in the "[mysql]" section in the MySQL option file)
I've tested this locally and it seems to work very well, at least with the Arabic snippet I've taken from the forums.
Problems with MySQL
* "utf8" is supported in MySQL since 4.1 [http://dev.mysql.com/doc/refman/4.1/en/charset-unicode.html]
We might also want to make sure that utf8 is supported by querying MySQL like so:
SHOW VARIABLES LIKE 'character_set%';
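Since "utf8" only exists from MySQL 4.1 on, a plain version check is another possible guard. This is only a sketch: mysql_supports_utf8() is a hypothetical helper, not an existing b2evolution function, and the version string is assumed to come from the server (for example via SELECT VERSION()). It uses current PHP syntax (7.0+).

```php
<?php
/**
 * Hypothetical helper: true if the given MySQL server version is 4.1 or newer.
 */
function mysql_supports_utf8(string $server_version): bool
{
    // Keep only the leading numeric part, e.g. "4.1.22-log" becomes "4.1.22".
    if (!preg_match('/^\d+(\.\d+)*/', $server_version, $m)) {
        return false; // Unrecognised version string: assume no support.
    }
    return version_compare($m[0], '4.1', '>=');
}

var_dump(mysql_supports_utf8('4.0.27'));     // bool(false)
var_dump(mysql_supports_utf8('4.1.22-log')); // bool(true)
var_dump(mysql_supports_utf8('5.0.51a'));    // bool(true)
```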
22 comments
1- The charset used by MYSQL
2- The charset used internally by b2evo in PHP ($evo_charset)
3- The charset used for formatting and sending data back to the user.
ATTENTION: MySQL, PHP and W3C all use different variations of the charset names. 'utf-8' vs 'UTF-8' vs 'UTF8' etc... THUS, YOU CANNOT USE THE SAME VAR FOR ALL OF THEM.
NOTE: it might be even worse: different parts of PHP (XML parser, iconv, mbstring MIGHT use different charset names)
1: The charset of MYSQL is defined inside of MYSQL. We *might* want to specify it explicitly in CREATE statements.
2: The internal charset is defined by $evo_charset but not used yet (because we do no charset conversion, no mbstring support, etc...) It should THEORETICALLY be used by SET NAMES to ask mysql to send us the data in the right charset. BUT because charset names are different, we need a special var for SET NAMES.
3: The charset used for reply is defined by each locale. All outputs should be converted to the main locale selected by the user.
WARNING: we should once again define it (at least) twice for each locale. Once in W3C format and once in PHP/mbstring format.
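Because the same charset goes by different names in HTTP/HTML, in mbstring and in MySQL, one way to avoid a single shared variable is a small lookup table. The sketch below is an illustration only: the table layout and the charset_name_for() helper are assumptions, not b2evolution's actual locale data. It uses current PHP syntax (7.1+).

```php
<?php
// Illustrative lookup table: one entry per charset, keyed by the lower-case
// name used in HTTP headers and HTML meta tags.
$charset_names = array(
    'utf-8'      => array('http' => 'utf-8',      'mbstring' => 'UTF-8',      'mysql' => 'utf8'),
    'iso-8859-1' => array('http' => 'iso-8859-1', 'mbstring' => 'ISO-8859-1', 'mysql' => 'latin1'),
    'iso-8859-2' => array('http' => 'iso-8859-2', 'mbstring' => 'ISO-8859-2', 'mysql' => 'latin2'),
);

/**
 * Hypothetical helper: return the name of $charset as expected by one
 * consumer ('http', 'mbstring' or 'mysql'), or null if it is unknown.
 */
function charset_name_for(array $table, string $charset, string $consumer): ?string
{
    $key = strtolower(trim($charset));
    return $table[$key][$consumer] ?? null;
}

echo 'Content-Type: text/html; charset=', charset_name_for($charset_names, 'UTF-8', 'http'), "\n";
echo 'SET NAMES ', charset_name_for($charset_names, 'UTF-8', 'mysql'), "\n";
```

An unknown charset yields null, so the caller can decide whether to skip SET NAMES or refuse the locale.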
- sending utf8 encoding header (where you can set a boolean "replace" - I cannot believe that PHP gets overridden by Apache, what you mean is probably AddDefaultCharset, which gets used by Apache, when there's no charset sent by PHP)
- issuing a "SET NAMES utf8" query before every other query
IF the blog admin requests it (and therefore has MySQL 4.1)?
I don't see a benefit of an internal charset, and also of sending the page in the locale of the user! - what if there are entries in the list of posts, that his charset (latin1 for example) does not support??
UTF8 is the way to go, IMHO and given that it dynamically assigns 1-4 bytes to a character, this should do no harm in the meaning of file lengths.
The only problem with this simple attempt seems to be: what to do about the messages files? When this utf8-behaviour gets used, they should also be encoded as utf8, shouldn't they?
Before implementing anything in this direction, I would make tests of course - this might show, that it does not matter which encoding the messages files have, because PHP converts them automagically when they get included.
> fp: NOWAY: using a single byte charset can dramatically decrease the size of the pages in some languages.
Do you mean file size? Given the dynamical range of 1-4 bytes per character this is only true, if you use a lot of "é" characters (which seem to be two-byte in utf8)
Remember: I'm talking of a global switch that has to be enabled manually. It won't change the current behaviour. Only, if there comes the glory time when we require MySQL 4.1 it may get the default behaviour (or even get removed completely).
For all the rest I'm getting tired of repeating the same things over and over again.
- internal charset: you can't do proper preg_match(), strlen() etc. on utf8 if you don't have mbstring, for example. So utf8 by default is out of the question. Then if you get data from the DB in one specific Japanese encoding and send it out in another specific Japanese encoding, it probably doesn't make much sense to convert it twice, once to utf8 and once again after that.
(And YES if you select a Japanese internal charset in order to display Greek you will lose all characters except punctuation. But no one is preventing you from being smart... or at least from using utf-8 if you don't know better).
- locale of the user: when the user requests a specific locale, he doesn't just require a language. He also requests a specific date formatting, a specific charset, etc. I have no problem with doubling all locales with an -utf8 version. But unnecessary conversions to utf8 when you don't need them are stupid (it DOES take processing to convert all outputs).
- html page output lengths: any language that is not based on the latin characters will use AT LEAST 2 BYTES per character in UTF8 when it could use only 1 in another charset. Russian/Cyrillic languages for example.
- message files: translators already find it hard to translate in their daily charset. If we force them to understand UTF8 which they never use it will only get harder. Also, there is no need to do unnecessary charset conversions when someone wants to use latin-5 all along for example (DB and page output).
Again, I'm okay for UTF8 done the right way. I am NOT okay for a global switch that boldly forces UTF-8 as an overkill solution everywhere, and if you can't support it (because you don't have the extension, or the processing power, or the bandwidth) then you get screwed and stick with the current bugs... just because... err... what's the reason for refusing to consider 3 charsets exactly? Too hard to type utf-8 3 times in the conf file?
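Two of the points above (byte-oriented string functions and page size) can be checked with a few lines of PHP. This is a minimal sketch assuming the iconv extension is enabled; ISO-8859-5 is picked here only as an example of a single-byte Cyrillic charset.

```php
<?php
// "Привет" (6 Cyrillic letters) as UTF-8 bytes, written with escapes so that
// the encoding of this source file does not matter.
$russian_utf8 = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// The same text in a single-byte Cyrillic charset (assumes iconv is enabled).
$russian_8bit = iconv('UTF-8', 'ISO-8859-5', $russian_utf8);

$utf8_bytes   = strlen($russian_utf8);               // strlen() counts bytes
$single_bytes = strlen($russian_8bit);
$chars        = iconv_strlen($russian_utf8, 'UTF-8'); // counts characters

echo "UTF-8 bytes: $utf8_bytes\n";        // 12
echo "ISO-8859-5 bytes: $single_bytes\n"; // 6
echo "characters: $chars\n";              // 6
```

The markup around the text is plain ASCII in both cases, so the overall page grows by less than a factor of two; how much depends on the ratio of text to markup.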
You also need PHP to support it, and it DOES NOT by default! (It will in PHP6.)
In _general_, a browser sends the form encoded in the same charset as the page it was served.
See http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html which has all the details and a simple solution (using utf8).
Apart from that, I still don't see, what "internal PHP charset" should be..
At least, given the above assumption, it's the same as the user locale's charset for the INPUT data and should just stay the same for the OUTPUT data.
Therefore, the "internal" charset should be the same as the one used for output (which is the locale's charset): This way you don't have to convert INPUT to "internal" and back to OUTPUT.
What now needs to get converted to this "internal" charset, is the DB data and the data from the message file(s).
The "DB conversion" is easy: SET NAMES, which should not be an extra variable, but get mapped from the locale's charset.
The "messages conversion" is not necessary, because the message file should be encoded in the locale's charset.
Implementation then would look simply like:
- add mappings for "SET NAMES" to the locales
- do a "SET NAMES" query in the DB constructor
Of course, this requires MySQL 4.1 and mbstrings/PHP6 for locale charsets, that are neither supported by MySQL nor by PHP (without mbstrings extension).
Am I overlooking something?
2) Can you understand that if data is stored in a specific charset inside of the DB and if it's stored in a specific charset inside of the browser, it is also stored in a specific charset inside of PHP?
Can you understand that PHP (below version 6) is not UNICODE by nature?
If you can understand those simple things, you must be able to understand what a "php charset" is, as opposed to the "browser/http/html output/form input charset" and/or the "db charset".
Do you have a problem with the word "internal"? Would you feel better if we called it only "PHP charset"? or "mbstring charset"? or "string manipulation charset"? or "regexp charset"?
3) Why do we need a PHP charset different from the DB charset or the HTTP charset?
Because as soon as you can handle UTF-8 and EITHER the DB or the HTTP is using UTF-8, then YOU DO WANT to handle UTF-8 internally for maximum flexibility and you lose NO performance. You just choose where the conversion gets done.
Also there is a significant risk that when PHP 6 comes out you cannot choose which charset you use internally and you will HAVE TO use UTF-8 because all hosts will set up php.ini that way.
4a) YES I AGREE that we will achieve better performance if all 3 charsets are the same, or if at least 2 of these (db+php or php+http) are the same.
4b) YES I AGREE that it will be easier to set-up if we make ALL THREE charsets UTF-8 (but that is not possible on all installations PLUS it may degrade network transfers in some languages like Russian).
4c) We WILL put a note somewhere stating "If you have charset issues and you have no idea how to solve them try to set everything to utf-8".
4d) We WILL make UTF-8 the default (for all 3 charsets) when all software pieces involved are supporting it on a wide scale.
5) Again, what's your problem with typing in utf-8 three times in the conf instead of two??
1. User's locale is latin1, DB is UTF8.
We get input in latin1, we can ask the database to send us latin1 (MySQL does the conversion). Why would we need a PHP charset there?
2. User's locale is UTF8, DB is big5.
We get input in UTF8, we ask the DB to send us UTF8. Why would we need a PHP charset here?
You say that we would need a PHP charset in the second case, because PHP cannot handle UTF8 always. So what should we do? Convert the input to latin1, ask the DB for latin1, too, and convert it to UTF8 again, when sending??
Re: your points:
1) No, I don't want a fourth charset. _If_ we would use accept-charset, then it should be of course our "internal PHP" charset.
2) Sure, there's a charset for the data PHP handles. And it's the same as the INPUT (and OUTPUT) charset, as long as we don't convert it.
3) The only reason to have a "PHP charset" is then, to handle data as UTF8 when either the user or the DB does not support it?!
5) I have no problem with typing "utf-8" three times: I just don't understand why we should need to convert INPUT to internal and then again to OUTPUT (if "PHP charset" and "user charset" are not the same).
- I AM recommending that you can choose WHERE you convert (between DB/PHP or PHP/HTTP) in order to adapt to all funky installations of MySQL and PHP.
- I AM also requiring deployment flexibility instead of lazy design choices.
- YES I understand it is more appealing for the lazy coder to convert at DB/PHP because it's easier to just use SET NAMES instead of implementing all the mb_string mumble jumble.
- NO it doesn't always work to use SET NAMES. If you have an old/funky install of MySQL you may NOT have many charsets available in MySQL, thus very limited conversions... WHILE you MIGHT have all you need in mb_strings.
- YES I understand (now) that you feel trapped in a scenario where DB is UTF-8 and one client is requesting latin1 while ANOTHER client is requesting latin2. Using a fixed internal charset -- say latin1 -- would force a double conversion for latin2. This is not what I meant. I took fixed examples for the sake of simplicity, but I understand it's confusing now. So, YES in the case where you offer locales with different charsets (which is not recommended, see below) you might want to have the PHP charset follow the user charset automagically. Let's say you just keep it empty string in the conf which means use current-locale.
- NO, doing the above WILL NOT fulfill the lazy goal of not implementing mbstring output conversions! If you have a post in german/latin1 followed by a post in french/latin5 (i'm making this up) in blog #1, you will need gettext-messages in french/latin5 like "read more" to be converted to latin1 to fit on the same page as the german stuff. So THERE WILL BE output conversion, at least for gettext messages. Once you have it there...
- DO I need to state that mbstring converting from utf8 to utf8 or from latin1 to latin1 (when the conversion has already been done by set names) won't cost too much?
- in param(), see if $locale_charset differs from $evo_charset; if so, convert
- when connecting to DB and $db_charset differs from $evo_charset, use "SET NAMES $evo_charset"
- in T_(), if $evo_charset differs from $messages_charset (for the locale's file), convert it
So we have $evo_charset, locale_charset() and $db_charset.
If $evo_charset is empty, it gets set to locale_charset().
Does this sound reasonable in general?
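The rules proposed above can be written down as a small decision function. This is only a sketch of the proposal, not existing b2evolution code: plan_charset_handling() and its array keys are hypothetical, charset names are simply compared case-insensitively, and the mapping between PHP and MySQL charset names is left out. It uses current PHP syntax (7.0+).

```php
<?php
/**
 * Hypothetical sketch: decide which conversions are needed for one request.
 */
function plan_charset_handling(
    string $evo_charset,
    string $locale_charset,
    string $db_charset,
    string $messages_charset
): array {
    // An empty $evo_charset means "follow the locale's charset".
    if ($evo_charset === '') {
        $evo_charset = $locale_charset;
    }

    $differs = function (string $a, string $b): bool {
        return strcasecmp($a, $b) !== 0;
    };

    return array(
        'evo_charset'      => $evo_charset,
        'convert_input'    => $differs($locale_charset, $evo_charset),   // in param()
        'set_names'        => $differs($db_charset, $evo_charset),       // when connecting to the DB
        'convert_messages' => $differs($messages_charset, $evo_charset), // in T_()
    );
}

// $evo_charset left empty: it follows the locale, only the DB needs SET NAMES.
print_r(plan_charset_handling('', 'iso-8859-1', 'utf-8', 'iso-8859-1'));

// $evo_charset fixed to utf-8: input and messages are converted, the DB is not.
print_r(plan_charset_handling('utf-8', 'iso-8859-1', 'utf-8', 'iso-8859-1'));
```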
I think we'll also need to add conversion in format_to_output() at some point but I'll have to look at it more closely.
If $evo_charset is empty, it gets set to locale_charset() somewhere in _main.inc, once we have decided on the definitive locale we want to use.
- we connect early to DB, earlier at least than we set the locale (and therefore adjust $evo_charset if it's empty). Therefore the locale setting stuff has to move just after the DB connect, which also includes the "login procedure", which includes the UserCache and GroupCache at least..
- Currently, we activate the Blog's locale last (in _blog_main.inc.php), which does not make any sense IMHO and is currently a HUGE BUG, because it overwrites the user's locale in the front office!!
IMHO this has to be moved to get used as default BEFORE the user's locale has the last word! (would also require splitting of _main.inc.php)
- We'll have to use an output handler, just because T_() converts to our internal charset (e.g. locale_charset() == 'iso-8859-1', $evo_charset == 'utf-8'):
mb_internal_encoding( $evo_charset );
mb_http_output( locale_charset(false) );
ob_start( 'mb_output_handler' );
Apart from these "issues" it already works fine here, but needs a lot more debugging love.
I'm not saying, that it could be done easier/lazier. But at least, you said that I should mess things up.. ;p
Any remarks on those "issues"?
A user has a latin1-locale, but a post is in an utf8/big5/whatever-cannot-be-converted-to-latin1-locale.
The whole page then should get sent in a locale that can display all including posts.
Yes, that's one step beyond.
*if* we are going to display a multilingual blog where we can foresee problems like this one *and* we know we can handle utf8, then we should analyze the user agent accepted charsets and see if we can use utf8 (or maybe another rich charset). However, it would be a waste to do so if we can save one unnecessary conversion.
This should be addressed later.
Are you o.k. with all this? Especially the output buffer seems to be something you try to avoid and which would be needed if $evo_charset differs from locale_charset().
Also, given the "step beyond" (and other issues where we do not use T_() for the current user), we would need an $io_charset (which is mainly what we've called "user's locale charset" above).
-Do NOT move the login procedure etc... I think it's no big deal to SET NAMES later... (for example after having done antispam checks which use plain 7bit ascii queries).
If it proves necessary we might issue a SET NAMES ascii at connection time in case of empty($evo_charset).
-The Blog chooses his own locale, that is a FEATURE not a bug. This is BY DESIGN. DO NOT CHANGE IT. The user locale pref is for the backoffice only.
I'm not against extending this behaviour in the future BUT I AM DEFINITELY AGAINST TOUCHING THIS NOW, in other words: I AM DEFINITELY AGAINST BREAKING EVERYTHING AT THE SAME TIME.
-Yes we need an io_charset which should be the same as the MESSAGES charset (either Blog or user) *for now*.
I'd like to avoid another layer of transliteration for version 1.8.
For the current issue it even makes things more complicated:
I've added $evo_charset/$io_charset validation to _main.inc.php, after the final locale_activate() [it's three times in there].
Now, in _blog_main.inc we don't know anymore if $evo_charset was empty before (and let it follow the blog's locale charset).
Actually we'd need to perform the same check there and also check, if the blog's locale is supported with mbstrings.
As a sidenote: why don't the feeds use the blog's locale?
And btw^2: I'll add charset information there, too - as it is mostly missing (and therefore the sidenote-point above is not really relevant anyway).
It does not make sense IMHO, because we do SET NAMES to match $evo_charset either way.
Anyway, I'll commit it like it is (with the potential of three times SET NAMES) and will polish it based on your feedback.
Another var, named requested_connection_charset should be set to the default charset we may have set in the DB params.
Then there should be a method named set_connection_charset() which does nothing but set requested_connection_charset.
Then, any time we do a query(), if requested_connection_charset != current_connection_charset, we do a SET NAMES.
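The lazy SET NAMES idea described above could look roughly like this. The sketch is not the real b2evolution DB class: the class name LazyCharsetDB and the injected $executor callable (standing in for the actual MySQL call) are assumptions made to keep the example self-contained. It uses current PHP syntax (7.1+).

```php
<?php
/**
 * Sketch of a DB wrapper that only sends SET NAMES when a query is about
 * to run and the requested charset differs from the current one.
 */
class LazyCharsetDB
{
    private $executor;
    private $requested_connection_charset;
    private $current_connection_charset = null; // unknown until we set it

    public function __construct(callable $executor, ?string $default_charset = null)
    {
        $this->executor = $executor;
        $this->requested_connection_charset = $default_charset;
    }

    // Only records the wish; no query is sent here.
    public function set_connection_charset(string $charset): void
    {
        $this->requested_connection_charset = $charset;
    }

    public function query(string $sql)
    {
        if ($this->requested_connection_charset !== null
            && $this->requested_connection_charset !== $this->current_connection_charset) {
            // The name is put into SQL, so restrict it to simple identifiers.
            if (!preg_match('/^[A-Za-z0-9_]+$/', $this->requested_connection_charset)) {
                throw new InvalidArgumentException('Invalid charset name');
            }
            ($this->executor)('SET NAMES ' . $this->requested_connection_charset);
            $this->current_connection_charset = $this->requested_connection_charset;
        }
        return ($this->executor)($sql);
    }
}

// Usage with a fake executor that just records the statements it receives.
$log = array();
$db = new LazyCharsetDB(function (string $sql) use (&$log) {
    $log[] = $sql;
    return true;
});
$db->set_connection_charset('latin1');
$db->set_connection_charset('utf8'); // changed again before any query ran
$db->query('SELECT 1');
$db->query('SELECT 2');
print_r($log); // SET NAMES utf8, SELECT 1, SELECT 2
```

With this shape, changing the charset several times before the first query costs a single SET NAMES instead of one per change.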
There should be a transliteration abstraction library. We may simply call it charsets.inc.php or charconv.inc.php (I don't really know what's best here).
Minimum implementation is to use mb_strings or to fail.
On top of that we can add iconv support and possibly brute force PHP conversions (there is code and utf tables available).
Also when running PHP6, we can use the built-in conversions.
As with connection_charset, the library could probably start with empty io_charset and evo_charset values and only try to set them (from the locale) when we first make a call to evo_to_io() or io_to_evo() or whatever we want to call the functions.
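A minimal version of such a conversion helper might look like the sketch below: it prefers mbstring, falls back to iconv, and fails loudly otherwise. The function name convert_charset() is hypothetical, the brute-force PHP tables mentioned above are left out, and the lazy initialisation of io_charset and evo_charset is not shown. It uses current PHP syntax (7.0+).

```php
<?php
/**
 * Hypothetical conversion helper: convert $string from charset $from to
 * charset $to using whichever extension is available.
 */
function convert_charset(string $string, string $to, string $from): string
{
    if (strcasecmp($to, $from) === 0) {
        return $string; // Nothing to do, so no extension is needed.
    }

    if (function_exists('mb_convert_encoding')) {
        $converted = mb_convert_encoding($string, $to, $from);
        if (is_string($converted)) {
            return $converted;
        }
    }

    if (function_exists('iconv')) {
        // Note the different argument order compared to mbstring.
        $converted = iconv($from, $to, $string);
        if ($converted !== false) {
            return $converted;
        }
    }

    throw new RuntimeException("Cannot convert from $from to $to");
}

$utf8 = convert_charset("\xE9", 'UTF-8', 'ISO-8859-1'); // Latin-1 "é" to UTF-8
echo bin2hex($utf8), "\n";                               // c3a9
echo bin2hex(convert_charset($utf8, 'ISO-8859-1', 'UTF-8')), "\n"; // e9
```

Plain functions are used here rather than a class, in line with the preference stated in this thread.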
Because of the readability gains, I'm not sure creating single instance classes makes sense.
Same applies to DB, Request and a couple more.
Note: SomeCache makes more sense because it extends DataObjectCache.
Anyway, in this case here, I really don't feel the need for a class. Even worse, I really feel it would be bloated.
Also, the stuff you've meant, which checks for 'mb_*' existence, is probably the code in _main.inc.php (duplicated in _blog_main.inc.php), which inits the mb-handling.
Re: DB and Request: do you really think an object makes no sense here? What about all the member variables? Would you rather have them in the global namespace?