Tuesday, February 23, 2010

Can I use latin1 to store utf8 data?


I've table contains text column and its charset is latin1, and i can store Arabic text ( and non English character) in this column and retrieve it, i don't know how is it?

So how is that? and why I need utf8?

CREATE TABLE `post` (
`postid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`threadid` int(10) unsigned NOT NULL DEFAULT '0',
`parentid` int(10) unsigned NOT NULL DEFAULT '0',
`username` varchar(100) NOT NULL DEFAULT '',
`userid` int(10) unsigned NOT NULL DEFAULT '0',
`title` varchar(250) NOT NULL DEFAULT '',
`dateline` int(10) unsigned NOT NULL DEFAULT '0',
`pagetext` mediumtext NOT NULL,
`allowsmilie` smallint(6) NOT NULL DEFAULT '0',
`showsignature` smallint(6) NOT NULL DEFAULT '0',
`ipaddress` varchar(15) NOT NULL DEFAULT '',
`iconid` smallint(5) unsigned NOT NULL DEFAULT '0',
`visible` smallint(6) NOT NULL DEFAULT '0',
`attach` smallint(5) unsigned NOT NULL DEFAULT '0',
`importthreadid` bigint(20) NOT NULL DEFAULT '0',
`importpostid` bigint(20) NOT NULL DEFAULT '0',
`infraction` smallint(5) unsigned NOT NULL DEFAULT '0',
`reportthreadid` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`postid`),
KEY `userid` (`userid`),
KEY `threadid` (`threadid`,`userid`),
KEY `datline_idx` (`dateline`),
KEY `threadid_date` (`threadid`,`dateline`),
FULLTEXT KEY `title` (`title`,`pagetext`)
) ENGINE=MyISAM AUTO_INCREMENT=32451742 DEFAULT CHARSET=latin1

7 comments:

  1. If the table is in latin1 MySQL will by default try to convert utf8 encoded characters to latin1 when you insert them.

    But if you tell MySQL that the connection is latin1 (e.g. SET NAMES 'latin1'), this tells MySQL that all the characters are already latin1 and so it won't try to convert anything.

    So by using SET NAMES 'latin1' you can insert utf8 characters into a latin1 table without them being affected. But you may have problems if you try to use string functions on those characters, as the string functions will believe the characters are encoded in latin1.

    I assume you have SET NAMES 'latin1' being executed or your connection is otherwise being specified as a latin1 connection.

    ReplyDelete
  2. It is a very bad idea to store encoded strings in a table column defined with another charset/encoding than the encoding of data themselves. There is no guarantee how different clients are able to or are not able to handle this.

    A lot of old PHP apps do store utf8 strings in latin1 tables actually, but as they were developed for early MySQL version (3.x, 4.0) this was a necessary workaround for lack of Unicode support in MySQL. Now it is just stupid to do it I think.

    ReplyDelete
  3. Even if it actually works, and you can read out the data fine later...it is not recommended.
    One thing you will learn very quickly is when you want your application to search for something in the database using LIKE. The search will suddenly be case-sensitive since mysql doesn't know how to convert your strings, and sorting will also be wrong.
    It's also quite hard to convert the data back into the correct encoding. So just use UTF-8 from the start, and remember...both tables/columns AND the mysql connection must be utf-8. I recommend to just set it default to utf-8 in your my.cnf

    ReplyDelete
  4. Of course it's possible. Data is just data. A large part of what utf8 means in mysql is just semantics.

    1) You're telling mysql that you're going to put latin1 in a field, then storing something else, so now it has incorrect semantics.

    When you do GROUP BY col, or ORDER BY col, or ever WHERE col='myvalue' it has to make assumptions about what equality or ordering means for the language you specify. But you've lied to it, so it's going to get it wrong.

    Don't lie to your database about your data!

    2.) Utf8 is multi-byte, up to 3 bytes per character. When you define CHAR(1) that creates 3 bytes in utf8, but only 1 byte in latin1. Guess what happens when you try to store a multi-byte character in there? Yup, garbage. Do you really want to micro manage the byte length of all your fields, or would you rather just define character length? It's not always 1:1.

    ReplyDelete
  5. Exactly what gtowey said.

    MySQL will -store- whatever you give it as binary data. It doesn't really care.

    However if you're going to use any MySQL functions on it, like string functions for example, it's not going to work unless you fed it the same binary data stuff.

    If you tried on the command line to query it, it won't work most likely. But the same encoded text will behave the same way if it comes from the same source.

    ReplyDelete
  6. MySQL UTF8 is only 16bit, so I think it is a enough reason not to rely on MySQL providing UTF8 support.

    In fact, I also doubt how many of you really need string functions, string comparsion in the SQL level.

    latin1 works, varbinary also work.

    p.s. wikipedia also don't use UTF8 in MySQL.

    ReplyDelete
  7. Putting utf-8 string into latin1 table is really bad. But mistake is done and it is there. How do I convert the table into utf-8 and keep all utf-8 characters in the correct encoding? It seems no matter how I tried, utf-8 characters are treated byte-by-byte and could not preserve its original format.

    ReplyDelete