MySQL Charset/Collate by Rick James
Posted: Fri Jul 19, 2019 11:46 pm
(this is an archive)
MySQL Charset/Collate
http://mysql.rjweb.org/doc.php/charcoll ... ious_cases
by Rick James
Table of Contents
The problems being addressed
Basic Concepts
History
Best Practice
Conversions and Common Errors
When do bytes get translated?
Life Cycle of a Character
How Mangling Happens on INSERT
Disaster before INSERT
Diagnosing CHARSET issues
The manual
Viewing text
Fixes for various Cases
Entering accents in CMD
Example of Double-encoding
2-step ALTER
Fixing utf8 bytes in latin1 column
Fixing mismatch between CHARSET and data encoding
Fixing mix of charsets in a column
Fixing MicroSoft thingies
Fixing "double encoding"
Fix definition and data, IF correctly encoded
Testing an in-place fix
Fixing while moving data with mysqldump
Fixing while moving data with LOAD DATA
Conversion Gotchas
4-byte utf8
Functions to know about
BOM = Byte-Order-Mark
German "sharp-s"
Where does Ö sort?
Scandinavian collations
Circumflex
Quotes
When is ö not equal to ö?
"Combining" (non-spacing) characters
Cautions
Introspection
my.cnf
Stored Procedures/Functions, Triggers
JSON
utf8 Collations
utf8 Collations Examples
Changing the collation in a SELECT
ERROR 1253 (42000): COLLATION '...' is not valid for CHARACTER SET '...'
utf8 Collation Variations:
8-bit Encodings
More references
latin1 ~= cp1252 ~= ISO 8859-1
Combining Diacriticals
Programmatic Collation
LOAD DATA INFILE
Characters in Western Europe
Turkish
Arabic/Farsi
Hebrew
Cyrillic, Macedonian, etc
Other Natural Languages
REGEXP / RLIKE
Python
utf-8-sig
PHP
Other Computer Languages
MySQL 8.0
Random notes
Footnote
Brought to you by Rick James
The problems being addressed
⚈ Your web page has the wrong characters coming out where accented letters should be.
⚈ You are upgrading from MySQL 4.0, where characters were simply bytes.
⚈ You are trying to wrap your head around how MySQL handles CHARSET / COLLATION.
⚈ You upgraded, and now you are getting garbage.
Please read most of this document. I know it is long (after all, I wrote it!) But you really need to understand a lot of the pieces in it, in order to solve your problem.
The tips in this document apply to MySQL, MariaDB, and Percona. Version differences are called out where relevant.
Basic Concepts
"Character" and "byte" are different! You must understand this before continuing. A "byte" is an 8-bit thing; it is the unit of space in computers (today). A "character" is composed of one or more bytes, and represents what we think of when reading.
A byte can represent only 256 different values. There are over 11,000 Korean characters and over 40,000 Chinese characters -- no way to squeeze such a character into a single byte.
Charset vs collation. These are different things! 'Charset' ('character set'; 'encoding') refers to the bits used to represent 'characters'. 'Collation' refers to how those bits are compared for inequality (WHERE) and sorting (ORDER BY). GROUP BY and FOREIGN KEY CONSTRAINTs can also involve collation. It even involves deciding whether two different bit strings compare as 'equal'.
We address cases where "acute-e" (é) does not show up correctly. If that acute-e shows up as A-tilde plus Copyright (Ã©), then there may be an issue with the browser. Try a different browser: Chrome works; Firefox is broken.
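The byte-vs-character distinction can be seen in any language; here is a quick Python sketch (an illustration outside MySQL) showing that one "character" may occupy several bytes:

```python
# A "character" may occupy several bytes, depending on the encoding.
for ch in ['e', 'é', '€', '漢']:
    print(ch, len(ch), len(ch.encode('utf-8')))
# 'e' is 1 character and 1 utf8 byte; 'é' is 1 character but 2 utf8 bytes;
# '€' and '漢' are each 1 character but 3 utf8 bytes.
```

This is exactly the distinction MySQL makes between CHAR_LENGTH() and LENGTH().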
History
1950's -- A character was (sometimes) represented in only 5 bits, on "paper tape"; no lowercase; had to "shift" to switch between letters and digits.
1960's -- A character was 6 bits; no lower case. "Punched cards" were used.
1970's -- 7-bit ASCII becomes common -- that limits you to English. And the 8-bit "byte" was invented and was coming into common usage (re: IBM 360).
1980's -- Beginning to see 8-bit encodings. Especially since 7-bit ASCII was wasting a bit of the omni-present "byte". This can handle Western European accented letters.
1990's -- The computer world realizes that there are other people in the world and embarks on Unicode and UTF8. ("UTF" = "Unicode Transformation Format")
Meanwhile, MySQL is born, but has enough problems without worrying about character sets. Through version 4.0, a CHAR is just a byte. You can put any kind of bytes, representing anything, into a VARCHAR. Thus begins the need for this discussion.
MySQL 4.1 introduced the concept of "character set" and "collation". If you had legacy data or legacy code, you probably did not notice that you were messing things up when you upgraded. Or you started with 4.1 (or later) and "latin1 / latin1_swedish_ci" and failed to notice that you were asking for trouble.
Today, it's pretty safe to simply lay down the law and say "Use utf8 for all text." If you have version 5.5.3 or later, "Use utf8mb4 for all text."
Sure, there are other character sets, and they are useful if you have a narrow focus. But, you may as well use utf8mb4.
Another take on the History
Best Practice
Best practice is to go completely utf8mb4. I will focus on utf8 and utf8mb4, but if you choose to do otherwise, keep reading; most of this discussion can still be adapted to the charset of your choice.
For collation, probably the best 'overall' collation is utf8mb4_unicode_520_ci. The xxx_unicode_520_ci collations are based on UCA 5.2.0 weight keys: https://www.unicode.org/Public/UCA/5.2.0/allkeys.txt. Collations without the "520" are based on the older UCA 4.0.0.
A web page (if that is what you have) should begin with <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Establish the characterset for talking to the MySQL server:
⚈ init_connect = 'SET NAMES utf8' in my.cnf
⚈ SET NAMES utf8 -- done immediately after connecting
⚈ Setting the character_set_% VARIABLEs individually -- but SET NAMES does the three you need.
⚈ In some cases in your my.cnf (my.ini) file, add this under [mysql] or [client]: default-character-set = utf8
For PHP:
⚈ mysql_* interface (deprecated; DO NOT USE!): mysql_set_charset('utf8'); (assuming PHP 5.2.3 & MySQL 5.0.7)
⚈ mysqli interface: mysqli_set_charset('utf8') function. See mysqli_set_charset
⚈ PDO interface: set the charset attribute in the PDO DSN, or issue SET NAMES utf8mb4.
(from #29 in Devshed's PHP tips)
PDO manual
For Java (JDBC):
⚈ ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call.
For Hikari (perhaps):
⚈ spring.jpa.properties.hibernate.connection.characterEncoding=utf-8
⚈ spring.jpa.properties.hibernate.connection.CharSet=utf-8
⚈ spring.jpa.properties.hibernate.connection.useUnicode=true
(or maybe it is =yes and =utf8)
add
spring:
datasource:
connectionInitSql: "SET NAMES 'utf8mb4'"
in the application.yml. connectionInitSql is used by HikariCP when it opens the connection.
For Java/JSP
Something about 'filters'
SET NAMES can be invoked by your language just like other non-SELECT commands: mysqli_query(), do(), execute(), etc.
I digress here for a note about GRANTs. 'root' (and any other user granted ALL ON *.*) has the "SUPER" privilege and thereby skips init_connect, so that technique is not 'perfect'.
SET NAMES
Declare most CHAR/TEXT columns in all tables as CHARSET utf8. Some exceptions: columns with hex (MD5, GUID), ... These should be BINARY or CHAR charset ASCII.
Collation?
For true text, utf8_unicode_ci is best. It handles the complex rules of "combining characters", etc. It handles essentially all languages simultaneously, but compromises on ordering.
utf8_general_ci is the default for utf8, so you may accidentally get this. It is a little faster than utf8_unicode_ci and works ok for many situations.
utf8_bin says to just compare the bytes. CHAR/TEXT utf8 with utf8_bin validates (on INSERT) that the bytes comprise valid utf8, but does not do anything useful for comparisons other than exact (no case folding, etc) equality. BINARY/BLOB should usually be used instead of CHAR with utf8_bin; this stores the bytes without any checking.
Alas, MySQL's collations think of case folding and accent-stripping as equivalent. That is, there is no utf8 collation for case folding, but keeping accents distinct. Do SHOW COLLATION LIKE '%cs'; to see the few collations that work that way.
This document is evolving. It started with just 'utf8', but the 'standard' is becoming 'utf8mb4'. The document is inconsistent as to which it specifies. If you are running MySQL before 5.5.3, you have only 'utf8'. If you need Emoji or Chinese, then you need 'utf8mb4'.
Conversions and Common Errors
This is a summary of how things work, and what is likely to go wrong.
There are 3 dimensions to character set problems:
⚈ How the client's bytes are encoded when INSERTing, and what encoding you want when SELECTing;
⚈ What SET NAMES you use (or what the default is);
⚈ The CHARACTER SET on the column definition.
All is well if the SET NAMES agrees with the encoding of the bytes in the Client. The column's CHARACTER SET need not agree with SET NAMES; if they differ, a conversion will be performed for you. If your characters exist in both encodings, the conversion will be transparent. If a source character does not exist in the target encoding (example: when converting a Chinese character from utf8mb4 to latin1), a "?" is usually put in its place.
The main thing that can go wrong is that SET NAMES can disagree with the Client's bytes. Let's walk an acute-e (é) through the INSERT and SELECT. First, note the encodings for é:
⚈ Hex for latin1's 1-byte encoding: E9
⚈ Hex for utf8's 2-byte encoding: C3A9
(BTW, this and some other encodings are enumerated below.)
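You can verify these two encodings with a quick Python check (an illustration of the byte values, not MySQL itself):

```python
# The same character, two encodings:
assert 'é'.encode('latin-1').hex() == 'e9'    # latin1: 1 byte
assert 'é'.encode('utf-8').hex() == 'c3a9'    # utf8: 2 bytes
print('ok')
```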
Case 1: SET NAMES latin1 (the default) is in effect and your application is thinking in utf8. It INSERTs é encoded as utf8 (hex C3A9):
⚈ C3A9 will become two latin1 characters, 'Ã©', "on the wire". (C3 represents Ã in latin1; A9 represents ©.)
⚈ If the CHARACTER SET for the column is latin1, then those 2 characters (2 bytes) are inserted.
⚈ If the CHARACTER SET for the column is utf8, then they are converted to utf8: Ã ⇒ C383 and © ⇒ C2A9; the 4 bytes C383C2A9 are inserted into the table.
Continuing the case, let's do a SELECT:
⚈ For a utf8 column, the C383C2A9 is converted back to latin1 'Ã©' (C3A9) for sending across the wire to the client.
⚈ For a latin1 column, no conversion is performed, so, again, C3A9 goes across the wire.
⚈ The client receives the two bytes C3A9, which SET NAMES declares to be 'Ã©'. However, since the rest of your application is thinking utf8, it sees them as é.
C383C2A9 is an example of what I call "double encoding" because of the two 'wrong' conversions from latin1 to utf8. The resulting SELECT (at least for European text) looks exactly like what you INSERTed. However, comparisons (WHERE x>y) and sorting (ORDER BY) and, in a few cases, equality (x=y) will not work as expected.
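The Case 1 byte arithmetic can be reproduced in Python (a sketch of what the server does, not actual MySQL code):

```python
client_bytes = 'é'.encode('utf-8')            # C3A9 -- the app thinks utf8
# SET NAMES latin1 makes the server read those bytes as two latin1 chars:
on_the_wire = client_bytes.decode('latin-1')  # 'Ã©'
# Storing into a utf8 column converts each of those chars to utf8:
stored = on_the_wire.encode('utf-8')
print(stored.hex())                           # c383c2a9 -- double encoding
```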
Now, let's flip the situation: Case 2: SET NAMES utf8, but you are inserting é encoded as latin1 (E9):
⚈ SET NAMES is falsely asserting that E9 is a valid utf8 encoding of something, which it is not.
⚈ The INSERT finishes (with a WARNING that you probably ignored), but the string is truncated before the é.
⚈ The CHARACTER SET of the column does not matter since the string is truncated.
⚈ A SELECT will get only the truncated string.
Case 3: One application INSERTs and another SELECTs, but they are using different SET NAMES. All sorts of messes can occur.
Case 4: You have an old latin1 table (possibly dating back to MySQL 4.0, which did not really have any CHARACTER SET concept) and you want to modernize it to utf8 or utf8mb4. First, you need to see what is in it. (A later section discusses using HEX to help.)
⚈ All ASCII -- no problem.
⚈ All latin1 (eg, E9 for é) -- Continuing to call it latin1 is fine, unless you anticipate adding some Asian text in that column.
⚈ All utf8 (eg, C3A9 for é) -- Trickier. ALTER to VARBINARY/BLOB, then ALTER back to VARCHAR/TEXT. See the 2-step ALTER below.
⚈ A mixture of latin1 and utf8 -- Warm up your resume.
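Why the "utf8 bytes in a latin1 column" case is tricky can be sketched in Python: the bytes in the table are fine; only the column's label is wrong (illustration only):

```python
raw = bytes.fromhex('c3a9')     # utf8 'é' sitting in a latin1 column
# SELECTed through a utf8 connection, MySQL converts "latin1" -> utf8:
print(raw.decode('latin-1'))    # Ã© -- the familiar garbage
# After the 2-step ALTER relabels the column utf8, the same bytes read:
print(raw.decode('utf-8'))      # é
```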
Case 5: A table is declared to be latin1 and correctly contains latin1 bytes, but you would like to change it to utf8.
⚈ ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4; -- This actively changes the necessary bytes in the columns
or
⚈ ALTER TABLE tbl MODIFY col1 ... CHARACTER SET utf8mb4;
is similar to the above, but works only one column at a time, and needs exactly the right stuff in the MODIFY clause. Hence, it would be quite tedious.
What symptoms exist for the cases?
Case 1 is virtually symptomless. SELECT usually gets what was INSERTed; comparison problems may go unnoticed.
Case 2 is often recognized by having truncated strings.
Case 3 is a mess.
Case 4 is symptomless. However, if the conversion is not done properly, you can end up with some other case.
Case 5 is symptomless, but you want to branch out in the world from your original, limited, CHARACTER SET.
When do bytes get translated?
If you do all the "right things", you will accidentally have a "correctly" performing application. You need to understand when and how bytes are modified, in order to fix existing problems and to build your application correctly.
Short answer: You tell MySQL how the bytes in your application are encoded; you tell MySQL what encoding to use in your tables; MySQL converts as needed when INSERTing / SELECTing.
You tell MySQL about your application bytes by using SET NAMES or the underlying variables. You tell MySQL about the table bytes in CREATE TABLE.
Actually, the charset and collation of a column has a long, not too consistent, defaulting mechanism:
⚈ Default for system
⚈ Default for a database ...
⚈ Default for a table
⚈ Settings for a column
⚈ Fields/literals can be converted on the fly
⚈ A different collation can be specified in a WHERE clause...
To see what charset & collation are in effect for a column, use SHOW CREATE TABLE.
Life Cycle of a Character
Data ➡ Your client application ➡ INSERT into MySQL table. Simple? No.
You get the data from somewhere. Is the e-acute in "Beyoncé" represented in one latin1 byte, hex E9? Or is it the two-byte utf8 C3 A9? Not knowing which you have is the first step toward having a mess in the database. More in a minute.
Meanwhile, you have declared a column in a table in a database in MySQL to be utf8. (Collation does not matter in this discussion.)
Now you INSERT INTO Tbl (name) VALUES ("Beyoncé"). And the mess continues.
Was that value being inserted encoded in latin1 or utf8? MySQL is happy to take either, and will gladly do any transcoding to store it properly in the table. But you have to tell it! The default may not be right!
Your application announces to MySQL the encoding via any of
⚈ The VARIABLE character_set_client
⚈ Executing SET NAMES
⚈ init-connect in my.cnf (my.ini) (Caution: root bypasses this)
I recommend you execute SET NAMES in your application, right after connecting to the database. This reminds you, in your code, of the encoding you are establishing for the connection.
Example of a mess: The data is in utf8, the connection settings are in latin1. The table is in utf8. So, MySQL converts each of C3 and A9 (which it was told was latin1) to utf8, leading to 4 bytes in the table: C3 83 C2 A9. I call this double-encoding. The problem was caused by lying (usually accidentally) to MySQL about what encoding was in the application's data.
Another example: The data was loaded in an old system in the following manner. The application got utf8 bytes and blindly stored them in a latin1 column (using the default latin1 settings for character_set% -- or it was done in 4.0, which had no character set concept). This left e-acute as 2 bytes in the table in a latin1 column. Then, you decide to 'do the right thing' and switch to utf8. Now what?
BIG5 example: You have an application that lets the user enter data into a form. This data is stored into a VARCHAR. All along the way, you use latin1. But the user connected from China with his browser set to BIG5. The latin1 column happily accepted the 2-byte Chinese characters. But your application reading the data has no clue of actual encoding. That same user, connecting in that same way, and rereading his data, will probably see the latin1 treated as BIG5, and be oblivious of the "two wrongs make a right" that happened. But display that text elsewhere, and you have a mess.
How Mangling Happens on INSERT
When INSERTing a string, 3 things make a difference on whether the inserted value is good or bad.
⚈ Data in client is _encoded_ latin1 versus utf8
⚈ SET NAMES (or equivalent) is latin1 versus utf8
⚈ The target column's CHARACTER SET is ascii vs latin1 vs utf8
The outcome is good or bad according to this _ordered_ list of possibilities:
Client Byte Encoding    SET NAMES    column CHARSET   Outcome
--------------------    ---------    --------------   -------
7-bit (eg, English)     (any)        (any)            GOOD: bytes stored unmodified
latin1 8-bit            utf8         (any)            BAD: string truncated
any 8-bit byte          (any)        ascii            BAD: '????'
utf8 (not West Europe)  utf8         latin1           BAD: '????'
utf8 (West Europe)      utf8         latin1           GOOD: latin1 has the equivalent values
(any)                   Matches Enc  Matches Enc      GOOD: bytes stored unmodified
(any)                   Matches Enc  different        GOOD: bytes transcoded to CHARSET
utf8 8-bit              latin1       latin1           BAD: 'é' -> 'Ã©' and similar garbage; see "2-step ALTER"
utf8 8-bit              latin1       utf8             BAD: see "double encoding"
In the cases of truncation and '?', data is lost. In the other two 'BAD' cases, it is possible to repair the table (in 2 different ways).
Mixing of utf8 and utf8mb4 (client byte encoding vs. SET NAMES vs. CHARSET) may lead to truncation. Perhaps: SET NAMES utf8 -> truncation and CHARSET utf8 -> '?'
Disaster before INSERT
Most Mojibake starts even before the text is in the table. Let's dissect the steps of doing an INSERT.
Step 1 -- Get the data.
First you get the data from somewhere
⚈ Type it in, using a special keyboard, or keyboard tricks
⚈ Load it from a file
⚈ Create it from constants in your client programming language
⚈ Receive text from an html form
What do you have? You probably don't know. There is no single standard for what bytes are generated.
Step 2 -- INSERT
Let's walk through inserting a single character: e-acute. INSERT INTO tbl VALUES ('é'); or $str = ...; INSERT INTO tbl VALUES ('$str'); Well, this is really two steps.
Two likely cases for the encoding of e-acute:
⚈ latin1 (or code pages 1250, 1252): hex E9
⚈ utf8: 2 hex bytes: C3A9
Step 2a -- Send the command to the MySQL server
The bytes that make up the INSERT statement are sent from the client. Here enters the VARIABLE character_set_client. This tells the client what the bytes mean.
What is logically sent across the wire:
                        Client byte(s)
character_set_client:   E9            C3A9
    latin1              é             Ã©
    utf8                (truncated)   é
⚈ Upper-left and lower-right: The correct 'character' is sent.
⚈ Upper-right: Since you said the bytes are 'latin1', then C3A9 must be the two chars 'Ã©'.
⚈ Lower-left: Since E9 (and what follows) is not valid utf8, the string is truncated.
Step 2b -- Do the INSERT
The 'characters' on the wire are now transliterated to the desired encoding for the table.
                     On the wire
column's charset:    é         Ã©
    latin1           E9        C3A9
    utf8             C3A9      C383C2A9
⚈ Left column: all is well.
⚈ Upper-right: Mojibake. Treated as latin1, C3A9 looks like 'Ã©'.
⚈ Lower-right: "Double encoding" -- discussed at length below.
See also: Mojibake
Thread with data entry issues.
Graphemica é
Another thread
Diagnosing CHARSET issues
You have some non-English characters that are showing up strangely, perhaps as a question mark (?), perhaps as two letters when you were expecting one, such as ç displayed as Ã§.
First, let's get the lay of the land:
SHOW SESSION VARIABLES LIKE 'character_set%';
| character_set_client | latin1
| character_set_connection | latin1
| character_set_results | latin1
(There will be other rows; these three are the important ones.) SET NAMES sets those three; this example shows the result of SET NAMES latin1, not the recommended setting. More later.
SHOW CREATE TABLE tbl will show the charset and collation for each column. More later.
You have data in a table, but it is garbled when displaying it. The first thing to do is to decide whether the data was stored correctly. Find a small cell that is bad, and do this:
SELECT x, HEX(x), LENGTH(x), CHAR_LENGTH(x) FROM ... WHERE ... LIMIT 1
You will get 4 columns:
⚈ x -- probably not useful, but may help confirm that you got the cell you wanted.
⚈ HEX(x) -- This lets us look at the encoding; more later.
⚈ LENGTH is the number of _bytes_.
⚈ CHAR_LENGTH is the number of _characters_. (Remember the distinction, above?)
LENGTH vs CHAR_LENGTH -- Are they equal? For correctly stored utf8 containing any non-ascii text, they should not be equal. Looking at a single utf8 character of various types, you would see (bytes:characters):
⚈ ASCII: 1:1
⚈ LATIN1: 1:1
⚈ English in utf8: 1:1 -- Ascii is a subset of utf8
⚈ Western European accented letters: 2:1 -- one character is encoded in two bytes
⚈ Eastern Europe: also 2:1 (mostly)
⚈ East Asia: mostly 3:1
⚈ "double encoding" 4:1 (Europe) or 6:1 (Asian)
⚈ With utf8mb4, a few Chinese characters will be 4:1
A typical phrase in French might have LENGTH=20 and CHAR_LENGTH=17, implying that there were 17 characters, 3 of which had accents.
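The same ratio check can be mimicked in Python, where len() of a str counts characters and len() of its utf8 encoding counts bytes (an illustration; MySQL's functions are CHAR_LENGTH and LENGTH):

```python
s = 'crème brûlée'               # 12 characters, 3 of them accented
print(len(s))                    # 12 -- like CHAR_LENGTH(x)
print(len(s.encode('utf-8')))    # 15 -- like LENGTH(x): 12 + 3 extra bytes
```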
latin1 is happy to cram Western European characters into 1 byte.
"Double encoding" is a term I made up for the following situation. It is a case where "two wrongs make a right".
⚈ Table defined utf8
⚈ INSERTer declared latin1 (used default instead of doing SET NAMES)
⚈ Data being INSERTed is actually utf8 already.
What happened:
⚈ A 2-byte letter (say, a grave-e) was correctly represented in utf8.
⚈ The INSERT statement handed the 2 bytes to MySQL, but implied the need for conversion
⚈ Each byte was converted to utf8
⚈ The table has 4 bytes.
When SELECTing, the reverse happens (4->2->1), and the user is oblivious of the bug. Looking at CHAR_LENGTH will spot it. Strange orderings may happen. details of a double encoding
A good writeup on double-encoding
Double Encoding 'answer'
Looks like Mojibake, but really half-Double-Encoding
If you are moving the data using mysqldump, consider looking at the dump file via some hexdump utility. For experimenting, mysqldump lets you specify one database, one table, and constrain the rows via --where="...".
In the hexdump, you need to be able to recognize various byte patterns: In this, "yy" stands for the range 80-BF.
⚈ 09,0A,0D -- The only control character you are likely to encounter (TAB, NL, CR)
⚈ 20 -- space (handy to know when reading hex dump)
⚈ 21-7E -- Ascii characters
⚈ C2yy -- symbols
⚈ C3yy -- Typical accented letters of Western Europe
⚈ Cxyy -- More Western Europe: Latin (C3-CA), Combining Diacritical Marks (CC-CD), Greek (CE-CF)
⚈ Dxyy -- Cyrillic (D0-D4), Hebrew (D6-D7), Arabic/Persian/Farsi (D8-DB), etc
⚈ E0yyyy -- various Indian character sets, southern Asia, etc.
⚈ E1yyyy -- Cherokee, Balinese, Khmer, Mongolian, etc.
⚈ E2yyyy -- Symbols, Braille, etc
⚈ E381yy -- Hiragana (Japanese)
⚈ E383yy -- Katakana (Japanese)
⚈ Exyyyy -- E3-EA: Chinese, Japanese; EAB0-ED9E: Hangul (Korean)
⚈ EFBBBF -- BOM indicator (start of file only)
⚈ EFAxyy -- CJK Extension A
⚈ EFACyy -- Hebrew (obscure)
⚈ EFBxyy -- Arabic (obscure)
⚈ F0yyyyyy -- 4-byte utf8. (See notes about utf8mb4)
⚈ F0Axyyyy -- CJK Extension B
Note, especially, the absence of 80-BF unless preceded by some other 8-bit code.
Codepage layout (Wikipedia ref)
Another more detailed list
(but aimed at Unicode instead of utf8)
If the data is not utf8, you may see
⚈ 00xx -- Lots of these pairs could mean Ascii in UCS2. (Shift-JIS is also possible)
⚈ FFFE (or FEFF) at start of file (BOM) -- UTF-16 file (not a cell in a MySQL table)
⚈ 00-1F -- control characters (mostly 0a, newline)
⚈ 20 -- space (handy to know when reading hex dump)
⚈ 21-7E -- Ascii characters
⚈ 7F -- DEL; rare
⚈ 80-FF -- "8-bit characters":
⚈ 80-9F -- Possibly MicroSoft Word things (smart quotes, etc)
⚈ 80-FF -- valid latin1 encodings, mostly European accented letters
⚈ A0 -- "hard space" -- mostly from MicroSoft
⚈ Cx not followed by yy (80-BF) cannot be utf8. This is usually a quick clue that it is not utf8.
Note: The Microsoft things really need to be fixed _before_ trying to store into a table. MicroSoft characters
In hex, "double encoding" will have a lot of "Cx yy Cx yy".
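A rough heuristic for spotting these patterns, sketched in Python (the function names and the approach are mine, for illustration; it handles the strict-latin1 flavor of double encoding):

```python
def looks_like_utf8(b: bytes) -> bool:
    """True if the bytes decode cleanly as utf8."""
    try:
        b.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

def looks_double_encoded(b: bytes) -> bool:
    """True if valid utf8 that, re-read as latin1 characters, is again
    valid utf8 -- the 'Cx yy Cx yy' signature.  (Heuristic only:
    legitimately stored text such as 'Ã©' also matches.)"""
    try:
        s = b.decode('utf-8')
        s.encode('latin-1').decode('utf-8')   # raises if not the pattern
        return any(ord(c) > 0x7F for c in s)  # ignore pure ascii
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False

print(looks_like_utf8(bytes.fromhex('c3a9')))          # True: utf8 é
print(looks_like_utf8(bytes.fromhex('e9')))            # False: latin1 é
print(looks_double_encoded(bytes.fromhex('c383c2a9'))) # True
```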
The manual
charset-connection.html
is the manual page that explains the interactions of the vital variables:
⚈ character_set_client -- the encoding in the client.
⚈ character_set_connection -- what to re-code characters into for client-to-server transfer
⚈ character_set_results -- what to re-code characters into for server-to-client transfer
The page also discusses collation_connection and collation_database, which have less impact.
Between that manual page and this document, I hope that you can understand things as they relate to your situation.
To see your current settings:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
Viewing text
Alas, CMD (Windows) and perhaps other terminal emulators cannot (or will not by default) display utf8 characters. This makes it difficult to debug. And it is confusing to see incorrect characters when everything else is working correctly.
Viewing text in a browser can be misleading in a different way. Browsers may "fix" things for you. For example, 'double encoded' characters may appear to be correct in the browser!
Bottom line: SELECT HEX() is about the only thing to trust.
Fixes for various Cases
If you are running MySQL before 5.5.3, you have only 'utf8'. If you need Emoji or Chinese, then you need 'utf8mb4'. The advice in this section assumes you will be going to utf8mb4. If you are going only to utf8, adjust accordingly.
Table is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
Column is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4 without touching other columns:
ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4;
(Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset:
First, lets assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
CHARACTER SET utf8mb4 with double-encoding:
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);
Another ref
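What that UPDATE does to the bytes can be traced in Python (a sketch of the conversion chain, not the UPDATE itself):

```python
stored = bytes.fromhex('c383c2a9')   # double-encoded é in the table
# CONVERT(col USING latin1): utf8 'Ã©' -> the latin1 bytes C3A9
# BINARY(...): treat those bytes as raw bytes
# CONVERT(... USING utf8mb4): reinterpret C3A9 as utf8 -> é
fixed = stored.decode('utf-8').encode('latin-1')
print(fixed.hex())                   # c3a9
print(fixed.decode('utf-8'))         # é
```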
CHARACTER SET latin1 with double-encoding: Do the 2-step ALTER, then fix the double-encoding.
Question marks / Truncation The data is lost.
Stored Routines Stored Procedures, etc, may need to be recreated if, for example, they were built with utf8 but need to be utf8mb4. StackOverflow
See also Fixing column encoding
Entering accents in CMD
If isn’t displays as isnâ€™t, you forgot SET NAMES utf8. If isn’t displays (in the mysql commandline tool in Windows) as isn?t, see the above info.
As noted above, CMD in Windows is limited in its character set handling.
The following is a clumsy way to enter Western European characters. It depends on your knowing the hex or decimal values for the characters you want. Sorry, this won't handle Hebrew, Cyrillic, Chinese, etc.
Press and hold ALT, then type
⚈ the decimal digits of the code-page value, or
⚈ 0, then the decimal digits of the Unicode value
to get the extra characters in latin1. When you let go, the accented character will appear. This implies that you need SET NAMES latin1, not utf8, since you are creating latin1 bytes, not utf8 bytes, in your 'client'. For the encodings, see
⚈ shortcuts.
More
info on how to enter Unicode, and about "code pages".
The command "chcp" controls the "code page". chcp 65001 provides utf8, but it needs a special charset installed, too. some code pages
To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console
5.6 ref manual
Note: There were bugs in 5.5's use of utf8 in the commandline 'mysql'.
GNU Unifont
Example of Double-encoding
Desired string: სახლი არის
Hex in table:
C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1
Correct utf8 hex:
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
Analysis:
C3A1C692C2A1 -- should be E183A1 -- Georgian SAN (U+10E1) "ს"
C3A1C692C290
C3A1C692C2AE
C3A1C692C5A1
C3A1C692CB9C
20 -- space (same in utf8 and latin1)
C3A1C692C290
C3A1C692C2A0
C3A1C692CB9C
C3A1C692C2A1
Georgian, when properly encoded in utf8, should be 3 bytes per character:
E182xx
E183xx
E2B4xx
Taking the first "character" (C3A1C692C2A1, in hex):
⚈ latin1 E1 = utf8 C3A1
⚈ latin1 83 = utf8 C692
⚈ latin1 A1 = utf8 C2A1
Notice how most of the letters look very similar: C3A1C692xxyy, when they should start with E183xx.
One pass over the data to convert each two bytes (4 hex) from utf8 to latin1:
SELECT
HEX(CONVERT(CONVERT(UNHEX(
'C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1'
) USING utf8) USING latin1))
⇒
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
Now to display as utf8:
SELECT UNHEX('E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1')
⇒
სახლი არის
(Ref: forum thread #418885.)
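The same one-pass repair, mimicked in Python. One wrinkle: MySQL's latin1 is essentially cp1252 (as noted below), but it also assigns the five bytes cp1252 leaves undefined (81, 8D, 8F, 90, 9D) to the matching control characters, so a small helper (my own, for illustration) is needed:

```python
def mysql_latin1_encode(s: str) -> bytes:
    """Encode as MySQL's 'latin1' (cp1252 plus the 5 undefined bytes)."""
    out = bytearray()
    for ch in s:
        try:
            out += ch.encode('cp1252')
        except UnicodeEncodeError:
            out.append(ord(ch))        # e.g. U+0090 -> byte 90
    return bytes(out)

double = bytes.fromhex(
    'C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C'
    '20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1')
fixed = mysql_latin1_encode(double.decode('utf-8'))
print(fixed.hex().upper())    # E183A1E18390E183AEE1839AE1839820...
print(fixed.decode('utf-8'))  # სახლი არის
```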
Another example of double-encoding: 'C398C2AF' should have been 'D8AF'
SELECT HEX(CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1));
⇒ D8AF, which is the utf8 for Arabic DAL.
SELECT CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1);
⇒ that character.
(Ref: forum thread #565669.)
Another... from StackOverflow
SELECT hex('â˜¼STICKYâ˜¼'); --> C3A2CB9CC2BC535449434B59C3A2CB9CC2BC
SELECT unhex(hex(convert(convert(unhex('C3A2CB9CC2BC535449434B59C3A2CB9CC2BC') using utf8) using latin1))) -->
☼STICKY☼
A blog on doubly-encoded text.
This even tackles the case of a column with both doubly-encoded and OK text.
2-step ALTER
The "2-step ALTER" applies when you wrote utf8 bytes with SET NAMES latin1 to a CHARACTER SET latin1 column. The symptom is often (but not always) that your 'é' is showing as 'Ã©'.
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
Similarly, TEXT -> BLOB -> TEXT.
If col is in any indexes, you might want to DROP INDEX in the first ALTER and ADD INDEX in the second. (This is for efficiency and possibly to avoid index limitations.)
Two examples follow:
Fixing utf8 bytes in latin1 column
(The "2-step ALTER", example 1)
CAVEAT! Test before using on Production!
# Set up bad case:
mysql> create table latin1 (c varchar(11) charset latin1);
mysql> insert into latin1 (c) values ('x'), (unhex('c2a3'));
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â£   | C2A3   |         2 |              2 |
+------+--------+-----------+----------------+
# Step 1 of 2 of fix:
mysql> alter table latin1 modify c tinyblob;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â£   | C2A3   |         2 |              2 |
+------+--------+-----------+----------------+
# Step 2 of 2 of fix:
mysql> alter table latin1 modify c varchar(11) charset utf8;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c    | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| £    | C2A3   |         2 |              1 |
+------+--------+-----------+----------------+
That is, ALTER to BLOB (of appropriate size), then ALTER to VARCHAR (of original size) and CHARSET utf8.
This technique should work for other cases of the right encoding sitting in a 5.1 table with the wrong CHARSET declaration.
One more note: I suggest you include the COLLATE clause on the final ALTER that converts to VARCHAR.
Fixing mismatch between CHARSET and data encoding
(The "2-step ALTER", example 2)
The problem: cp850 encoding was inserted into a latin1 field.
The following will build an example, then walk through two solutions. The solutions should work for any mis-coded field where bytes for one charset had been stuffed into a column with a different charset declaration (usually latin1). This particular example stems from Forum thread 616834
See also: cp850 encodings
latin1 and utf8 encodings
-- First, let's establish cp850 everywhere:
-- In Windows, do "chcp 850" before entering mysql.
-- Then get the mysql commandline to agree:
SET NAMES cp850; -- Now CodePage and the following match
SHOW VARIABLES LIKE 'char%'; -- note the 3 cp850 entries
Solution A: Change CHARACTER SET.
Given: cp850 bytes are stored in a latin1 field. Goal: Leave the bytes alone, but change the column to be CHARACTER SET cp850.
-- Build sample table
CREATE TABLE f616834A (
str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834A\G
-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ');
-- Expect D490D2D3DED6D7D8 (cp850 encoding)
-- Don't expect C8C9CACBCCCDCECF (latin1)
-- Don't expect C388C389C38AC38BC38CC38DC38EC38F (utf8)
SELECT UNHEX('D490D2D3DED6D7D8'); -- expect 'ÈÉÊËÌÍÎÏ'
INSERT INTO f616834A (str) VALUES (UNHEX('D490D2D3DED6D7D8')) ;
SELECT str, HEX(str) FROM f616834A;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8
Notice how it looks wrong but it is encoded correctly. This is because the server/client, in doing the SELECT, is transcoding from latin1 (which it isn't) to cp850. (Some combination of "chcp latin1" and "SET NAMES latin1" would prevent the gibberish, but we don't want to go that direction since it would only mask the problem.)
The "?" in the gibberish is not really a question mark, but rather the result of an unconvertable character.
The ALTER must be done in two steps:
ALTER TABLE f616834A MODIFY str VARBINARY(111);
ALTER TABLE f616834A MODIFY str VARCHAR(111) CHARACTER SET cp850;
SELECT str, HEX(str) FROM f616834A; -- Expect: ÈÉÊËÌÍÎÏ and D490D2D3DED6D7D8 (cp850)
The CHARACTER SET changed, but the data was not.
Solution B: Change Data Encoding
Given: cp850 bytes are stored in a latin1 field. Goal: Leave the column's CHARACTER SET alone, but change the bytes in the column.
SHOW VARIABLES LIKE 'char%'; -- cp850 and "chcp 850" must match at this point.
-- Build a test table:
CREATE TABLE f616834B (
str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834B\G
-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ'); -- Expect D490D2D3DED6D7D8 (cp850 encoding)
SELECT UNHEX('D490D2D3DED6D7D8'); -- Expect 'ÈÉÊËÌÍÎÏ'
INSERT INTO f616834B (str) VALUES (UNHEX('D490D2D3DED6D7D8')) ;
SELECT str, HEX(str) FROM f616834B;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8
UPDATE f616834B SET str = CONVERT(BINARY(str) USING cp850);
SELECT str, HEX(str) FROM f616834B; -- Expect ÈÉÊËÌÍÎÏ and C8C9CACBCCCDCECF (latin1)
(A side note: CONVERT(UNHEX(HEX(str)) USING cp850) would also work.)
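To see what Solution B's CONVERT is doing at the byte level, here is a small Python sketch (an illustration, not MySQL itself) using the same accented E's and I's:

```python
# The bytes that were (wrongly) stored in the latin1 column:
stored = bytes.fromhex('D490D2D3DED6D7D8')
# Interpret them with the charset they were really written in:
text = stored.decode('cp850')        # 'ÈÉÊËÌÍÎÏ'
# Re-encode to match the column's declared charset (latin1):
fixed = text.encode('latin1')
print(text, fixed.hex().upper())     # ÈÉÊËÌÍÎÏ C8C9CACBCCCDCECF
```

The result matches the expected C8C9CACBCCCDCECF above: the characters are unchanged, but the bytes now agree with the column's latin1 declaration.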
Fixing mix of charsets in a column
You let the users enter stuff in a UI form, and you stored it into a latin1 field. The stuff entered was in latin1, big5, utf8, symbols, etc.
You have a mess. Without deducing, row by row, what the intent was, you cannot fix it.
Fixing MicroSoft thingies
Perhaps best is to use the REPLACE() function.
⚈ replace()
Fixing "double encoding"
CAVEAT! Test before using on Production!
*** Plan A: Use UPDATE to fix in-place:
UPDATE Tbl SET col =
IFNULL(CONVERT(CONVERT(CONVERT(col USING latin1)
USING binary)
USING utf8),
col )
Caution: I have only briefly tested this.
The IFNULL has the advantage of leaving intact most cells that are not double-encoded. (Maybe -- need further testing)
How does it work?
1. The first (innermost) CONVERT takes the doubly-encoded 4-byte sequences and converts them to 2 bytes.
2. Converting to binary and then to something else is a way to change the type without changing the bytes.
3. The final CONVERT to utf8 prevents the assignment from mangling things further.
4. The last step (IFNULL) undoes the operation if the result is invalid utf8 -- that is, if the cell was probably not double-encoded to begin with.
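The byte gymnastics are easier to follow outside SQL. This Python sketch (illustrative only) undoes a double encoding of the CURRENCY SIGN '¤', whose correct utf8 is C2A4 but which has been doubly encoded to C382C2A4:

```python
# The doubly-encoded bytes (correct utf8 would be just C2A4):
dbl = bytes.fromhex('C382C2A4')
# Decode once: the bytes are valid utf8 for the two characters 'Â¤'
once = dbl.decode('utf-8')                     # 'Â¤'
# Those characters, written as latin1 bytes, ARE the original utf8 encoding:
fixed = once.encode('latin1').decode('utf-8')  # '¤'
print(once, fixed)
```

If `once.encode('latin1')` fails (the text was not double-encoded), that corresponds to the NULL that IFNULL catches in Plan A.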
*** Plan B: Use ALTER:
mysql> # Create sample:
mysql> CREATE TABLE dbl (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO dbl (c) VALUES ('x'), (UNHEX('C382C2A4'));
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| Ã‚Â¤ | C382C2A4 |         4 |              4 |
+------+----------+-----------+----------------+
mysql> # Step 1 of 5:
Note: Steps 1 and 2 are needed only if you start with latin1;
If your column is already utf8, skip to step 3.
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| Ã‚Â¤ | C382C2A4 |         4 |              4 |
+------+----------+-----------+----------------+
mysql> # Step 2 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c    | HEX(c)   | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x    | 78       |         1 |              1 |
| Â¤   | C382C2A4 |         4 |              2 |
+------+----------+-----------+----------------+
mysql> # Step 3 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET latin1;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â¤   | C2A4   |         2 |              2 |
+------+--------+-----------+----------------+
mysql> # Step 4 of 5:
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| Â¤   | C2A4   |         2 |              2 |
+------+--------+-----------+----------------+
mysql> # Step 5 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c    | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x    | 78     |         1 |              1 |
| ¤    | C2A4   |         2 |              1 |
+------+--------+-----------+----------------+
mysql> # Check schema:
mysql> SHOW CREATE TABLE dbl \G
CREATE TABLE `dbl` (
`c` varchar(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
One more note: I suggest you include the COLLATE clause on the final ALTER that converts to VARCHAR.
*** Plan C: Using mysqldump
(This has had only limited testing.)
1. mysqldump --default-character-set=latin1 .... my_database > my_database_latin1.sql
2. Edit my_database_latin1.sql to set NAMES=utf8 at the top.
3. mysql ... < my_database_latin1.sql
Thanks to Peter Barry
*** Plan D: A function
DELIMITER $$
CREATE FUNCTION maybe_utf8_decode(str text charset utf8)
RETURNS text CHARSET utf8 DETERMINISTIC
BEGIN
declare str_converted text charset utf8;
declare max_error_count int default @@max_error_count;
set @@max_error_count = 0;
set str_converted = convert(binary convert(str using latin1) using utf8);
set @@max_error_count = max_error_count;
if @@warning_count > 0 then
return str;
else
return str_converted;
end if;
END$$
DELIMITER ;
Then do this for each column in each table that is likely to have doubly-encoded text:
UPDATE tbl SET col = maybe_utf8_decode(col);
Thanks to Joni Salonen for Plan D: fixing-doubly-utf-8-encoded-text-in-mysql
Fix definition and data, IF correctly encoded
You have a latin1 table containing latin1 encodings; you want utf8. (This is not the general thrust of this document, but it is included for completeness, given how tempting the command looks.)
mysql> # Create sample:
mysql> CREATE TABLE trycvt (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO trycvt (c) VALUES ('x'), (UNHEX('BB'));
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
`c` varchar(11) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| » | BB | 1 | 1 |
+------+--------+-----------+----------------+
mysql> # Step 1 of 1
mysql> ALTER TABLE trycvt CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| » | C2BB | 2 | 1 |
+------+--------+-----------+----------------+
mysql> # Note schema changed:
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
`c` varchar(11) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Note: TEXT will become MEDIUMTEXT, etc. This is because transcoding latin1 to utf8 usually lengthens the data. You may want to follow with an ALTER TABLE ... MODIFY c TEXT ... to put it back.
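The lengthening is easy to quantify outside MySQL; a quick Python check (illustrative only):

```python
s = 'ÈÉÊËÌÍÎÏ'                      # 8 accented characters
print(len(s.encode('latin1')))      # 8  -- one byte each in latin1
print(len(s.encode('utf-8')))       # 16 -- two bytes each in utf8
```

Accented Western European characters double in size; many Asian characters go from 2 bytes in their native charsets to 3 in utf8.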
Note about ALTER: Generally it is possible, and faster, to do all the ALTERs of a table in a single ALTER statement.
Haidong Ji Blog
performanceblog
Testing an in-place fix
Some of the above fixes require live changes. Here is a way that might build confidence that it works:
⚈ find some rows with accents (etc) in them
⚈ create a table with the _same_ schema; copy the rows into it (INSERT ... SELECT)
⚈ Try one of the conversions on that table
⚈ Check the results -- LENGTH and CHAR_LENGTH should now be different, and should differ by the appropriate amount. Fetch the rows in your application; see if they look right.
Fixing while moving data with mysqldump
You are copying the data from, say, a 4.x server to, say, a new 5.1 server. It may be easiest to convert the data as you copy. You can test the results. If necessary, you can adjust and reapply the conversion.
Overview:
mysqldump -h old_host --no-data >schema.sql
# now edit schema.sql to fix charset, collation, etc
# edit in/out SET NAMES
mysql -h new_host <schema.sql
mysqldump -h old_host --no-create-info --opt >data.sql
# edit in/out SET NAMES
mysql -h new_host <data.sql
Notes (and things that may vary with your situation)
⚈ May need --default-character-set=utf8
⚈ Other issues?
⚈ (aside) Do not dump the database information_schema
⚈ (aside) Do not dump and reload mysql if you are moving between versions; there are usually schema changes that will cause trouble.
Fixing while moving data with LOAD DATA
⚈ SELECT INTO OUTFILE may create something useful for LOAD DATA.
⚈ Perhaps use mysqldump suggestions for creating schema
⚈ LOAD DATA ... CHARACTER SET utf8 ...; (as of 5.0.38 / 5.1.17)
⚈ Sorry, no experience in this approach.
Conversion Gotchas
⚈ DROP and re-add FULLTEXT indexes around changes.
⚈ DROP and re-add all(?) FOREIGN KEY CONSTRAINTs; perhaps disabling them will suffice
⚈ Collation changes could lead to duplicated keys
⚈ Collation changes could lead to broken VARCHAR FOREIGN KEY CONSTRAINTs
4-byte utf8
Inserting a 4-byte utf8 character into a column with CHARACTER SET utf8 will cause the string to be truncated starting with the 4-byte character.
MySQL versions 5.1 and before do not handle utf8 characters longer than 3 bytes. Version 5.5.3 (March, 2010) added a new CHARACTER SET, utf8mb4, for utf8 encodings up to 4 bytes.
⚈ Introducing utf8mb4
⚈ Upgrading, including sample ALTERs
⚈ 'f0 a2 88 a2' showing up as 'd848 de22'
⚈ Graphemica
⚈ Winking Face
utf8mb4 is necessary for a small number of Chinese characters.
The MySQL changelog for 5.5.3 says "The Unicode implementation has been extended to provide support for supplementary characters that lie outside the Basic Multilingual Plane (BMP)."
If your product is going to store characters from Hong Kong, utf8 is not a good option since many characters in HKSCS are 4-byte utf8 characters. utf8mb4 would be better.
A good article on a variety of related issues (SET NAMES, utf8mb4, collations, converting from utf8 to utf8mb4, index lengths, character-set-client-handshake, init-connect, Java, etc.) is by Mathias Bynens
A lot of the meat is in the Comment section.
Much of this blog mentions utf8. Unless otherwise specified, utf8mb4 applies, too. Occasionally, "utf8" may refer to the standard, not specifically to MySQL's 3-byte subset.
😁 (U+1F601) => F09F9881
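A quick Python check (illustrative; not MySQL) of which characters need how many bytes:

```python
print('é'.encode('utf-8').hex())            # c3a9     -- 2 bytes: utf8 suffices
print('€'.encode('utf-8').hex())            # e282ac   -- 3 bytes: still within utf8
print('\U0001F601'.encode('utf-8').hex())   # f09f9881 -- 4 bytes: needs utf8mb4
```

Anything beyond the Basic Multilingual Plane (code points above U+FFFF, such as Emoji and some Chinese characters) takes 4 bytes.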
VARCHAR(191) -- InnoDB indexes are limited to 767 bytes, which is enough for 255 characters of utf8 (3 bytes each, 765 bytes). However, utf8mb4 can take 4 bytes per character, so the limit for indexing a utf8mb4 column is 191 characters. So, if you are picking 255 for no good reason, and if 191 will do, use 191. To check the max length actually in a field:
SELECT MAX(CHAR_LENGTH(col)) FROM tbl;
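The arithmetic behind 191, sketched in Python (767 is the index-prefix limit being assumed here):

```python
INDEX_LIMIT = 767                  # bytes available for the indexed prefix
assert 255 * 3 <= INDEX_LIMIT      # utf8: 3 bytes/char, VARCHAR(255) indexes fine
assert 192 * 4 >  INDEX_LIMIT      # utf8mb4: VARCHAR(192) is one char too many
assert 191 * 4 <= INDEX_LIMIT      # utf8mb4: VARCHAR(191) fits (764 bytes)
```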
How to upgrade utf8 to utf8mb4? Probably this will do:
ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4 ...;
Emoji (such as ) are all equal under utf8mb4_unicode_520_ci, but different with COLLATE utf8mb4_unicode_ci. Stackoverflow
If you currently have utf8 on all the columns in a table, and want to covert to utf8mb4,
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8mb4;
However, this may fail for index issues. If you have an indexed VARCHAR(255), that turns out to exceed the 767-byte limit. Change it to VARCHAR(191) if the column does not really need 255. If you really need more than 191, things get more complicated. StackExchange thread
Repeat the CONVERT on the other tables.
Functions to know about
Since you may be using some of these against the source (old) database, be cautious of whether the function exists.
String functions
⚈ CHAR_LENGTH(field) -- num chars in field
⚈ LENGTH(field) -- num bytes in field
⚈ CHAR(123 USING latin1)
⚈ CHARSET(expression)
⚈ UPPER(_binary'Abc') = 'Abc' -- a subtle surprise (other string functions, too)
⚈ ORD(char) = ASCII(char) = number -- for the _char_, not the _byte_
⚈ HEX(string) -- use to see actual encoding
⚈ UNHEX('AABBCC') -- does not exist before 4.1.2
Convert functions
⚈ CONVERT('...' USING utf8)
⚈ CONVERT(blob_col USING utf8) -- no "USING" before 4.1.16
⚈ HEX(CAST(_ucs 0x00a0 as CHAR CHARACTER SET utf8)) -- show that Unicode 00a0 is utf8 C2A0
Literal introducer
⚈ _utf8'...'
PHP
⚈ mysql: mysql_set_charset('utf8');
⚈ mysqli: $mysqli_obj->set_charset('utf8');
⚈ PDO: $db = new PDO('mysql:host=...;dbname=...;charset=utf8mb4', $user, $pwd); or by executing SET NAMES utf8mb4.
Note: the use of mysqli_set_charset / set_charset should obviate the need for SET NAMES.
If you change the collation of a field, ORDER BY cannot use any INDEX; hence it could be surprisingly inefficient.
BOM * Byte-Order-Mark
Wikipedia
says
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
If, instead, you see this hex: C3AFC2BBC2BF, then you are also suffering from double encoding.
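Both symptoms can be reproduced in a few lines of Python (illustrative only):

```python
bom = '\ufeff'.encode('utf-8')        # the BOM, U+FEFF, as utf8 bytes
print(bom.hex())                      # efbbbf
print(bom.decode('latin1'))           # ï»¿ -- BOM misread as latin1/cp1252
# Double encoding: treat those 3 characters as text and utf8-encode them:
print(bom.decode('latin1').encode('utf-8').hex())   # c3afc2bbc2bf
```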
If you encounter this (presumably at the start of a file), it implies that your editor is adding it, but the reader of the file (eg, mysqldump) does not know what to do with it. Check the editor's settings, or consider using a different editor.
Error #1064 SQL syntax at line 1 * strange characters
more on BOM
If you see ÿþ (hex FFFE) at the beginning of a 'text' file, that is a Byte-Order-Mark (UTF-16, little-endian).
Python issue with BOM
ERROR 1366 (HY000) at line ...: Incorrect string value: '\xEF\xBB\xBFC# BOM
open(filename, 'r', encoding='utf-8-sig')
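Python's utf-8-sig codec reads utf8 with or without a BOM, stripping it if present; a minimal sketch:

```python
import codecs
# A file's worth of bytes, with a leading BOM:
raw = codecs.BOM_UTF8 + 'SELECT 1;'.encode('utf-8')
print(repr(raw.decode('utf-8')))      # '\ufeffSELECT 1;' -- BOM survives
print(repr(raw.decode('utf-8-sig')))  # 'SELECT 1;'       -- BOM stripped
```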
German "sharp-s"
Version 5.1.24 introduced a change in the collation of the German "sharp-s". bug 27877
(Apr. 2007) Starting with 5.1.24, 's' = 'ß' for utf8_general_ci. 'ss' = 'ß' for utf8_unicode_ci, which is 'correct' for German.
A nasty effect of this collation change occurs when a user upgrades past 5.1.24. If there were sharp-s characters in PRIMARY (or UNIQUE) KEYs, the reloading of the data may cause "duplicate key" errors.
To 'fix' the incompatibility, the collation utf8_general_mysql500_ci was added in MySQL 5.1.62 / 5.5.21 / 5.6.5. See ref page
Graphemica
More on what to do.
(Copied here:)
New utf8_general_mysql500_ci and ucs2_general_mysql500_ci collations have been added that preserve the behavior of utf8_general_ci and ucs2_general_ci from versions of MySQL previous to 5.1.24. CHECK TABLE produces this error:
Table upgrade required. Please do "REPAIR TABLE t" or dump/reload to fix it!
Unfortunately, REPAIR TABLE could not fix the problem. The new collations permit older tables created before MySQL 5.1.24 to be upgraded to current versions of MySQL.
To convert an affected table after a binary upgrade that leaves the table files in place, alter the table to use the new collation. Suppose that the table t1 contains one or more problematic utf8 columns. To convert the table at the table level, use a statement like this:
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;
To apply the change on a column-specific basis, use a statement like this (be sure to repeat the column definition as originally specified except for the COLLATE clause):
ALTER TABLE t1 MODIFY c1 CHAR(N) CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;
To upgrade the table using a dump and reload procedure, dump the table using mysqldump, modify the CREATE TABLE statement in the dump file to use the new collation, and reload the table.
After making the appropriate changes, CHECK TABLE should report no error.
For more information, see "Checking Whether Tables or Indexes Must Be Rebuilt" and "Rebuilding or Repairing Tables or Indexes" in the MySQL manual. (Bug #43593, Bug #11752408)
Where does Ö sort?
⚈ utf8_estonian_ci: Between 'W' and 'X'
⚈ danish, icelandic, swedish: After 'Z'
⚈ utf8_german2_ci: As if the two letters 'oe'
⚈ hungarian and turkish: Between 'O' and 'P' (that is, after 'oz')
⚈ Other collations (including unicode, unicode_520, 0900): As if the letter 'O'
These apply to both utf8 and utf8mb4, MySQL 8.0 and before.
Programmatically generated collation tester
Scandanavian collations
Among utf8_danish_ci, utf8_icelandic_ci, and utf8_swedish_ci (and utf8mb4 variant):
icelandic differs due to: AZ < Á < B, E < É, IZ < Í < J, UZ < Ü < V, YZ < Ý < Z. Also
danish: zz < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Aa=Å=å < Þ=þ
icelandic: zz < Þ=þ < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Å=å
swedish: zz < Å=å < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Þ=þ
(This may not have all the differences.) Programmatically generated collation tester
Circumflex
There are many versions of "circumflex":
unicode: 005E = (^) utf8: 5E = CIRCUMFLEX ACCENT = SPACING CIRCUMFLEX
unicode: 02C6 = (ˆ) utf8: CB86 = MODIFIER LETTER CIRCUMFLEX ACCENT = MODIFIER LETTER CIRCUMFLEX
unicode: 0302 = utf8: CC82 = COMBINING CIRCUMFLEX ACCENT = NON-SPACING CIRCUMFLEX
unicode: 032D = utf8: CCAD = COMBINING CIRCUMFLEX ACCENT BELOW = NON-SPACING CIRCUMFLEX BELOW
unicode: 1DCD = utf8: E1B78D = COMBINING DOUBLE CIRCUMFLEX ABOVE
unicode: A788 = (ꞈ) utf8: EA9E88 = MODIFIER LETTER LOW CIRCUMFLEX ACCENT
unicode: FF3E = (^) utf8: EFBCBE = FULLWIDTH CIRCUMFLEX ACCENT = FULLWIDTH SPACING CIRCUMFLEX
unicode: E005E = utf8: F3A0819E = TAG CIRCUMFLEX ACCENT
This debates what latin1's 0x88 (ˆ) maps to: forum discussion
Graphemica
Quotes
+----------+---------+------+----------------------------------------------------+
| utf8mb4 | Unicode | c | Name |
+----------+---------+------+----------------------------------------------------+
| 22 | 34 | " | QUOTATION MARK |
| C2AB | 171 | « | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
| C2BB | 187 | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
| E28098 | 8216 | ‘ | LEFT SINGLE QUOTATION MARK | 91
| E28099 | 8217 | ’ | RIGHT SINGLE QUOTATION MARK | 92
| E2809A | 8218 | ‚ | SINGLE LOW-9 QUOTATION MARK |
| E2809B | 8219 | ‛ | SINGLE HIGH-REVERSED-9 QUOTATION MARK |
| E2809C | 8220 | “ | LEFT DOUBLE QUOTATION MARK | 93
| E2809D | 8221 | ” | RIGHT DOUBLE QUOTATION MARK | 94
| E2809E | 8222 | „ | DOUBLE LOW-9 QUOTATION MARK |
| E2809F | 8223 | ‟ | DOUBLE HIGH-REVERSED-9 QUOTATION MARK |
| E280B9 | 8249 | ‹ | SINGLE LEFT-POINTING ANGLE QUOTATION MARK |
| E280BA | 8250 | › | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK |
| E28D98 | 9048 | ⍘ | APL FUNCTIONAL SYMBOL QUOTE UNDERBAR |
| E28D9E | 9054 | ⍞ | APL FUNCTIONAL SYMBOL QUOTE QUAD |
| E29D9B | 10075 | ❛ | HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT |
| E29D9C | 10076 | ❜ | HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT |
| E29D9D | 10077 | ❝ | HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT |
| E29D9E | 10078 | ❞ | HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT |
| E29DAE | 10094 | ❮ | HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT |
| E29DAF | 10095 | ❯ | HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT |
| E3809D | 12317 | 〝 | REVERSED DOUBLE PRIME QUOTATION MARK |
| E3809E | 12318 | 〞 | DOUBLE PRIME QUOTATION MARK |
| E3809F | 12319 | 〟 | LOW DOUBLE PRIME QUOTATION MARK |
| EA9084 | 41988 | ꐄ | YI SYLLABLE QUOT |
| EFBC82 | 65282 | " | FULLWIDTH QUOTATION MARK |
| F3A080A2 | 917538 | ó ¢ | TAG QUOTATION MARK |
+----------+---------+------+----------------------------------------------------+
⚈ 91..94 are the hex for latin1 if coming from Microsoft Word.
⚈ F3A080A2 requires utf8mb4; the rest work in both utf8 and utf8mb4 (graphic probably not rendered correctly)
When is ö not equal to ö?
First, let's discuss when 'ö' is not equal to 'o'. (I will focus on utf8/utf8mb4 here.) 'o' != 'ö' for utf8_bin collation and 7 others (danish, icelandic, swedish, estonian, turkish, hungarian, and german2); see utf8collations
For MySQL 8.0 and utf8mb4, there are 12: utf8mb4collations
How could 'ö' not be equal to 'ö'? When one is a single character (hex C3B6) and the other is a combination of a 'combining diaeresis' plus 'o' -- hex CC886F.
Further discussion: Searching with German Umlauts
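Unicode normalization makes the two spellings comparable; a Python sketch (illustrative only -- MySQL's collations do not normalize for you; here the combining mark follows the base letter, which is the canonical Unicode order):

```python
import unicodedata
one = '\u00f6'       # 'ö' as a single precomposed character (utf8 C3B6)
two = 'o\u0308'      # 'o' followed by COMBINING DIAERESIS
print(one == two)                                  # False: plain comparison
print(unicodedata.normalize('NFC', two) == one)    # True: normalized
print(one.encode('utf-8').hex(), two.encode('utf-8').hex())  # c3b6 6fcc88
```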
"Combining" (non-spacing) characters
These can be used for constructing accented letters (etc) from the code plus the letter. But keep in mind that (at least) utf8_general_ci will not treat such 2-letter codes as equal to the apparently same 1-letter equivalent.
+----------+----------------------------------------------------+---------+
| hex_utf8 | name | Unicode |
+----------+----------------------------------------------------+---------+
| CC80 | COMBINING GRAVE ACCENT | U+0300 |
| CC81 | COMBINING ACUTE ACCENT | U+0301 |
| CC82 | COMBINING CIRCUMFLEX ACCENT | U+0302 |
| CC83 | COMBINING TILDE | U+0303 |
| CC84 | COMBINING MACRON | U+0304 |
| CC85 | COMBINING OVERLINE | U+0305 |
| CC86 | COMBINING BREVE | U+0306 |
| CC87 | COMBINING DOT ABOVE | U+0307 |
| CC88 | COMBINING DIAERESIS | U+0308 |
| CC89 | COMBINING HOOK ABOVE | U+0309 |
| CC8A | COMBINING RING ABOVE | U+030A |
| CC8B | COMBINING DOUBLE ACUTE ACCENT | U+030B |
| CC8C | COMBINING CARON | U+030C |
| CC8D | COMBINING VERTICAL LINE ABOVE | U+030D |
| CC8E | COMBINING DOUBLE VERTICAL LINE ABOVE | U+030E |
| CC8F | COMBINING DOUBLE GRAVE ACCENT | U+030F |
| CC90 | COMBINING CANDRABINDU | U+0310 |
| CC91 | COMBINING INVERTED BREVE | U+0311 |
| CC92 | COMBINING TURNED COMMA ABOVE | U+0312 |
| CC93 | COMBINING COMMA ABOVE | U+0313 |
| CC94 | COMBINING REVERSED COMMA ABOVE | U+0314 |
| CC95 | COMBINING COMMA ABOVE RIGHT | U+0315 |
| CC96 | COMBINING GRAVE ACCENT BELOW | U+0316 |
| CC97 | COMBINING ACUTE ACCENT BELOW | U+0317 |
| CC98 | COMBINING LEFT TACK BELOW | U+0318 |
| CC99 | COMBINING RIGHT TACK BELOW | U+0319 |
| CC9A | COMBINING LEFT ANGLE ABOVE | U+031A |
| CC9B | COMBINING HORN | U+031B |
| CC9C | COMBINING LEFT HALF RING BELOW | U+031C |
| CC9D | COMBINING UP TACK BELOW | U+031D |
| CC9E | COMBINING DOWN TACK BELOW | U+031E |
| CC9F | COMBINING PLUS SIGN BELOW | U+031F |
| CCA0 | COMBINING MINUS SIGN BELOW | U+0320 |
| CCA1 | COMBINING PALATALIZED HOOK BELOW | U+0321 |
| CCA2 | COMBINING RETROFLEX HOOK BELOW | U+0322 |
| CCA3 | COMBINING DOT BELOW | U+0323 |
| CCA4 | COMBINING DIAERESIS BELOW | U+0324 |
| CCA5 | COMBINING RING BELOW | U+0325 |
| CCA6 | COMBINING COMMA BELOW | U+0326 |
| CCA7 | COMBINING CEDILLA | U+0327 |
| CCA8 | COMBINING OGONEK | U+0328 |
| CCA9 | COMBINING VERTICAL LINE BELOW | U+0329 |
| CCAA | COMBINING BRIDGE BELOW | U+032A |
| CCAB | COMBINING INVERTED DOUBLE ARCH BELOW | U+032B |
| CCAC | COMBINING CARON BELOW | U+032C |
| CCAD | COMBINING CIRCUMFLEX ACCENT BELOW | U+032D |
| CCAE | COMBINING BREVE BELOW | U+032E |
| CCAF | COMBINING INVERTED BREVE BELOW | U+032F |
| CCB0 | COMBINING TILDE BELOW | U+0330 |
| CCB1 | COMBINING MACRON BELOW | U+0331 |
| CCB2 | COMBINING LOW LINE | U+0332 |
| CCB3 | COMBINING DOUBLE LOW LINE | U+0333 |
| CCB4 | COMBINING TILDE OVERLAY | U+0334 |
| CCB5 | COMBINING SHORT STROKE OVERLAY | U+0335 |
| CCB6 | COMBINING LONG STROKE OVERLAY | U+0336 |
| CCB7 | COMBINING SHORT SOLIDUS OVERLAY | U+0337 |
| CCB8 | COMBINING LONG SOLIDUS OVERLAY | U+0338 |
| CCB9 | COMBINING RIGHT HALF RING BELOW | U+0339 |
| CCBA | COMBINING INVERTED BRIDGE BELOW | U+033A |
| CCBB | COMBINING SQUARE BELOW | U+033B |
| CCBC | COMBINING SEAGULL BELOW | U+033C |
| CCBD | COMBINING X ABOVE | U+033D |
| CCBE | COMBINING VERTICAL TILDE | U+033E |
| CCBF | COMBINING DOUBLE OVERLINE | U+033F |
| CD80 | COMBINING GRAVE TONE MARK | U+0340 |
| CD81 | COMBINING ACUTE TONE MARK | U+0341 |
| CD84 | COMBINING GREEK DIALYTIKA TONOS | U+0344 |
| CD85 | COMBINING GREEK YPOGEGRAMMENI | U+0345 |
| D283 | COMBINING CYRILLIC TITLO | U+0483 |
| D284 | COMBINING CYRILLIC PALATALIZATION | U+0484 |
| D285 | COMBINING CYRILLIC DASIA PNEUMATA | U+0485 |
| D286 | COMBINING CYRILLIC PSILI PNEUMATA | U+0486 |
| E28390 | COMBINING LEFT HARPOON ABOVE | U+20D0 |
| E28391 | COMBINING RIGHT HARPOON ABOVE | U+20D1 |
| E28392 | COMBINING LONG VERTICAL LINE OVERLAY | U+20D2 |
| E28393 | COMBINING SHORT VERTICAL LINE OVERLAY | U+20D3 |
| E28394 | COMBINING ANTICLOCKWISE ARROW ABOVE | U+20D4 |
| E28395 | COMBINING CLOCKWISE ARROW ABOVE | U+20D5 |
| E28396 | COMBINING LEFT ARROW ABOVE | U+20D6 |
| E28397 | COMBINING RIGHT ARROW ABOVE | U+20D7 |
| E28398 | COMBINING RING OVERLAY | U+20D8 |
| E28399 | COMBINING CLOCKWISE RING OVERLAY | U+20D9 |
| E2839A | COMBINING ANTICLOCKWISE RING OVERLAY | U+20DA |
| E2839B | COMBINING THREE DOTS ABOVE | U+20DB |
| E2839C | COMBINING FOUR DOTS ABOVE | U+20DC |
| E283A1 | COMBINING LEFT RIGHT ARROW ABOVE | U+20E1 |
| E38299 | COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK | U+3099 |
| E3829A | COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK | U+309A |
+----------+----------------------------------------------------+---------+
88 rows in set (0.01 sec)
Cautions
If you have FOREIGN KEY CONSTRAINTs on fields you are 'fixing', you will probably mess up the constraints. It might work to disable FK constraints, fix both tables, then re-enable them.
If you have FOREIGN KEY CONSTRAINTs on fields other than the ones you are 'fixing', it is probably best to disable/reenable them.
If your table is huge, and/or you have lots of indexes, the ALTERs can be very time consuming. It may be better to disable/drop the indexes during the 'fix', then reapply them at the end. Do not drop an AUTO_INCREMENT index. In the case of InnoDB, do not drop the PRIMARY KEY.
Introspection
To see what charsets and collations exist in your version of MySQL:
SHOW CHARSET; -- only latin1 and utf8/utf8mb4 are likely to be interesting
SHOW COLLATION; -- list the possible values.
SHOW COLLATION LIKE 'utf8%';
SHOW COLLATION LIKE '%_cs';
SHOW SESSION VARIABLES LIKE 'character_set%';
SHOW GLOBAL VARIABLES LIKE 'character_set%';
SHOW CREATE TABLE tbl \G
Collation name suffixes
⚈ _ci -- case insensitive; this includes ignoring accents
⚈ _cs -- case sensitive (and accent sensitive)
⚈ _bin -- just compare the bits
my.cnf
The following lines in your my.cnf initialize various things in various contexts. (This is untested.) Note that init-connect is not applied for SUPER users:
[client]
default-character-set = cp1252
[mysqld]
skip-character-set-client-handshake
# (deprecated in mysqld) default-character-set = cp1252
character-set-server = cp1252
collation-server = latin1_general_cs
init-connect = SET NAMES cp1252
# (init-connect is _not_ seen by SUPER users, such as "root".)
[mysqldump]
default-character-set = cp1252
[mysql]
default-character-set = cp1252
Caution: Setting default-character-set that way for the mysql client cannot be overridden by SET NAMES.
In my.ini, in the [mysqld] group:
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8
This may speed up connections, and avoid the need for SET NAMES.
Stored Procedures/Functions, Triggers
ALTER DATABASE dbname CHARACTER SET ... can impact already compiled Stored Procedures. Recompile them.
There may be other charset issues. In particular, a routine seems to have a charset, but that declaration seems to have no effect.
JSON
If your JSON strings are showing things like "\uff62\u30c9\u30e9\u30b4\u30f3\u30b3\u30a4", you have "escaped unicode". (Those are the unicode characters for "｢ドラゴンコイ".)
If you would prefer utf8, then this should work in PHP (since version 5.4):
$t = json_encode($s, JSON_UNESCAPED_UNICODE);
Before PHP 5.4:
function json_encode_unicode($input) {
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function ($matches) {
            return mb_convert_encoding(pack('H*', $matches[1]), 'UTF-8', 'UTF-16');
        },
        json_encode($input)
    );
}
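Python's json module behaves the same way; ensure_ascii=False is the analogue of PHP's JSON_UNESCAPED_UNICODE:

```python
import json
print(json.dumps('é'))                       # "\u00e9" -- escaped unicode (default)
print(json.dumps('é', ensure_ascii=False))   # "é"      -- raw utf8
```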
utf8 Collations
Which to pick? Why?
⚈ utf8_bin -- just looks at bits; no case folding, etc
⚈ utf8_general_ci -- case and accent folding.
⚈ utf8_unicode_ci -- Perhaps best. Diphthongs, combining accents, etc, are handled
⚈ utf8_..._ci -- If your application is entirely (or mostly) in one language, this may be better for you. See the chart below.
Notes on utf8_general_ci vs utf8_unicode_ci
⚈ utf8_general_ci is a very simple collation. What it does: it simply removes all accents, then converts to upper case, and compares the results.
⚈ utf8_unicode_ci supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss"; letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
⚈ utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.
⚈ utf8_unicode_ci is generally more accurate for all scripts.
⚈ For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci, however, is fine only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian do not sort well.
⚈ The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
⚈ from StackOverflow
utf8 Collations Examples
What pairs of things will be treated as equal? What order does a collation impose? How do I override a column's collation? (Caution: Doing so usually prevents efficient use of an INDEX.) The examples below show altering an ORDER BY clause and an "=" operator.
Here is test code for the results below:
drop table charutf8;
-- Create and populate a table with a "utf8_bin" column:
CREATE TABLE `charutf8` (
`a` varchar(9) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO charutf8 (a) VALUES
('abc'), ('def'), ('ABC'), ('ábc'), ('́abc'), ('Abç'), ('ss'), ('ß');
-- Dump of characters, and lengths:
SELECT a, CHAR_LENGTH(a), LENGTH(a), HEX(a)
FROM charutf8
ORDER BY a;
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci
Unicode canonical equivalence
-- Show that the ordering varies with collation:
SELECT GROUP_CONCAT(a ORDER BY a) AS 'bin order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_unicode_ci) AS 'unicode_ci order' FROM charutf8;
-- Self-JOIN to see what values are the same; collation-dependent:
SELECT CHAR_LENGTH(x.a) AS clen, LENGTH(x.a), x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_bin'
FROM charutf8 x
JOIN charutf8 y ON x.a = y.a
GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_general_ci'
FROM charutf8 x
JOIN charutf8 y ON x.a = y.a COLLATE utf8_general_ci
GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_unicode_ci'
FROM charutf8 x
JOIN charutf8 y ON x.a = y.a COLLATE utf8_unicode_ci
GROUP BY x.a;
Results and conclusions:
-- Dump of characters, and lengths:
+-------+----------------+-----------+------------+
| a | CHAR_LENGTH(a) | LENGTH(a) | HEX(a) |
+-------+----------------+-----------+------------+
| ABC | 3 | 3 | 414243 |
| Abç | 3 | 4 | 4162C3A7 |
| abc | 3 | 3 | 616263 |
| def | 3 | 3 | 646566 |
| ss | 2 | 2 | 7373 |
| ß | 1 | 2 | C39F |
| ábc | 3 | 4 | C3A16263 |
| ́abc | 4 | 5 | CC81616263 | -- the misalignment is due to the output not handling "combining" chars correctly
+-------+----------------+-----------+------------+
8 rows in set (0.00 sec)
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci
-- Show that the ordering varies with collation:
+-----------------------------------+
| bin order |
+-----------------------------------+
| ABC,Abç,abc,def,ss,ß,ábc,́abc | -- Notice how accents follow all letters
+-----------------------------------+
mysql> SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
+-----------------------------------+
| general_ci order |
+-----------------------------------+
| ábc,abc,Abç,ABC,def,ß,ss,́abc | -- Regular accents mixed in; the combining accent is not
+-----------------------------------+
+-----------------------------------+
| unicode_ci order |
+-----------------------------------+
| ́abc,ábc,abc,Abç,ABC,def,ß,ss | -- Combining accent sorted with its mates
+-----------------------------------+
-- Self-JOIN to see what values are the same; collation-dependent:
+-------+---------------------+
| a | Equal with utf8_bin |
+-------+---------------------+
| ABC | ABC | -- exact match for case and accents
| Abç | Abç |
| abc | abc |
| def | def |
| ss | ss |
| ß | ß | -- ss and ß are NOT equal
| ábc | ábc |
| ́abc | ́abc |
+-------+---------------------+
+-------+----------------------------+
| a | Equal with utf8_general_ci |
+-------+----------------------------+
| ABC | ábc,ABC,Abç,abc | -- case folding and accent stripping
| Abç | ábc,ABC,Abç,abc |
| abc | ábc,ABC,Abç,abc |
| def | def |
| ss | ss | -- ss and ß are NOT equal
| ß | ß |
| ábc | ABC,Abç,abc,ábc |
| ́abc | ́abc | -- 'combining' NOT equal
+-------+----------------------------+
+-------+----------------------------+
| a | Equal with utf8_unicode_ci |
+-------+----------------------------+
| ABC | ábc,ABC,́abc,Abç,abc | -- case folding and accent stripping
| Abç | abc,ábc,ABC,́abc,Abç |
| abc | Abç,abc,ábc,ABC,́abc |
| def | def |
| ss | ss,ß | -- ss and ß are EQUAL
| ß | ß,ss |
| ábc | ábc,ABC,́abc,Abç,abc |
| ́abc | abc,ábc,ABC,́abc,Abç | -- 'combining' EQUAL
+-------+----------------------------+
Changing the collation in a SELECT
Suppose you have
city VARCHAR(66) COLLATE utf8_bin
and you need to find "San José" when you do
WHERE city = "San Jose"
This applies a different collation for the comparison:
WHERE city COLLATE utf8_general_ci = 'San Jose'
Caveat: An INDEX(city) cannot be used when you override the collation, so the latter WHERE will be slower.
ERROR 1253 (42000): COLLATION '...' is not valid for CHARACTER SET '...'
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
Empty set (0.00 sec)
Alternatively, you can explicitly set the character set using a 'character set introducer':
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = _utf8'rené' collate utf8_general_ci;
Empty set (0.00 sec)
StackOverflow reference
utf8 Collation Variations:
See also a more complete table for utf8 and utf8mb4 in MySQL 8.0.
Here's a comparison of the collation of the alphabet characters for the utf8 collations you might consider.
⚈ spanish2 puts ch after cz and ll after lz and Ñ,ñ after nz
⚈ utf8_general_ci treats sharp-s as 's'; the rest treat it as 'ss'.
⚈ Ligature (Æ, etc) collation varies.
⚈ Eth (Ð) is treated like a letter between D and E.
⚈ utf8_estonian_ci does some major shuffling (Z between S and T, etc).
⚈ Danish sorts Ä=Æ=ä=æ Ö=Ø=ö=ø Aa=Å=å Þ=þ after z.
utf8 : utf8_general_ci A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az
utf8 : utf8_general_mysql500_ci A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az
utf8 : utf8_danish_ci A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą ae az
utf8 : utf8_swedish_ci A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą Aa ae az
utf8 : utf8_estonian_ci A=a=ª=À=Á=Â=Ã=Å=à=á=â=ã=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_lithuanian_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_latvian_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_unicode_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_polish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā Aa ae az Ą=ą Æ=æ
utf8 : utf8_spanish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_spanish2_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_turkish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_general_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz
utf8 : utf8_general_mysql500_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz
utf8 : utf8_danish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_swedish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_estonian_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_lithuanian_ci B=b C=c=ch=Ç=ç cz Č=č D=d dz Ð=ð
utf8 : utf8_latvian_ci B=b C=c=Ç=ç ch cz Č=č D=d dz Ð=ð
utf8 : utf8_unicode_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_polish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_spanish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_spanish2_ci B=b C=c=Ç=ç=Č=č cz ch D=d dz Ð=ð
utf8 : utf8_turkish_ci B=b C=c=Č=č ch cz Ç=ç D=d dz Ð=ð
utf8 : utf8_general_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_general_mysql500_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_danish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_swedish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_estonian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_lithuanian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_latvian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_unicode_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_polish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ě=ě Ę=ę
utf8 : utf8_spanish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_spanish2_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_turkish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_general_ci F=f fz G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_general_mysql500_ci F=f fz G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_danish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_swedish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_estonian_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_lithuanian_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_latvian_ci F=f fz ƒ G=g=ğ Ģ=ģ H=h hz
utf8 : utf8_unicode_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_polish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_spanish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_spanish2_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_turkish_ci F=f fz ƒ G=g=Ģ=ģ H=h hz
utf8 : utf8_general_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı ij iz
utf8 : utf8_general_mysql500_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı ij iz
utf8 : utf8_danish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_swedish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_estonian_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_lithuanian_ci I=Y=i=y=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz=yz
utf8 : utf8_latvian_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz Y=y yz
utf8 : utf8_unicode_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_polish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_spanish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_spanish2_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_turkish_ci I=ı i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_general_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj ll lz M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_general_mysql500_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj ll lz M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_danish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_swedish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_estonian_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_lithuanian_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_latvian_ci J=j K=k Ķ=ķ L=l lj=LJ=Lj=lj ll lz Ļ=ļ Ł=ł M=m N=n=Ñ=ñ nz Ņ=ņ
utf8 : utf8_unicode_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_polish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_spanish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ņ=ņ nz Ñ=ñ
utf8 : utf8_spanish2_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj lz ll Ł=ł M=m N=n=Ņ=ņ nz Ñ=ñ
utf8 : utf8_turkish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_general_ci O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe oz
utf8 : utf8_general_mysql500_ci O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe oz
utf8 : utf8_danish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz
utf8 : utf8_swedish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz
utf8 : utf8_estonian_ci O=o=º=Ò=Ó=Ô=ò=ó=ô oe=Œ=œ oz Ø=ø
utf8 : utf8_lithuanian_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_latvian_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_unicode_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_polish_ci O=o=º=Ò=Ô=Õ=Ö=ò=ô=õ=ö oe=Œ=œ oz Ó=ó Ø=ø
utf8 : utf8_spanish_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_spanish2_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_turkish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz Ö=ö Ø=ø
utf8 : utf8_general_ci P=p Q=q R=r S=s=ß=ş=Š=Š=š=š sh ss sz
utf8 : utf8_general_mysql500_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss sz
utf8 : utf8_danish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_swedish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_estonian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š Z=z zh zz Ž=Ž=ž=ž
utf8 : utf8_lithuanian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š
utf8 : utf8_latvian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š
utf8 : utf8_unicode_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_polish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_spanish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_spanish2_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_turkish_ci P=p Q=q R=r S=s=Š=Š=š=š sh ss=ß sz
utf8 : utf8_general_ci T=t TM=tm tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_general_mysql500_ci T=t TM=tm tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_danish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz
utf8 : utf8_swedish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz
utf8 : utf8_estonian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz
utf8 : utf8_lithuanian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_latvian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_unicode_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_polish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_spanish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_spanish2_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_turkish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz Ü=ü
utf8 : utf8_general_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_general_mysql500_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_danish_ci V=v W=w X=x Y=y=Ü=Ý=ü=ý=ÿ=Ÿ yz
utf8 : utf8_swedish_ci V=v W=w X=x Y=y=Ü=Ý=ü=ý=ÿ=Ÿ yz
utf8 : utf8_estonian_ci V=v W=w Õ=õ Ä=ä Ö=ö Ü=ü X=x Y=y=Ý=ý=ÿ=Ÿ yz Þ=þ
utf8 : utf8_lithuanian_ci V=v W=w X=x Ý=ý=ÿ=Ÿ
utf8 : utf8_latvian_ci V=v W=w X=x Ý=ý=ÿ=Ÿ
utf8 : utf8_unicode_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_polish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_spanish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_spanish2_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_turkish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_general_ci Z=z=Ž=Ž=ž=ž zh zz Æ=æ Ð=ð Ø=ø Þ=þ Ł=ł Œ=œ ƒ LJ=Lj=lj
utf8 : utf8_general_mysql500_ci Z=z=Ž=Ž=ž=ž zh zz Æ=æ Ð=ð Ø=ø Þ=þ ß Ł=ł Œ=œ ƒ LJ=Lj=lj
utf8 : utf8_danish_ci Z=z=Ž=Ž=ž=ž zh zz Ä=Æ=ä=æ Ö=Ø=ö=ø Aa=Å=å Þ=þ
utf8 : utf8_swedish_ci Z=z=Ž=Ž=ž=ž zh zz Å=å Ä=Æ=ä=æ Ö=Ø=ö=ø Þ=þ
utf8 : utf8_estonian_ci
utf8 : utf8_lithuanian_ci Z=z zh zz Ž=Ž=ž=ž Þ=þ
utf8 : utf8_latvian_ci Z=z zh zz Ž=Ž=ž=ž Þ=þ
utf8 : utf8_unicode_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_polish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_spanish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_spanish2_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_turkish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8mb4_% mostly works the same as utf8.
Unfortunately, it is not easy to separate case folding from accent stripping. Here is what someone has created: utf8_distinct_ci
8-bit Encodings
The columns are decimal, latin1 hex, utf8 hex, char, htmlentity.
128 80 E282AC € €
129 81 C281 
130 82 E2809A ‚ ‚
131 83 C692 ƒ ƒ
132 84 E2809E „ „
133 85 E280A6 … …
134 86 E280A0 † †
135 87 E280A1 ‡ ‡
136 88 CB86 ˆ ˆ
137 89 E280B0 ‰ ‰
138 8A C5A0 Š Š
139 8B E280B9 ‹ ‹
140 8C C592 Œ Œ
141 8D C28D 
142 8E C5BD Ž Ž
143 8F C28F 
144 90 C290 
145 91 E28098 ‘ ‘
146 92 E28099 ’ ’
147 93 E2809C “ “
148 94 E2809D ” ”
149 95 E280A2 • •
150 96 E28093 – –
151 97 E28094 — —
152 98 CB9C ˜ ˜
153 99 E284A2 ™ ™
154 9A C5A1 š š
155 9B E280BA › ›
156 9C C593 œ œ
157 9D C29D 
158 9E C5BE ž ž
159 9F C5B8 Ÿ Ÿ
160 A0 C2A0
161 A1 C2A1 ¡ ¡
162 A2 C2A2 ¢ ¢
163 A3 C2A3 £ £
164 A4 C2A4 ¤ ¤
165 A5 C2A5 ¥ ¥
166 A6 C2A6 ¦ ¦
167 A7 C2A7 § §
168 A8 C2A8 ¨ ¨
169 A9 C2A9 © ©
170 AA C2AA ª ª
171 AB C2AB « «
172 AC C2AC ¬ ¬
173 AD C2AD ­
174 AE C2AE ® ®
175 AF C2AF ¯ ¯
176 B0 C2B0 ° °
177 B1 C2B1 ± ±
178 B2 C2B2 ² ²
179 B3 C2B3 ³ ³
180 B4 C2B4 ´ ´
181 B5 C2B5 µ µ
182 B6 C2B6 ¶ ¶
183 B7 C2B7 · ·
184 B8 C2B8 ¸ ¸
185 B9 C2B9 ¹ ¹
186 BA C2BA º º
187 BB C2BB » »
188 BC C2BC ¼ ¼
189 BD C2BD ½ ½
190 BE C2BE ¾ ¾
191 BF C2BF ¿ ¿
192 C0 C380 À À
193 C1 C381 Á Á
194 C2 C382 Â Â
195 C3 C383 Ã Ã
196 C4 C384 Ä Ä
197 C5 C385 Å Å
198 C6 C386 Æ Æ
199 C7 C387 Ç Ç
200 C8 C388 È È
201 C9 C389 É É
202 CA C38A Ê Ê
203 CB C38B Ë Ë
204 CC C38C Ì Ì
205 CD C38D Í Í
206 CE C38E Î Î
207 CF C38F Ï Ï
208 D0 C390 Ð Ð
209 D1 C391 Ñ Ñ
210 D2 C392 Ò Ò
211 D3 C393 Ó Ó
212 D4 C394 Ô Ô
213 D5 C395 Õ Õ
214 D6 C396 Ö Ö
215 D7 C397 × ×
216 D8 C398 Ø Ø
217 D9 C399 Ù Ù
218 DA C39A Ú Ú
219 DB C39B Û Û
220 DC C39C Ü Ü
221 DD C39D Ý Ý
222 DE C39E Þ Þ
223 DF C39F ß ß
224 E0 C3A0 à à
225 E1 C3A1 á á
226 E2 C3A2 â â
227 E3 C3A3 ã ã
228 E4 C3A4 ä ä
229 E5 C3A5 å å
230 E6 C3A6 æ æ
231 E7 C3A7 ç ç
232 E8 C3A8 è è
233 E9 C3A9 é é
234 EA C3AA ê ê
235 EB C3AB ë ë
236 EC C3AC ì ì
237 ED C3AD í í
238 EE C3AE î î
239 EF C3AF ï ï
240 F0 C3B0 ð ð
241 F1 C3B1 ñ ñ
242 F2 C3B2 ò ò
243 F3 C3B3 ó ó
244 F4 C3B4 ô ô
245 F5 C3B5 õ õ
246 F6 C3B6 ö ö
247 F7 C3B7 ÷ ÷
248 F8 C3B8 ø ø
249 F9 C3B9 ù ù
250 FA C3BA ú ú
251 FB C3BB û û
252 FC C3BC ü ü
253 FD C3BD ý ý
254 FE C3BE þ þ
255 FF C3BF ÿ ÿ
The 'missing' characters in the above table:
⚈ 129
⚈ 141 reverse line feed
⚈ 143
⚈ 144
⚈ 157
⚈ 173 shows when breaking across lines Soft Hyphen
If the utf8 bytes are misinterpreted as latin1, the last column is what you may be seeing.
128 80 E282AC € €
129 81 C281 Â
130 82 E2809A ‚ ‚
131 83 C692 ƒ Æ’
132 84 E2809E „ „
133 85 E280A6 … …
134 86 E280A0 † â€
135 87 E280A1 ‡ ‡
136 88 CB86 ˆ ˆ
137 89 E280B0 ‰ ‰
138 8A C5A0 Š Å
139 8B E280B9 ‹ ‹
140 8C C592 Œ Å’
141 8D C28D Â
142 8E C5BD Ž Ž
143 8F C28F Â
144 90 C290 Â
145 91 E28098 ‘ ‘
146 92 E28099 ’ ’
147 93 E2809C “ “
148 94 E2809D ” â€
149 95 E280A2 • •
150 96 E28093 – –
151 97 E28094 — —
152 98 CB9C ˜ Ëœ
153 99 E284A2 ™ â„¢
154 9A C5A1 š Å¡
155 9B E280BA › ›
156 9C C593 œ Å“
157 9D C29D Â
158 9E C5BE ž ž
159 9F C5B8 Ÿ Ÿ
160 A0 C2A0 Â
161 A1 C2A1 ¡ ¡
162 A2 C2A2 ¢ ¢
163 A3 C2A3 £ £
164 A4 C2A4 ¤ ¤
165 A5 C2A5 ¥ Â¥
166 A6 C2A6 ¦ ¦
167 A7 C2A7 § §
168 A8 C2A8 ¨ ¨
169 A9 C2A9 © ©
170 AA C2AA ª ª
171 AB C2AB « «
172 AC C2AC ¬ ¬
173 AD C2AD Â
174 AE C2AE ® ®
175 AF C2AF ¯ ¯
176 B0 C2B0 ° °
177 B1 C2B1 ± ±
178 B2 C2B2 ² ²
179 B3 C2B3 ³ ³
180 B4 C2B4 ´ ´
181 B5 C2B5 µ µ
182 B6 C2B6 ¶ ¶
183 B7 C2B7 · ·
184 B8 C2B8 ¸ ¸
185 B9 C2B9 ¹ ¹
186 BA C2BA º º
187 BB C2BB » »
188 BC C2BC ¼ ¼
189 BD C2BD ½ ½
190 BE C2BE ¾ ¾
191 BF C2BF ¿ ¿
192 C0 C380 À À
193 C1 C381 Á Ã
194 C2 C382  Â
195 C3 C383 à Ã
196 C4 C384 Ä Ã„
197 C5 C385 Å Ã…
198 C6 C386 Æ Ã†
199 C7 C387 Ç Ã‡
200 C8 C388 È Ãˆ
201 C9 C389 É Ã‰
202 CA C38A Ê ÃŠ
203 CB C38B Ë Ã‹
204 CC C38C Ì ÃŒ
205 CD C38D Í Ã
206 CE C38E Î ÃŽ
207 CF C38F Ï Ã
208 D0 C390 Ð Ã
209 D1 C391 Ñ Ã‘
210 D2 C392 Ò Ã’
211 D3 C393 Ó Ã“
212 D4 C394 Ô Ã”
213 D5 C395 Õ Ã•
214 D6 C396 Ö Ã–
215 D7 C397 × Ã—
216 D8 C398 Ø Ã˜
217 D9 C399 ٠Ù
218 DA C39A Ú Ãš
219 DB C39B Û Ã›
220 DC C39C Ü Ãœ
221 DD C39D Ý Ã
222 DE C39E Þ Ãž
223 DF C39F ß ÃŸ
224 E0 C3A0 à Ã
225 E1 C3A1 á á
226 E2 C3A2 â â
227 E3 C3A3 ã ã
228 E4 C3A4 ä ä
229 E5 C3A5 å Ã¥
230 E6 C3A6 æ æ
231 E7 C3A7 ç ç
232 E8 C3A8 è è
233 E9 C3A9 é é
234 EA C3AA ê ê
235 EB C3AB ë ë
236 EC C3AC ì ì
237 ED C3AD í Ã
238 EE C3AE î î
239 EF C3AF ï ï
240 F0 C3B0 ð ð
241 F1 C3B1 ñ ñ
242 F2 C3B2 ò ò
243 F3 C3B3 ó ó
244 F4 C3B4 ô ô
245 F5 C3B5 õ õ
246 F6 C3B6 ö ö
247 F7 C3B7 ÷ ÷
248 F8 C3B8 ø ø
249 F9 C3B9 ù ù
250 FA C3BA ú ú
251 FB C3BB û û
252 FC C3BC ü ü
253 FD C3BD ý ý
254 FE C3BE þ þ
255 FF C3BF ÿ ÿ
More references
Various discussions on the MySQL forums:
Please note that any PHP code using the mysql_* API should be replaced by the mysqli_* API; the former is deprecated.
PHP/html/mysql usage
Migrating from latin1 to utf-8
Convert database from utf8_general_ci to uft8_unicode_ci
Latin1 to unicode character convert -- data are display with ?????
Problem with accents
help in relating tables and normailsation
Select statement with town Polish characters
connection collation
cannot change character set
joined query problem
Slow retreieve under MyIsam
UTF-8 problem
trying to move a database from mysql 4.0.18 to 5.0.77
Server just hanging - Copying to tmp table
Encoding problem...
Unicode collation, utf8 data fetch/manip problem
What storage requirement has FLOAT(24) ? Doc is ambigous
what is the step to convert a db from latin1 to UTF8
Special characters and database
Require suggestion for changing character set in Master master replication setup
Character encodings issues with php, mysql, apache
load data character set question
Storing UTF-8 strings in MySQL Database
utf8 inputs store as html entities, how to retrievesubstr of utf8 inputs?
Is STR_TO_DATE function doesn't work with UCS2 charset?
Miscellany:
Converting through BLOB
Wikipedia shows 0xD5 0xF5 iso upper/lower
shows utf8 c590 / c591
details several representations for 0150
discussion about how charsets are used
MySQL 5.5 & 4-byte utf8
MSDN discussion of Unicode
unicode.org -- The definitive word, but heavy going
mysqldump issues
Example of double-encoding and mysqldump
Windows, Unicode, LOAD DATA, little-endian, etc
Charbase - A visual unicode database
Adding a custom collation
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Unicode.scarfboy
UTF8 all the way through
latin1 ~= cp1252 ~= ISO 8859-1
latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assigns characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d. 8859 and some comparisons
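Python's codecs illustrate the difference (a quick check, using Python's cp1252 and latin-1 codecs as stand-ins for MySQL's latin1 and the strict ISO definition):

```python
# 0x80 is the Euro sign in cp1252 (and thus in MySQL's latin1),
# but an undefined C1 control character in strict ISO 8859-1.
print(b'\x80'.decode('cp1252'))    # '€'
print(b'\x80'.decode('latin-1'))   # U+0080, an invisible control char
print('€'.encode('utf-8').hex())   # 'e282ac' -- matches the table above
```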
Combining Diacriticals
Some characters can be encoded either as a single character or as a letter plus a "combining" accent. For example, ä can be represented in utf8 as either c3a4 or 61cc88. The single-character form is c3a4. The other encoding is 61, the hex code for 'a', followed by cc88, the utf-8 code for the 'combining diaeresis' (umlaut, double dot, etc). Wikipedia: Combining Diacritical Marks
If the collation is utf8_general_ci, the two encodings are treated as unequal. With utf8_unicode_ci, they are equal. Searching database for strings with Swedish characters
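Python's unicodedata module shows the two encodings and how normalization reconciles them (a sketch; MySQL itself does not normalize strings, it only collates them):

```python
import unicodedata

single   = '\u00e4'        # 'ä' as one code point
combined = 'a\u0308'       # 'a' + combining diaeresis
print(single.encode('utf-8').hex())    # 'c3a4'
print(combined.encode('utf-8').hex())  # '61cc88'
print(single == combined)              # False -- like utf8_general_ci
# NFC normalization composes the pair into the single character:
print(unicodedata.normalize('NFC', combined) == single)  # True
```

Normalizing to NFC before INSERT is one way to avoid depending on the collation for this.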
Programmatic Collation
Suppose you have need for a collation that cannot be performed by any existing collation. Create two columns instead of one; let's call them exact and canonical. exact contains the correctly spelled (caps, accents, etc) value, and it has a suitable collation. canonical is a programmatically-altered version of exact. This may include
⚈ stripping accents
⚈ folding to lowercase
⚈ removing redundant whitespace (blanks, newlines, tabs)
⚈ removing punctuation
⚈ removing markup (such as HTML tags)
⚈ it could even include providing alternate spellings or synonyms
canonical would also have an index, possibly FULLTEXT and/or utf8mb4_bin. canonical may have duplicates even though exact probably does not.
Then, when the 'user' searches, apply the same folding to the search string before searching canonical. Display results from exact.
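A minimal sketch of such a canonicalizer (a hypothetical helper, not from any library; real requirements will vary by application):

```python
import re
import unicodedata

def canonicalize(exact: str) -> str:
    """Fold a string for the 'canonical' search column."""
    # Strip accents: decompose to NFD, then drop combining marks (category Mn).
    s = unicodedata.normalize('NFD', exact)
    s = ''.join(c for c in s if unicodedata.category(c) != 'Mn')
    s = s.casefold()                   # fold case
    s = re.sub(r'[^\w\s]', '', s)      # remove punctuation
    s = ' '.join(s.split())            # collapse redundant whitespace
    return s

print(canonicalize('  San  José! '))   # 'san jose'
```

The same function is applied both when populating canonical and when folding the user's search string.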
LOAD DATA INFILE
In LOAD DATA INFILE, the manual points out:
"The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies 'no conversion'."
Example:
LOAD DATA LOCAL INFILE '...'
INTO TABLE tbl
CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '\\'
IGNORE 1 ROWS;
Characters in Western Europe
Taken from Wikipedia
å, ä and ö in Finnish and Swedish
à, ç, è, é, ï, í, ò, ó, ú, ü in Catalan
å, æ and ø in Norwegian and Danish
á, é, ó, ý, è, ë, ï in Dutch
ä, ö, ü and ß in German
á, ð, í, ó, ú, ý, æ and ø in Faroese
á, ð, é, í, ó, ú, ý, þ, æ and ö in Icelandic
à, â, ç, è, é, ë, ê, ï, î, ö, ô, ù, û, ÿ, æ, œ in French
à, è, é, ì, ò, ù in Italian
á, é, í, ñ, ó, ú, ï, ü, ¡, ¿ in Spanish
à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese (ü no longer used)
á, é, í, ó, ú in Irish
£ in British English
Turkish
The columns: utf8, unicode codepoint decimal, hex, HTML entity, 'name':
C387 0199 00C7 Ç latin capital letter c with cedilla
C396 0214 00D6 Ö latin capital letter o with diaeresis
C39C 0220 00DC Ü latin capital letter u with diaeresis
C3A7 0231 00E7 ç latin small letter c with cedilla
C3B6 0246 00F6 ö latin small letter o with diaeresis
C3BC 0252 00FC ü latin small letter u with diaeresis
C49E 0286 011E Ğ latin capital letter g with breve
C49F 0287 011F ğ latin small letter g with breve
C4B0 0304 0130 İ latin capital letter i with dot above
C4B1 0305 0131 ı latin small letter dotless i
C59E 0350 015E Ş latin capital letter s with cedilla
C59F 0351 015F ş latin small letter s with cedilla
E282A4 8356 20A4 ₤ lira sign
E282BA 8378 20BA ₺ Turkish lira sign
Don't expect latin1 to work well:
CREATE TABLE `se96265` (
`content` mediumtext NOT NULL,
FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
SET NAMES utf8;
-- Turkish word for Croatia, second char is \xC4\xB1
INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
SELECT content, HEX(content), HEX('Hırvatistan') FROM se96265\G
*************************** 1. row ***************************
content: H?rvatistan
HEX(content): 483F72766174697374616E
HEX('Hırvatistan'): 48C4B172766174697374616E
Note how the 'C4B1' turned into '?'
To get an error:
mysql> SET SESSION sql_mode = 'STRICT_ALL_TABLES'; -- So the INSERT will cause an error:
mysql> INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
(Note: This technique for getting an error message is not specific to Turkish.)
ISO/IEC 8859-9 = latin5 = Turkish
If you have Turkish text in a latin1 column/table, you should convert to latin5 in order to correctly handle the 6 characters that are different (latin1: ÐÝÞðýþ --> latin5 ĞİŞğış). See Stackoverflow
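Python can demonstrate the remapping (a sketch using Python's latin-1 and iso8859-9 codecs as stand-ins for MySQL's latin1 and latin5):

```python
# Text stored under latin1 but meant to be Turkish latin5:
misread = 'ÐÝÞðýþ'                  # what latin1 displays
raw = misread.encode('latin-1')     # the actual stored bytes
print(raw.decode('iso8859-9'))      # 'ĞİŞğış' -- the intended Turkish letters
```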
Arabic/Farsi
Arabic characters 2 bytes in utf8: Dxyy. For example, Arabic Letter 'HAH' is D8AD.
To get diacritic insensitive comparisons, use utf8_unicode_ci:
SELECT 'بِسْمِ' = 'بسم' COLLATE utf8_unicode_ci;
The result I got back was 1 (true), meaning they are considered equal. With utf8_general_ci it came back 0, meaning not equal. (Of course, you should declare your column(s) as utf8_unicode_ci rather than relying on the COLLATE clause.)
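As a rough illustration of why utf8_unicode_ci can ignore the diacritics (a Python sketch based on Unicode categories, not MySQL's actual collation tables): the harakat are "combining" marks, which unicode_ci gives zero collation weight:

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Remove combining marks (category Mn), e.g. Arabic harakat."""
    return ''.join(c for c in s if unicodedata.category(c) != 'Mn')

print(strip_marks('بِسْمِ') == 'بسم')   # True -- only the diacritics differed
```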
Stored Farsi, got '???'
Urdu case folding
Hebrew
To search for a particular character in a string, LIKE or REGEXP will work.
To search for "a cell being entirely Hebrew" is not easy. Essentially, you look for the utf8 encoding, then build a REGEXP to accommodate it. For example, to see which rows have a Hebrew name (in utf8):
WHERE HEX(name) REGEXP '^(D7..)+$'
See forum discussion
This assumes (not quite perfectly) that all Hebrew encodings are D7xx. This can be adapted to other languages if the encoding is not too complex. Chinese, for example, is complex:
WHERE HEX(name) REGEXP '^(E.....)+$'
⚈ Chinese characters are 3 bytes (6 hex) long
⚈ This fails to handle the 4-byte Chinese characters
⚈ This includes Japanese, Cherokee, etc.
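The hex-based test can be prototyped in Python (a sketch; re and the utf-8 hex dump stand in for MySQL's HEX() and REGEXP):

```python
import re

def all_hebrew(name: str) -> bool:
    """True if every utf-8 byte pair matches the D7xx Hebrew pattern."""
    hexstr = name.encode('utf-8').hex().upper()
    return re.fullmatch(r'(D7..)+', hexstr) is not None

print(all_hebrew('שלום'))    # True
print(all_hebrew('shalom'))  # False
```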
If you have the "hebrew" encoding in a latin1 column, you may need a 3-step ALTER. See stackoverflow
Storing Hebrew strings Ok, but gibberish when pulling data on another server
Cyrillic, Macedonian, etc
(Random notes)
Cyrillic in utf8 is encoded D0yy-D4yy and EA99yy-EA9Ayy, but perhaps only D0yy is in common use.
For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
Ќ is hex D08C in utf8 or utf8mb4. Cyrillic is completely covered by either CHARACTER SET. К is D09A. Hmmm, this is quite interesting:
mysql> SELECT 'К' = 'Ќ' COLLATE utf8_bin AS bin,
'К' = 'Ќ' COLLATE utf8_general_ci AS general,
'К' = 'Ќ' COLLATE utf8_unicode_ci AS unicode;
+-----+---------+---------+
| bin | general | unicode |
+-----+---------+---------+
| 0 | 1 | 0 |
+-----+---------+---------+
mysql> SELECT 'Г' = 'Ѓ' COLLATE utf8_bin AS bin,
'Г' = 'Ѓ' COLLATE utf8_general_ci AS general,
'Г' = 'Ѓ' COLLATE utf8_unicode_ci AS unicode;
+-----+---------+---------+
| bin | general | unicode |
+-----+---------+---------+
| 0 | 1 | 0 |
+-----+---------+---------+
And if it helps, here's the method from the alphabet helper (quoted from a StackOverflow comment by NewbieOnRails):
def alphabet
  cyrillic_alphabet = ["А", "Б", "В", "Г", "Д", "Ѓ", "Е", "Ж", "З", "Ѕ", "И", "Ј", "К", "Л", "Њ", "М", "Н", "Љ", "О", "П", "Р", "С", "Т", "Ќ", "У", "Ф", "Х", "Ц", "Ч", "Џ", "Ш"]
end
These are letters in Macedonian alphabet, not on the Ubuntu Cyrillic keyboard.
⚈ D086 1030=x0406 [І] L CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
⚈ D087 1031=x0407 [Ї] L CYRILLIC CAPITAL LETTER YI
⚈ D08B 1035=x040B [Ћ] L CYRILLIC CAPITAL LETTER TSHE
⚈ D08E 1038=x040E [Ў] L CYRILLIC CAPITAL LETTER SHORT U
The bit patterns are different, so utf8_bin collates them as unequal. Usually, whenever general_ci treats a pair as equal, unicode_ci does too; these letters are an exception. Cyrillic/Macedonian
Cyrillic example:
title: Добро пожаловать в Диксон,
hex: D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182D18C20D0B220D094D0B8D0BAD181D0BED0BD,
length: 47,
char length: 25
Data entry of Cyrillic in Workbench yields "????"
Other Natural Languages
This is a list of hits about natural languages. Most have not been verified.
Lithuanian ąĄ čČ ęĘ ėĖ įĮ ųŲ ūŪ are saved as ?, however, šŠ žŽ do get saved. The reason: Column declared latin1
Vietnamese collations were added in MySQL 5.6.
em dash, dagger, and other special characters
mysql> SELECT ';' semicolon, UNHEX('cdbe') 'Greek question mark';
+-----------+---------------------+
| semicolon | Greek question mark |
+-----------+---------------------+
| ; | ; |
+-----------+---------------------+
cp1250 - Central Europe
For a Devanagari character range:
SELECT 'क़ज़फ़' REGEXP '^[क़-फ़]+$'; --> true -- But that is mere coincidence!
Stackoverflow thread
UTF8 Chinese String Comparison
REGEXP / RLIKE
Do not attempt to use most of the features of REGEXP to test for non-ascii characters; it will generally not work in utf8. MySQL's REGEXP only works with bytes, not utf8 characters.
Greek alpha (α) is hex CEB1; omega (ω) is hex CF89
⚈ Character classes don't work right: [αω] checks for 4 bytes: CE, B1, CF, 89, which has very little to do with checking for those two letters.
⚈ Ranges also don't work right: [α-ω] checks for byte CE, the range B1-CF, and byte 89.
⚈ . matches one byte, not one character.
⚈ etc.
A few tricks will work:
⚈ WHERE HEX(str) REGEXP '(..)*CEB1CF89' -- checks for 'αω' in the column; but you may as well do WHERE str LIKE '%αω%'.
⚈ WHERE HEX(str) REGEXP '(..)*F' -- checks whether the UTF-8 string needs utf8mb4. (But how did you get str stored in the first place??)
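The byte-oriented reasoning can be checked in Python (a sketch; the utf-8 hex dump shows the units that MySQL's byte-based REGEXP actually sees):

```python
# Greek alpha and omega, as utf-8 hex -- the bytes REGEXP operates on:
print('α'.encode('utf-8').hex())   # 'ceb1'
print('ω'.encode('utf-8').hex())   # 'cf89'
# A leading byte of F0-F4 marks a 4-byte sequence, i.e. utf8mb4 territory:
print(any(b >= 0xF0 for b in 'αω'.encode('utf-8')))   # False
print(any(b >= 0xF0 for b in 'α😀'.encode('utf-8')))  # True -- the emoji
```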
Python
1st or 2nd line in source code (to have literals in the code utf8-encoded): # -*- coding: utf-8 -*-
Python 2 code for dumping hex (etc) for each character of string 'u':
import unicodedata
for i, c in enumerate(u):
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)
Miscellany notes on coding for utf8:
⚈ db = MySQLdb.connect(host=DB_HOST, user=DB_USER, passwd=DB_PASS, db=DB_NAME, charset="utf8", use_unicode=True)
⚈ conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
⚈ config = 'OPTIONS': { ..., 'charset': 'utf8mb4', ... } -- in "settings.py"?
⚈ cursor.execute("SET NAMES utf8mb4;") -- not as good as using 'charset'
⚈ db.set_character_set('utf8'), implies use_unicode=True
⚈ Literals should be u'...'
⚈ MySQL-python 1.2.4 fixes a bug wherein varchar(255) CHARACTER SET utf8 COLLATE utf8_bin is treated like a BLOB.
Checklist:
⚈ # -*- coding: utf-8 -*- -- (for literals in code) (This tells the interpreter which encoding to use when parsing the file; it matters only for files that contain non-ascii literals. It does not affect the code in any other way.)
⚈ charset='utf8' in connect() call -- Is that buried in bottle_mysql.Plugin? (Note: Try 'utf-8' and 'utf8')
⚈ Text encoded in utf8.
⚈ No need for encode() or decode() if you are willing to accept utf8 everywhere.
⚈ u'...' for literals
⚈ <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> near start of html page
⚈ <meta charset="utf-8" /> is a shorter version (if using only HTML5).
⚈ Content-Type: text/html; charset=UTF-8 (in HTTP response header)
⚈ header('Content-Type: text/html; charset=UTF-8'); (in PHP to get that response header)
⚈ CHARACTER SET utf8 COLLATE utf8_general_ci on column (or table) definition in MySQL.
⚈ utf8 all the way through
⚈ Use MySQL Connector/Python instead of pyodbc and MySQL Connector/ODBC
Side note: Python defaults to autocommit=off; this can be a rude surprise when your INSERTs vanish.
Showing hex: var.encode().hex()
References:
⚈ https://docs.python.org/2/howto/unicode ... icode-type
⚈ https://stackoverflow.com/questions/915 ... ding-mysql
⚈ https://dev.mysql.com/doc/connector-pyt ... targs.html
⚈ https://stackoverflow.com/questions/340 ... e-bytecode
How to store arabic text in mysql database using python?
The Python language environment officially only uses UCS-2 internally since version 2.0, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Since Python 2.2, "wide" builds of Unicode are supported which use UTF-32 instead;[16] these are primarily used on Linux. Python 3.3 no longer ever uses UTF-16, instead strings are stored in one of ASCII/Latin-1, UCS-2, or UTF-32, depending on which code points are in the string, with a UTF-8 version also included so that repeated conversions to UTF-8 are fast.
Python check for utf8 (versus utf8mb4)
The python package "MySQL-python" version needs to be at least 1.2.5 in order to handle utf8mb4.
import sys
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Python 2.7 issues; improvements in Python 3
json.dumps(mydict, ensure_ascii=False) avoids "\u...." strings.
If you have \u00ef type text, consider this for fixing it: codecs.decode(body, 'unicode-escape')
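For example, in Python 3 (both tricks side by side):

```python
import json, codecs

# ensure_ascii=False keeps real UTF-8 text instead of \u.... escapes
print(json.dumps({'city': 'São Paulo'}))                      # {"city": "S\u00e3o Paulo"}
print(json.dumps({'city': 'São Paulo'}, ensure_ascii=False))  # {"city": "São Paulo"}

# Undo \u.... escapes that leaked into plain text
print(codecs.decode('caf\\u00e9', 'unicode-escape'))          # café
```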
utf-8-sig
From codecs — Codec registry and base classes
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
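A short Python 3 demonstration of that behavior:

```python
# utf-8-sig writes the BOM (EF BB BF) on encode and strips it on decode
data = 'Résumé'.encode('utf-8-sig')
print(data[:3].hex())                 # efbbbf
print(data.decode('utf-8-sig'))       # Résumé
# Plain utf-8 decoding leaves the BOM in the text as U+FEFF
print(repr(data.decode('utf-8')[0]))  # '\ufeff'
```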
PHP
⚈ If using mysqli, do $mysqli_obj->set_charset('utf8mb4');
⚈ If using PDO, do something like $db = new PDO('dblib:host=host;dbname=db;charset=utf8mb4', $user, $pwd);
⚈ Alternatively, execute SET NAMES utf8mb4
Any of these will say that the bytes in the client are UTF-8 encoded. Conversion, if necessary, will occur between the client and the database if the column definition is something other than utf8mb4.
Be sure to have php-mbstring installed (see php.ini)
Random notes about PHP. (See also Best Practice, above.)
In php.ini: default_charset UTF-8 The default changed to UTF-8 in PHP 5.6.
To see what PHP has:
$hex = unpack('H*', $text);
echo implode('', $hex);
or simply echo bin2hex($text);
To get "U+" (unicode) notation, first fetch the data, then use PHP code trim(str_replace('\u', ' U+', json_encode('كتب عربية')), '" '); This will result in U+0643 U+062a U+0628 U+0639 U+0631 U+0628 U+064a U+0629.
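The same "U+" dump is a one-liner in Python 3, shown here for comparison:

```python
def u_plus(s):
    # Render each character as its Unicode code point
    return ' '.join('U+%04X' % ord(c) for c in s)

print(u_plus('كتب'))  # U+0643 U+062A U+0628
```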
PHP: Convert \unnnn to utf8:
$myr = '\u101c\u1031\u1011'; // Given this
$x = json_decode('"'.$myr.'"'); // Add quotes and decode
// $x is now a string, like လေထ
If you use header(), do: header('Content-type: text/plain; charset=utf-8');
PHP: undo_magic_quotes_gpc($_POST);
BjyProfiler needs charset=utf8; it apparently overrides other PHP settings.
If using mysqli, character_set_name() (formerly spelled client_encoding()) does NOT mirror SET NAMES.
$doc->loadHTML(''. $content);
Other Computer Languages
This is a list of hints on how to establish utf8 for various computer languages. Most have not been verified.
ADO.NET "charset=utf8"
AJAX
encodeURIComponent() may be useful.
Apache servers are (were?) configured to serve files in ISO-8859-1 by default, so you need to add the following line to your .htaccess file:
AddDefaultCharset UTF-8
ASP, ASP.Net
<%Response.charset="utf-8"%>
Mathias Bynens
Various tips
including:
Apache Config (/etc/httpd/conf/httpd.conf) :
AddDefaultCharset UTF-8 Apache .htaccess file :
AddCharset UTF-8 .htm AddCharset UTF-8 .html AddCharset UTF-8 .php
Connector/C
mysql_set_character_set('utf8mb4');
C#: C# code to undo (while reading?) double-encoded strings
Similar code in SQL:
CONVERT(UNHEX(HEX(CONVERT(CONVERT(col USING utf8) USING latin1))), CHAR)
Put this in the connection string:
id=my_user;password=my_password;database=some_db123;charset=utf8;
Lao and C#
Console.OutputEncoding = System.Text.Encoding.UTF8;
Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;CharSet=utf8;
Note! Use the lower case value utf8, not upper case UTF8, as the latter will fail. Connector/Net
codeigniter:
application/config/config.php - Note the dash :
$config['charset'] = 'UTF-8';
application/config/database.php :
$db['default'] = array(
...
'char_set' => 'utf8mb4',
'dbcollat' => 'utf8mb4_unicode_520_ci',
Note: Those are aimed at MySQL. dbcollat is used only for PHP's mysqli.
httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs, HTTP.UTF_8));
echo meta('Content-type', 'text/html; charset=utf-8', 'equiv');
In database.php, set char_set to utf8mb4 and dbcollat to utf8mb4_unicode_ci or utf8mb4_general_ci Codeigniter 3.0 notes
ColdFusion
<cfprocessingdirective pageEncoding="utf-8">
<cffile
action="read"
file="#settings.csvfile#"
variable="autodata"
charset="utf-8">
Django
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.mysql',
...
'OPTIONS': {
'charset': 'utf8mb4',
'use_unicode': True, },
},
}
my.cnf:
[mysqld]
character-set-server=utf8mb4
default-collation=utf8mb4_unicode_ci
[client]
default-character-set=utf8mb4
StackOverflow
client_encoding: 'UTF8' Django utf8 usage
The Django container (we use docker) was missing the LANG env var. Setting LANG to en_US.UTF-8 and restarting the container fixed the problem.
See also the Python notes, above.
Drupal, WordPress, and any other product that uses PHP: Turn off mbstring.encoding_translation. Probably everything to do with mbstring should be avoided.
Amend your settings.php database connection by adding:
$databases['default']['default'] = array(
'driver' => 'mysql',
'database' => 'databasename',
'username' => 'username',
'password' => 'password',
'host' => 'localhost',
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_general_ci',
);
More discussion of utf8mb4
go-mysql
[username[:password]@][protocol[(address)]]/dbname?collation=utf8mb4_unicode_ci https://github.com/go-sql-driver/mysql#charset
Grails Grails and utf8mb4
Remove the "characterEncoding=UTF" part in the dataSource url. Otherwise grails always forces MySQL to use UTF-8 instead of utf8mb4. E.g.
dataSource:
dbCreate: "update"
url: "jdbc:mysql://localhost:8889/mydbname?useUnicode=true&zeroDateTimeBehavior=convertToNull&autoReconnect=true"
Heidi
When importing, don't use "auto-detect", explicitly request "UTF-8" if appropriate. Heidi import
Java
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0. ucs2 versus utf16
⚈ What if you put a 5.1 slave off a 4.x master? (The German sharp-s could cause trouble.)
⚈ Replication type (SBR vs RBR)?
Add ?useUnicode=yes&characterEncoding=UTF-8 to the JDBC URL (or maybe it is =true and =utf8)
<%@ page language="java" pageEncoding="UTF-8"%>
<%@ page contentType="text/html;charset=UTF-8" %>
compileJava.options.encoding = 'UTF-8'
<form method="post" action="/your/url/" accept-charset="UTF-8">
To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.
utf8mb4 needs special handling in Connector/J
"For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3."
Jetty
jetty
Java Servlet
resource.setContentType ("text/html;charset=utf-8");
JSP
<%@ page contentType="text/html; charset=UTF-8" %>
Laravel
If you have trouble with the 191-character index limit, etc, in Laravel, you have these choices
⚈ 255 -> 191 (but make sure you don't currently have longer addresses)
⚈ utf8mb4 -> utf8 (Caution: This disallows Emoji and some Chinese characters) See config/database.php
⚈ INDEX(email(20)) (If it is UNIQUE or PRIMARY KEY, do not pick this option)
⚈ It is possible (in 5.6.3 or later) to reconfigure the server and table to allow bigger indexes; this is not the default until 5.7.7.
Stackoverflow
In the file config/database.php: 'mysql' => [..., 'charset' => 'utf8mb4', 'collation' => 'utf8mb4_unicode_ci', ...]
If you would like to log the MySQL of the query you can use this: \DB::enableQueryLog(); before the query and this: logger(\DB::getQueryLog()); afterwards. This will save the query to your laravel log file.
Navicat
Old versions of Navicat have an option for utf8, but not for utf8mb4. So, if you need utf8mb4,... In the encoding options for navicat connections, pick "use mysql encoding", not "utf8". Forum
mysql commandline tool: mysql --default-character-set=utf8mb4 and (or)
[mysql]
default-character-set = utf8mb4
StackOverflow
.NET
Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;CharSet=utf8;
node.js var connection = mysql.createConnection({ ... , charset : 'utf8mb4'}); stackoverflow
utf8 package
OrmLite ConnectionReset=False;
Powershell ISE and Powershell Console: $OutputEncoding = New-Object -typename System.Text.UTF8Encoding
library(RODBC)
imp <- odbcConnect("SERVER", uid = "***", pwd="***",DBMSencoding="utf8")
ODBC 3.5.1 probably won't handle utf8; either of these may be correct:
ODBC;DRIVER={MySQL ODBC 5.1 Driver};UID=...;PWD=...;PORT=3306;DATABASE=...;SERVER=127.0.0.1;CHARSET=UTF8;
strConnection="driver={MySQL ODBC 5.1 Driver}; server=...;uid=...;pwd=...;database=...;stmt=SET NAMES utf8;"
Unicode is not supported in ADO applications (including ASP) with the Connector/ODBC 3.51 driver. Please consider testing MySQL Connector/ODBC 5.1 which includes Unicode support for ASP.
Panda:
Stackoverflow on Panda
See also notes on Python.
df.to_csv('file.csv',encoding='utf-8-sig')
Panda Release 0.17.1 (Nov 13, 2015) fixes an issue with non-ascii characters in column (etc) names.
Perl:
use utf8;
use open ':std', ':encoding(UTF-8)';
my $dbh = DBI->connect("dbi:mysql:".$dsn, $user, $password, {
PrintError => 0,
RaiseError => 1,
mysql_enable_utf8 => 1, # Switch to UTF-8 for communication and decode.
});
# or {mysql_enable_utf8mb4 => 1} if using utf8mb4
Stackoverflow - double encoding and JSON
Perl and UTF-8
A workaround for an issue in Perl 5: binmode(STDOUT, ':unix:encoding(utf8):crlf');
PHP see above
pymysql:
import pymysql
con = pymysql.connect(host='127.0.0.1', port=3306,
user='root', passwd='******',
charset="utf8mb4")
Python see above
R / RStudio
Tools -> Global Options -> Code -> Saving, and set UTF-8
rs <- dbSendQuery(con, 'set character set "utf8"')
rs <- dbSendQuery(con, 'SET NAMES utf8')
options(encoding = "UTF-8") at the top of my main script from which I call my package seems to fix the issue with having non-ascii characters in my package code.
read_chunk(lines = readLines("TestSpanishText.R", encoding = "UTF-8")) (also file())
Character Encoding: https://support.rstudio.com/hc/en-us/ar ... r-Encoding
Rails: Rails and utf8mb4
Ruby on Rails:
"\xD0\x9A\xD0\xB0\xD1\x81\xD0\xBB".force_encoding("UTF-8") => "Щасл"
Watch out for rake db:reset . Consider, instead, rake db:drop && rake db:create && rake db:migrate . The reset just re-creates the tables from the already-stored schema, which had the wrong charset and collation; re-creating them from the migrations creates them with the new charset and collation.
RoR: application_controller.rb
def configure_charsets
response.headers["Content-Type"] = "text/html; charset=utf-8"
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
end
Edit your database.yml with below code line encoding: utf8
RocksDB, MyRocks not yet answered
ServiceStack ConnectionReset=False;
Spring/Hibernate: (See also Best Practice, above.)
Hibernate XML:
<property name="hibernate.connection.CharSet">utf8mb4</property>
<property name="hibernate.connection.characterEncoding">UTF-8</property>
<property name="hibernate.connection.useUnicode">true</property>
Connection url:
db.url=jdbc:mysql://localhost:3306/db_name?useUnicode=true&character_set_server=utf8mb4
The above changes were enough for me to upgrade from utf8 to utf8mb4 charset scheme.
As a side note, one clarification: UTF-8 is the character encoding, while utf8mb4 is a character set name that MySQL supports. MySQL's utf8mb4 is a superset of MySQL's utf8.
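That superset relationship is concrete: MySQL's utf8 (utf8mb3) stores at most 3 bytes per character, so anything beyond the BMP, such as Emoji, requires utf8mb4. A quick Python 3 check:

```python
# UTF-8 byte count per character decides utf8mb3 vs utf8mb4
for ch in ('e', 'é', '中', '😀'):
    n = len(ch.encode('utf-8'))
    print(ch, n, 'bytes --', 'needs utf8mb4' if n > 3 else 'fits in utf8mb3')
```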
Spring/Hibernate filter:
<form accept-charset="UTF-8">
Spring/Hibernate: <property name="url" value="jdbc:mysql://localhost:3306/miniprojetjee?useUnicode=true&connectionCollation=utf8_general_ci&characterSetResults=utf8&characterEncoding=utf-8"/> (or maybe it is =yes)
"Spring": @RequestMapping(value = "/getRegion2", produces={"application/json; charset=UTF-8"},method = RequestMethod.GET)
Hibernate manual
Spring MVC UTF-8 Encoding
<filter>
<filter-name>encoding-filter</filter-name>
<filter-class>
org.springframework.web.filter.CharacterEncodingFilter
</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
<init-param>
<param-name>forceEncoding</param-name>
<param-value>true</param-value>
</init-param>
</filter>
Spring Boot
Make sure you register Spring's CharacterEncodingFilter in your web.xml
Load the lib hibernate-spatial in pom.xml, and set the config spring.jpa.properties.hibernate.dialect: org.hibernate.dialect.MySQL5InnoDBDialect in application.yml Reference
sqlalchemy:
db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host,
database=db_schema,
query={ 'read_default_file' : foo.db_config, 'charset': 'utf8mb4' })
MySQL-Unicode in sqlalchemy
engine = create_engine('mysql://root:@localhost/testdb?charset=utf8', encoding = 'utf-8')
sqoop
sqoop import --connect jdbc:mysql://server.foo.com/db --table bar --direct -- --default-character-set=latin1 (Probably the latin1 should be replaced by the encoding you have.)
Tomcat
tomcat
Wildfly: Configuration -> Subsystems -> Undertow -> Servlet/JSP -> View -> Set "User listener encoding" to true.
WordPress: in wp-config.php, define('DB_CHARSET', 'utf8mb4'); and define('DB_COLLATE', ''); WordPress & utf8mb4
Tomcat/servlet/JSP/etc settings
vb.net / VBScript:
Dim MyConn As New MySqlConnection("Server=localhost;User Id=u;Password=x;Database=d;Charset=utf8")
<%@language="VBSCRIPT" CODEPAGE="65001" LCID=1033%>
Console.OutputEncoding = System.Text.Encoding.UTF8
web2py
db = DAL('mysql://username:password@localhost/test?set_encoding=utf8mb4')
web.xml
web.xml
Hex 80-9F of cp1252 (Windows-1252 vs ISO-8859-15) latin1 codes
MySQL 8.0
(Source)
This is out of date
Collations to add:
We are going to add following collations. The collation name contains:
a. character set name: "utf8mb4"
b. language's iso code: for example, "cs" for Czech
c. UCA version: "800"
d. accent / case insensitive: "ai_ci"
The complete list is:
Collation name language
------------------------------------------------------------
utf8mb4_cs_800_ai_ci Czech
utf8mb4_da_800_ai_ci Danish
utf8mb4_de_phonebook_800_ai_ci German (phonebook order)
utf8mb4_eo_800_ai_ci Esperanto
utf8mb4_es_800_ai_ci Spanish
utf8mb4_es_traditional_800_ai_ci Spanish (traditional)
utf8mb4_et_800_ai_ci Estonian
utf8mb4_hr_800_ai_ci Croatian
utf8mb4_hu_800_ai_ci Hungarian
utf8mb4_is_800_ai_ci Icelandic
utf8mb4_la_800_ai_ci Roman (classical Latin)
utf8mb4_lt_800_ai_ci Lithuanian
utf8mb4_lv_800_ai_ci Latvian
utf8mb4_pl_800_ai_ci Polish
utf8mb4_ro_800_ai_ci Romanian
utf8mb4_sk_800_ai_ci Slovak
utf8mb4_sl_800_ai_ci Slovenian
utf8mb4_sv_800_ai_ci Swedish
utf8mb4_tr_800_ai_ci Turkish
utf8mb4_vi_800_ai_ci Vietnamese
Instead, see (as of 8.0.0) utf8mb4 collations
Random notes
UTF-8 all the way through
Scan for 8-bit codes, find and decipher double-encoding
A table declared to be latin1, and containing latin1 bytes can be converted to utf8 via
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
Example from forums
Using Replication to assist in charset conversion?
utf8_general_ci will treat Résumé the same as resume -- MySQL doesn't 'see' accents
Exact match when _ci collation is on column - even using index
WHERE keyword = CONVERT('cho' USING binary)
Re: constants in SQL:
EXPLAIN EXTENDED SELECT ...;
SHOW WARNINGS;
sometimes gives extra clues into what charset/collation issues are being faced during a SELECT. It will probably contain 'introducers'; example: _utf8'abc'.
To ponder: convert(convert((binary col1) using latin1) using utf8)
phpMyAdmin might be connecting as "root", thereby bypassing init_connect = "SET NAMES utf8". This could defeat your attempts to be fully utf8. Possible workaround: Use a login that avoids SUPER priv. (This suggestion is good advice for all applications -- for security reasons.)
# Find rows with any 8-bit chars:
SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].';
# Remove all occurrences of one 8-bit code (hex 92, in this example):
UPDATE tbl SET colname = REPLACE(colname, UNHEX('92'), '');
Reference: how to replace hex in varchar field
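The RLIKE pattern is terse; here is a hedged Python 3 rendering of the same test, applied to a hex dump of the column value:

```python
import re

def has_8bit_hex(hexstr):
    # Same idea as: WHERE HEX(colname) RLIKE '^(..)*[8-F].'
    # i.e. some byte (a pair of hex digits at an even offset) has its high bit set
    return re.search(r'^([0-9A-F]{2})*[8-9A-F]', hexstr) is not None

print(has_8bit_hex(b'plain ascii'.hex().upper()))          # False
print(has_8bit_hex('café'.encode('utf-8').hex().upper()))  # True
```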
Converting latin1 hex to utf8 hex: HEX(CONVERT(CONVERT(UNHEX('F6') USING LATIN1) USING UTF8)) ⇒ C3B6
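The same conversion can be checked client-side in Python 3. (Caveat: Python's 'latin-1' codec is ISO-8859-1, while MySQL's latin1 is closer to cp1252; they agree for bytes A0-FF, including this one.)

```python
def latin1_hex_to_utf8_hex(h):
    # Client-side version of HEX(CONVERT(CONVERT(UNHEX(h) USING latin1) USING utf8))
    return bytes.fromhex(h).decode('latin-1').encode('utf-8').hex().upper()

print(latin1_hex_to_utf8_hex('F6'))  # C3B6 (ö)
print(latin1_hex_to_utf8_hex('E9'))  # C3A9 (é)
```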
(Note: The doc refers to version 5.5, but probably applies to all versions from 4.1 on.)
Connector/.NET possibly defaults to latin1. Consider changing 'connection string' if you need utf8.
Suggested for my.cnf by stackoverflow
stackoverflow
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
character-set-server = utf8
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
Caution: Setting default-character-set that way for the mysql client cannot be overridden by SET NAMES.
my.cnf: skip-character-set-client-handshake to pretend to be 4.0?
A possible double-encoding fix
$string = htmlspecialchars($string, ENT_QUOTES, "UTF-8");
my.cnf needs these to force the server and the client to use UTF8:
skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8
However, "The server character set and collation are used as default values if the database character set and collation are not specified in CREATE DATABASE statements. They have no other purpose." manual on charset-server
It is not clear whether skip-character-set-client-handshake is really needed.
For a description of the character_set_* settings and SET NAMES: manual on Connection Character Sets
In 5.6, this helps get Unicode rendering for mysql commandline tool for Windows: mysql --default-character-set=utf8 Windows Unicode Support
Also suggest (in my.ini)
[mysql]
default-character-set=utf8
<?xml version="1.0" encoding="utf-8"?>
REGEXP (RLIKE) with character class is likely to work wrong: Bug report from 2007
Manual Warning
<form method="post" action="/your/url/" accept-charset="UTF-8">
Changing default charset
Possible values for <meta http-equiv="Content-Type" content="text/html; charset=..." /> can be found in iana charsets
For example, use 'ISO-8859-9' for MySQL's 'latin5'.
How to fix a table that has a mixture of latin1 and utf8
Test for Chinese characters: REGEXP '^(..)*(E[2-9F]|F0A)' -- Caveat: This may not be precise. StackOverflow discussion
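A rough client-side analogue in Python 3, by code point rather than by UTF-8 lead bytes; it covers only the CJK Unified Ideographs block, so, like the REGEXP above, it is not precise:

```python
def has_cjk(s):
    # CJK Unified Ideographs: U+4E00..U+9FFF (main block only; extensions not covered)
    return any('\u4e00' <= ch <= '\u9fff' for ch in s)

print(has_cjk('中文'))   # True
print(has_cjk('hello'))  # False
```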
In my.cnf:
collation_server = utf8mb4_unicode_520_ci
sets
character_set_database | utf8mb4
character_set_server | utf8mb4
collation_database | utf8mb4_unicode_520_ci
collation_server | utf8mb4_unicode_520_ci
but not the things set by SET NAMES, nor collation_connection.
From language name to list of characters (dwell to see unicode, not utf8)
Illustrative examples of ALTERs
5.5.3 (10/2010) Changelog: The following obsolete constructs have been removed. The --default-character-set and --default-collation server options (use --character-set-server and --collation-server). Then...
5.7.6 (9/2015) Changelog: The following items are deprecated and will be removed in a future MySQL release. The global character_set_database and collation_database system variables are deprecated and will be removed in a future version of MySQL. Assigning a value to the session character_set_database and collation_database system variables is deprecated and assignments produce a warning. The session variables will become read only in a future version of MySQL and assignments will produce an error. It will remain possible to access the session variables to determine the database character set and collation for the default database.
LOAD DATA added syntax to include charset valid for the file to be loaded. In 4.1, LOAD DATA utilises database charset. Bug 10195
Check for capitalized Swedish words
shows the difficulty of using REGEXP, and provides a workaround.
When a Stored Routine is created, it has the current CHARACTER SET and COLLATION attached to it. Beware -- this may not be what you expect later. Use SHOW CREATE PROCEDURE name to see the setting.
"ch and ll are no longer letters since 1994" - This probably means that "spanish2" collations are not needed for current texts. Reference (in Spanish).
LIKE != = -- (a LIKE b) is almost the same as (a = b) in that collations matter. However trailing spaces can lead to different actions. And 2-byte letters (ae) are treated as two characters in LIKE, but perhaps one character, such as when compared to (ä) with utf8_german2_ci. SO Discussion
Diacritics used in Europe
$str = mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES');
Changing case
Examples of mismatch of utf8 and utf8mb4
Search for emoji
Concerning LOAD DATA INFILE: The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies “no conversion.”
Transliterate accented and Cyrilic characters into latin
A messy case where double-encoding was visible
Stored function to decode unicode escapes
ALTER TABLE mytable DEFAULT CHARACTER SET utf8 can create double encoding
If having trouble importing utf8mb4 characters, give this a try:
USE my_example_db_name;
# Select the right charset
SET NAMES 'utf8mb4';
# Import from SQL file
SOURCE /somewhere/dump.sql;
# Disconnect from SQL server
EXIT
Footnote
If this solves your problem, please let me know. My ego needs a pat on the back. If it does not solve your problem, please contact me; maybe I can pull a special rabbit out of my hat, just for you.
Original writing -- Sep 2010, after struggling with hundreds of questions both in-house and from the forums. Alas, this does not achieve the ultimate goal of "Given this symptom, do that". (See 2015) I hope it helps most people.
Revision date: 2012-08-26 (This document continues to evolve as more problems and examples are discovered.)
Revision date: 2013-02-12 (Misc, plus added 8-bit encodings)
Revision date: 2015-March/April Finally getting close to "If you see this, this is what happened, and here's how to fix it". Add better summary of what errors occur and how to fix them. Add/fix several links.
Revision date: 2016-01-25 Shuffled language info.
Revision date: 2016-09-14 Beginnings of 8.0, including utf8mb4...unicode_800_ci
-- Rick James
MySQL Documents by Rick James
HowTo Techniques for Optimizing Tough Tasks:
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Overview Summary Tables High speed ingestion
Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative
5 methods for 'Find Nearest'
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARTITION use)
Lat/Long representation choices
Z-Order 'find nearest'(under construction)
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Build and execute a "Pivot" SELECT (showing rows as columns)
Find largest row for each group ("groupwise max")
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
Python tips, PHP tips, other language tips
utf8 Collations utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table wp_postmeta UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Request for tuning / slowlog info
Best of MySQL Forum -- index of lots of tips, discussions, etc
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS Analyze SlowLog
My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: View Rick James's profile on LinkedIn
Did my articles help you out? Like what you see? Consider donating:
Buy me a Banana Latte ($4) There is no obligation but it would put a utf8mb4 smiley on my face, instead of the Mojibake "🙂"
The tips in this document apply to MySQL, MariaDB, and Percona. Version differences are called out where relevant.
Basic Concepts
"Character" and "byte" are different! You must understand this before continuing. A "byte" is an 8-bit thing; it is the unit of space in computers (today). A "character" is composed of one or more bytes, and represents what we think of when reading.
A byte can represent only 256 different values. There are over 11,000 Korean characters and over 40,000 Chinese characters -- no way to squeeze such a character into a single byte.
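The byte counts are easy to see in Python 3:

```python
# One character can take 1 to 4 bytes in UTF-8; Korean and Chinese need 3 (or more)
for ch in ('A', 'é', '한', '中'):
    print(ch, '->', len(ch.encode('utf-8')), 'byte(s)')
```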
Charset vs collation. These are different things! 'Charset' ('character set'; 'encoding') refers to the bits used to represent 'characters'. 'Collation' refers to how those bits are compared for inequality (WHERE) and sorting (ORDER BY). GROUP BY and FOREIGN KEY CONSTRAINTs can also involve collation. Collation even decides whether two different bit strings compare as 'equal'.
We address the cases where "acute-e" (é) does not show up correctly. If that acute-e shows up as A-tilde and Copyright, then there may be an issue with the browser. Try a different browser: Chrome works; Firefox is broken.
History
1950's -- A character was (sometimes) represented in only 5 bits, on "paper tape"; no lowercase; had to "shift" to switch between letters and digits.
1960's -- A character was 6 bits; no lower case. "Punched cards" were used.
1970's -- 7-bit ASCII becomes common -- that limits you to English. And the 8-bit "byte" was invented and was coming into common usage (re: IBM 360).
1980's -- Beginning to see 8-bit encodings. Especially since 7-bit ASCII was wasting a bit of the omni-present "byte". This can handle Western European accented letters.
1990's -- The computer world realizes that there are other people in the world and embarks on Unicode and UTF8. ("UTF" = "Unicode Transformation Format")
Meanwhile, MySQL is born, but has enough problems without worrying about character sets. Through version 4.0, a CHAR is just a byte. You can put any kind of bytes, representing anything, into a VARCHAR. Thus, begins the need for this discussion.
MySQL 4.1 introduced the concept of "character set" and "collation". If you had legacy data or legacy code, you probably did not notice that you were messing things up when you upgraded. Or you started with 4.1 (or later) and "latin1 / latin1_swedish_ci" and failed to notice that you were asking for trouble.
Today, it's pretty safe to simply lay down the law and say "Use utf8 for all text." If you have version 5.5.3 or later, "Use utf8mb4 for all text."
Sure, there are other character sets, and they are useful if you have a narrow focus. But, you may as well use utf8mb4.
Another take on the History
Best Practice
Best practice is to go completely utf8mb4. I will focus on utf8 and utf8mb4, but if you choose to do otherwise, keep reading; most of this discussion can still be adapted to the charset of your choice.
For collation, probably the best 'overall' collation is utf8mb4_unicode_520_ci. The xxx_unicode_520_ci collations are based on UCA 5.2.0 weight keys: https://www.unicode.org/Public/UCA/5.2.0/allkeys.txt. Collations without the "520" are based on the older UCA 4.0.0.
A web page (if that is what you have) should begin with <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Establish the characterset for talking to the MySQL server:
⚈ init_connect = 'SET NAMES utf8' in my.cnf
⚈ SET NAMES utf8 -- done immediately after connecting
⚈ Setting the character_set_% variables individually -- but SET NAMES sets the three you need.
⚈ In some cases in your my.cnf (my.ini) file, add this under [mysql] or [client]: default-character-set = utf8
For PHP:
⚈ mysql_* interface (deprecated; DO NOT USE!): mysql_set_charset('utf8'); (assuming PHP 5.2.3 & MySQL 5.0.7)
⚈ mysqli interface: mysqli_set_charset('utf8') function. See mysqli_set_charset
⚈ PDO interface: set the charset attribute of the PDO dsn or via SET NAMES utf8mb4.
(from #29 in Devshed's PHP tips)
PDO manual
For Java (JDBC):
⚈ ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call.
For Hikari (perhaps):
⚈ spring.jpa.properties.hibernate.connection.characterEncoding=utf-8
⚈ spring.jpa.properties.hibernate.connection.CharSet=utf-8
⚈ spring.jpa.properties.hibernate.connection.useUnicode=true
(or maybe it is =yes and =utf8)
add
  spring:
    datasource:
      connectionInitSql: "SET NAMES 'utf8mb4'"
in the application.yml. connectionInitSql is used by HikariCP when it opens the connection.
For Java/JSP
Something about 'filters'
SET NAMES can be invoked by your language just like other non-SELECT commands: mysqli_query(), do(), execute(), etc.
I digress here for a note about GRANTs: 'root', and any other user granted ALL ON *.*, has the SUPER privilege and therefore skips init_connect, so that technique is not 'perfect'.
SET NAMES
Declare most CHAR/TEXT columns in all tables as CHARSET utf8. Some exceptions: columns with hex (MD5, GUID), ... These should be BINARY or CHAR charset ASCII.
Collation?
For true text, utf8_unicode_ci is best. It handles the complex rules of "combining characters", etc. It handles essentially all languages simultaneously, but compromises on ordering.
utf8_general_ci is the default for utf8, so you may accidentally get this. It is a little faster than utf8_unicode_ci and works OK for many situations.
utf8_bin says to just compare the bytes. CHAR/TEXT utf8 with utf8_bin validates (on INSERT) that the bytes comprise valid utf8, but does not do anything useful for comparisons other than exact (no case folding, etc) equality. BINARY/BLOB should usually be used instead of CHAR+utf8; this stores the bytes without any checking.
Alas, MySQL's collations think of case folding and accent-stripping as equivalent. That is, there is no utf8 collation for case folding, but keeping accents distinct. Do SHOW COLLATION LIKE '%cs'; to see the few collations that work that way.
This document is evolving. It started with just 'utf8', but the 'standard' is becoming 'utf8mb4'. The document is inconsistent as to which it specifies. If you are running MySQL before 5.5.3, you have only 'utf8'. If you need Emoji or Chinese, then you need 'utf8mb4'.
Conversions and Common Errors
This is a summary of how things work, and what is likely to go wrong.
There are 3 dimensions to character set problems:
⚈ How the client's bytes are encoded when INSERTing, and what encoding you want when SELECTing;
⚈ What SET NAMES you use (or what the default is);
⚈ The CHARACTER SET on the column definition.
All is well if the SET NAMES agrees with the encoding of the bytes in the Client. The column's CHARACTER SET need not agree with SET NAMES; if they differ, a conversion will be performed for you. If your characters exist in both encodings, the conversion will be transparent. If a source character does not exist in the target encoding (example: when converting a Chinese character from utf8mb4 to latin1), a "?" is usually put in its place.
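The "?" substitution for unmappable characters can be sketched outside MySQL, too. Here is an illustration in Python (Python's strict latin-1 stands in for MySQL's latin1, which is really cp1252-like; the behavior shown is Python's errors="replace", which mimics MySQL's substitution):

```python
# A Chinese character exists in utf8mb4 but has no latin1 equivalent:
s = "中"
print(s.encode("latin-1", errors="replace"))   # b'?' -- like MySQL's '?'

# é exists in both charsets, so the conversion is transparent:
print("é".encode("latin-1").hex())             # e9
```

The same character that converts cleanly one way (é) is lost the other way (中 ⇒ ?), which is why conversions between charsets are only safe when the target covers the source.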
The main thing that can go wrong is that SET NAMES can disagree with the Client's bytes. Let's walk an acute-e (é) through the INSERT and SELECT. First, note the encodings for é:
⚈ Hex for latin1's 1-byte encoding: E9
⚈ Hex for utf8's 2-byte encoding: C3A9
(BTW, this and some other encodings are enumerated below.)
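Those two byte values can be verified with a couple of lines of Python (shown only to illustrate the encodings; MySQL performs the same mapping):

```python
# The single character e-acute has different byte encodings per charset:
e_acute = "é"
print(e_acute.encode("latin-1").hex())  # e9    -- 1 byte in latin1
print(e_acute.encode("utf-8").hex())    # c3a9  -- 2 bytes in utf8
```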
Case 1: SET NAMES latin1 (the default) is in effect and your application is thinking in utf8. It INSERTs é encoded as utf8 (hex C3A9):
⚈ C3A9 will become two latin1 characters: 'Ã©' "on the wire". (C3 represents Ã in latin1; A9 represents ©.)
⚈ If the CHARACTER SET for the column is latin1, then those 2 characters (2 bytes) are inserted.
⚈ If the CHARACTER SET for the column is utf8, then they are converted to utf8: Ã ⇒ C383 and © ⇒ C2A9; you insert 2 utf8 characters (4 bytes: C383C2A9) into the table.
Continuing the case, let's do a SELECT:
⚈ For a utf8 column, the C383C2A9 is converted back to latin1 'Ã©' (C3A9) for sending across the wire to the client.
⚈ For a latin1 column, no conversion is performed, so, again, C3A9 goes across the wire.
⚈ The client receives the two bytes C3A9, thinking they are 'Ã©' (because of SET NAMES). However, since the rest of your application is thinking utf8, it sees them as é.
C383C2A9 is an example of what I call "double encoding" because of the two 'wrong' conversions from latin1 to utf8. The resulting SELECT (at least for European text) looks exactly like what you INSERTed. However, comparisons (WHERE x>y), sorting (ORDER BY) and, in a few cases, equality (x=y) will not work as expected.
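The whole Case-1 round trip can be reproduced in Python, where decode/encode plays the role of MySQL's charset conversions (a sketch; Python's strict latin-1 is close enough to MySQL's cp1252-like latin1 for these characters):

```python
# INSERT: client sends utf8 bytes C3A9, but SET NAMES latin1 is in effect.
wire = "é".encode("utf-8")              # b'\xc3\xa9'
as_latin1 = wire.decode("latin-1")      # MySQL sees two chars: 'Ã©'
stored = as_latin1.encode("utf-8")      # utf8 column stores 4 bytes
print(stored.hex())                     # c383c2a9 -- double encoding

# SELECT: the two wrong conversions reverse each other.
back = stored.decode("utf-8").encode("latin-1").decode("utf-8")
print(back)                             # é -- looks right, but sorts wrong
```

The SELECT looks fine, which is exactly why double encoding goes unnoticed until an ORDER BY or CHAR_LENGTH exposes it.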
Now, let's flip the situation: Case 2: SET NAMES utf8, but you are inserting é encoded as latin1 (E9):
⚈ SET NAMES is falsely asserting that E9 is a valid utf8 encoding of something, which it is not.
⚈ The INSERT finishes (with a WARNING that you probably ignored), but the string is truncated before the é.
⚈ The CHARACTER SET of the column does not matter since the string is truncated.
⚈ A SELECT will get only the truncated string.
Case 3: One application INSERTs and another SELECTs, but they are using different SET NAMES. All sorts of messes can occur.
Case 4: You have an old latin1 table (possibly dating back to MySQL 4.0, which did not really have any CHARACTER SET concept) and you want to modernize it to utf8 or utf8mb4. First, you need to see what is in it. (A later section discusses using HEX to help.)
⚈ All ASCII -- no problem.
⚈ All latin1 (eg, E9 for é) -- Continuing to call it latin1 is fine, unless you anticipate adding some Asian text in that column.
⚈ All utf8 (eg, C3A9 for é) -- Trickier. ALTER to VARBINARY/BLOB, then ALTER back to VARCHAR/TEXT. See the 2-step ALTER below.
⚈ A mixture of latin1 and utf8 -- Warm up your resume.
Case 5: A table is declared to be latin1 and correctly contains latin1 bytes, but you would like to change it to utf8.
⚈ ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4; -- This actively changes the necessary bytes in the columns
or
⚈ ALTER TABLE tbl MODIFY col1 ... CHARACTER SET utf8mb4;
is similar to the above, but works only one column at a time, and needs exactly the right stuff in the MODIFY clause. Hence, it would be quite tedious.
What symptoms exist for the cases?
Case 1 is virtually symptomless. SELECT usually gets what was INSERTed; comparison problems may go unnoticed.
Case 2 is often recognized by having truncated strings.
Case 3 is a mess.
Case 4 is symptomless. However, if the conversion is not done properly, you can end up with some other case.
Case 5 is symptomless, but you want to branch out in the world from your original, limited, CHARACTER SET.
When do bytes get translated?
If you do all the "right things", you will accidentally have a "correctly" performing application. You need to understand when and how bytes are modified, in order to fix existing problems and to build your application correctly.
Short answer: You tell MySQL how the bytes in your application are encoded; you tell MySQL what encoding to use in your tables; MySQL converts as needed when INSERTing / SELECTing.
You tell MySQL about your application bytes by using SET NAMES or the underlying variables. You tell MySQL about the table bytes in CREATE TABLE.
Actually, the charset and collation of a column has a long, not too consistent, defaulting mechanism:
⚈ Default for system
⚈ Default for a database ...
⚈ Default for a table
⚈ Settings for a column
⚈ Fields/literals can be converted on the fly
⚈ A different collation can be specified in a WHERE clause...
To see what charset & collation are in effect for a column, use SHOW CREATE TABLE.
Life Cycle of a Character
Data ➡ Your client application ➡ INSERT into MySQL table. Simple? No.
You get the data from somewhere. Is the e-acute in "Beyoncé" represented in one latin1 byte, hex E9? Or is it the two-byte utf8 C3 A9? Not knowing which you have is the first step toward having a mess in the database. More in a minute.
Meanwhile, you have declared a column in a table in a database in MySQL to be utf8. (Collation does not matter in this discussion.)
Now you INSERT INTO Tbl (name) VALUES ("Beyoncé"). And the mess continues.
Was that value being inserted encoded in latin1 or utf8? MySQL is happy to take either, and will gladly do any transcoding to store it properly in the table. But you have to tell it! The default may not be right!
Your application announces to MySQL the encoding via any of
⚈ The VARIABLE character_set_client
⚈ Executing SET NAMES
⚈ init-connect in my.cnf (my.ini) (Caution: root bypasses this)
I recommend you execute SET NAMES in your application, right after connecting to the database. This reminds you, in your code, of the encoding you are establishing for the connection.
Example of a mess: The data is in utf8, the connection settings are in latin1. The table is in utf8. So, MySQL converts each of C3 and A9 (which it was told was latin1) to utf8, leading to 4 bytes in the table: C3 83 C2 A9. I call this double-encoding. The problem was caused by lying (usually accidentally) to MySQL about what encoding was in the application's data.
Another example: The data was loaded in an old system in the following manner. The application got utf8 bytes and blindly stored them in a latin1 column (using the default latin1 settings for character_set% -- or it was done in 4.0, which had no character set concept). This left e-acute as 2 bytes in the table in a latin1 column. Then, you decide to 'do the right thing' and switch to utf8. Now what?
BIG5 example: You have an application that lets the user enter data into a form. This data is stored into a VARCHAR. All along the way, you use latin1. But the user connected from China with his browser set to BIG5. The latin1 column happily accepted the 2-byte Chinese characters. But your application reading the data has no clue of actual encoding. That same user, connecting in that same way, and rereading his data, will probably see the latin1 treated as BIG5, and be oblivious of the "two wrongs make a right" that happened. But display that text elsewhere, and you have a mess.
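The BIG5 scenario can be sketched in Python: the same bytes read back through the same wrong charset look fine; read any other way, they are garbage (illustration only; Python's big5 codec and strict latin-1 stand in for the browser and MySQL):

```python
# User's browser sends BIG5 bytes; everything is declared latin1.
big5_bytes = "中文".encode("big5")      # b'\xa4\xa4\xa4\xe5'
stored = big5_bytes.decode("latin-1")   # stored "as latin1", byte-for-byte

# Same user, same wrong settings: two wrongs make a right.
print(stored.encode("latin-1").decode("big5"))   # 中文

# Any other consumer sees latin1 garbage:
print(stored)                                    # ¤¤¤å
```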
How Mangling Happens on INSERT
When INSERTing a string, 3 things make a difference on whether the inserted value is good or bad.
⚈ Data in client is _encoded_ latin1 versus utf8
⚈ SET NAMES (or equivalent) is latin1 versus utf8
⚈ The target column's CHARACTER SET is ascii vs latin1 vs utf8
The outcome is good or bad according to this _ordered_ list of possibilities:
Client Byte Encoding SET NAMES column CHARSET Outcome
-------------------- ---------- -------------- ---------
7-bit (eg, English) (any) (any) GOOD: bytes stored unmodified
latin1 8-bit utf8 (any) BAD: String truncated
any 8-bit byte (any) ascii BAD: '????'
utf8 (not West Europe) utf8 latin1 BAD: '????'
utf8 (west Europe) utf8 latin1 GOOD: latin1 has the equivalent values
(any) Matches Enc Matches Enc GOOD: bytes stored unmodified
(any) Matches Enc different GOOD: bytes transcoded to CHARSET
utf8 8-bit latin1 latin1 BAD: 'é' -> 'Ã©' and similar garbage; see "2-step ALTER"
utf8 8-bit latin1 utf8 BAD: see "double encoding"
In the cases of truncation and '?', data is lost. In the other two 'BAD' cases, it is possible to repair the table (in 2 different ways).
Mixing of utf8 and utf8mb4 (client byte encoding vs. SET NAMES vs. CHARSET) may lead to truncation. Perhaps: SET NAMES utf8 -> truncation and CHARSET utf8 -> '?'
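The "String truncated" row of the table can be seen outside MySQL, too: hex E9 is simply not a legal start of a utf8 sequence, which is why MySQL stops at that byte. A sketch in Python:

```python
# latin1 bytes for "Beyoncé": the final E9 is not valid utf8.
raw = bytes.fromhex("4265796f6e63e9")
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid utf8 at byte", e.start)   # byte 6 -- MySQL truncates here
```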
Disaster before INSERT
Most Mojibake starts even before the text is in the table. Let's dissect the steps of doing an INSERT.
Step 1 -- Get the data.
First you get the data from somewhere
⚈ Type it in, using a special keyboard, or keyboard tricks
⚈ Load it from a file
⚈ Create it from constants in your client programming language
⚈ Receive text from an html form
What do you have? You probably don't know. There is no single standard for what bytes are generated.
Step 2 -- INSERT
Let's walk through inserting a single character: e-acute. INSERT INTO tbl VALUES ('é'); or $str = ...; INSERT INTO tbl VALUES ('$str'); Well, this is really two steps.
Two likely cases for the encoding of e-acute:
⚈ latin1 (or code pages 1250, 1252): hex E9
⚈ utf8: 2 hex bytes: C3A9
Step 2a -- Send the command to the MySQL server
The bytes that make up the INSERT statement are sent from the client. Here enters the VARIABLE character_set_client. This tells the client what the bytes mean.
What is logically sent across the wire:
character_set_client    Client bytes: E9    Client bytes: C3A9
latin1                  é                   Ã©
utf8                    (truncated)         é
⚈ Upper-left and lower-right: The correct 'character' is sent.
⚈ Upper-right: Since you said the bytes are 'latin1', then C3A9 must be the two chars 'Ã©'.
⚈ Lower-left: Since E9 (and what follows) is not valid utf8, the string is truncated.
Step 2b -- Do the INSERT
The 'characters' on the wire are now transliterated to the desired encoding for the table.
column's charset        On the wire: é      On the wire: Ã©
latin1                  E9                  C3A9
utf8                    C3A9                C383C2A9
⚈ Left column: all is well.
⚈ Upper-right: Mojibake. Treated as latin1, C3A9 looks like 'Ã©'
⚈ Lower-right: "Double encoding" -- discussed at length below.
See also: Mojibake
Thread with data entry issues.
Graphemica é
Another thread
Diagnosing CHARSET issues
You have some non-English characters that are showing up strangely, perhaps as a question mark (?), perhaps as two letters when you were expecting one, such as ç being displayed as 'Ã§'.
First, let's get the lay of the land:
SHOW SESSION VARIABLES LIKE 'character_set%';
| character_set_client | latin1
| character_set_connection | latin1
| character_set_results | latin1
(There will be other rows; these three are the important ones.) SET NAMES sets those three; this example shows the result of SET NAMES latin1, not the recommended setting. More later.
SHOW CREATE TABLE tbl will show the charset and collation for each column. More later.
You have data in a table, but it is garbled when displaying it. The first thing to do is to decide whether the data was stored correctly. Find a small cell that is bad, and do this:
SELECT x, HEX(x), LENGTH(x), CHAR_LENGTH(x) FROM ... WHERE ... LIMIT 1
You will get 4 columns:
⚈ x -- probably not useful, but may help confirm that you got the cell you wanted.
⚈ HEX(x) -- This lets us look at the encoding; more later.
⚈ LENGTH is the number of _bytes_.
⚈ CHAR_LENGTH is the number of _characters_. (Remember the distinction, above?)
LENGTH vs CHAR_LENGTH -- Are they equal? For correctly stored utf8 containing any non-English characters, they should not be equal. Let's look at the two lengths (bytes:characters) for specific types of characters. For a single character, you would see:
⚈ ASCII: 1:1
⚈ LATIN1: 1:1
⚈ English in utf8: 1:1 -- Ascii is a subset of utf8
⚈ Western European accented letters: 2:1 -- one character is encoded in two bytes
⚈ Eastern Europe: also 2:1 (mostly)
⚈ East Asia: mostly 3:1
⚈ "double encoding" 4:1 (Europe) or 6:1 (Asian)
⚈ With utf8mb4, a few Chinese characters will be 4:1
A typical phrase in French might have LENGTH=20 and CHAR_LENGTH=17, implying that there were 17 characters, 3 of which had accents.
latin1 is happy to cram Western European characters into 1 byte.
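The two functions map directly onto Python's len() of a str vs its encoded bytes (the French phrase is a made-up sample for illustration):

```python
# LENGTH() counts bytes; CHAR_LENGTH() counts characters.
phrase = "Bonne année, Zoé"             # hypothetical sample, 2 accented letters
print(len(phrase))                      # 16 -- like CHAR_LENGTH
print(len(phrase.encode("utf-8")))      # 18 -- like LENGTH: each é takes 2 bytes
```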
"Double encoding" is a term I made up for the following situation. It is a case where "two wrongs make a right".
⚈ Table defined utf8
⚈ INSERTer declared latin1 (used default instead of doing SET NAMES)
⚈ Data being INSERTed is actually utf8 already.
What happened:
⚈ A 2-byte letter (say, a grave-e) was correctly represented in utf8.
⚈ The INSERT statement handed the 2 bytes to MySQL, but implied the need for conversion
⚈ Each byte was converted to utf8
⚈ The table has 4 bytes.
When SELECTing, the reverse happens (4 bytes -> 2 bytes -> the original character), and the user is oblivious of the bug. Looking at CHAR_LENGTH will spot it. Strange orderings may happen. details of a double encoding
A good writeup on double-encoding
Double Encoding 'answer'
Looks like Mojibake, but really half-Double-Encoding
If you are moving the data using mysqldump, consider looking at the dump file via some hexdump utility. For experimenting, mysqldump lets you specify one database, one table, and constrain the rows via --where="...".
In the hexdump, you need to be able to recognize various byte patterns: In this, "yy" stands for the range 80-BF.
⚈ 09,0A,0D -- The only control character you are likely to encounter (TAB, NL, CR)
⚈ 20 -- space (handy to know when reading hex dump)
⚈ 21-7E -- Ascii characters
⚈ C2yy -- symbols
⚈ C3yy -- Typical accented letters of Western Europe
⚈ Cxyy -- More Western Europe: Latin (C3-CA), Combining Diacritical Marks (CC-CD), Greek (CE-CF)
⚈ Dxyy -- Cyrillic (D0-D4), Hebrew (D6-D7), Arabic/Persian/Farsi (D8-DB), etc
⚈ E0yyyy -- various Indian character sets, southern Asia, etc.
⚈ E1yyyy -- Cherokee, Balinese, Khmer, Mongolian, etc.
⚈ E2yyyy -- Symbols, Braille, etc
⚈ E381yy -- Hiragana (Japanese)
⚈ E383yy -- Katakana (Japanese)
⚈ Exyyyy -- E3-EA: Chinese, Japanese; EAB0-ED9E: Hangul (Korean)
⚈ EFBBBF -- BOM indicator (start of file only)
⚈ EFAxyy -- CJK Extension A
⚈ EFACyy -- Hebrew (obscure)
⚈ EFBxyy -- Arabic (obscure)
⚈ F0yyyyyy -- 4-byte utf8. (See notes about utf8mb4)
⚈ F0Axyyyy -- CJK Extension B
Note, especially, the absence of 80-BF unless preceded by some other 8-bit code.
Codepage layout (Wikipedia ref)
Another more detailed list
(but aimed at Unicode instead of utf8)
If the data is not utf8, you may see
⚈ 00xx -- Lots of these pairs could mean Ascii in UCS2. (Shift-JIS is also possible)
⚈ FFFE (or FEFF) at start of file (BOM) -- UTF-16 file (not a cell in a MySQL table)
⚈ 00-1F -- control characters (mostly 0a, newline)
⚈ 20 -- space (handy to know when reading hex dump)
⚈ 21-7E -- Ascii characters
⚈ 7F -- DEL; rare
⚈ 80-FF -- "8-bit characters":
⚈ 80-9F -- Possibly MicroSoft Word things (smart quotes, etc)
⚈ 80-FF -- valid latin1 encodings, mostly European accented letters
⚈ A0 -- "hard space" -- mostly from MicroSoft
⚈ Cx not followed by yy (80-BF) cannot be utf8. This is usually a quick clue that it is not utf8.
Note: The Microsoft things really need to be fixed _before_ trying to store into a table. MicroSoft characters
In hex, "double encoding" will have a lot of "Cx yy Cx yy".
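That "Cx yy Cx yy" fingerprint can be screened for mechanically. This is a rough heuristic I am sketching here (not from the original document): it will also fire on a legitimate run of 2-byte utf8 characters, so treat a match as a reason to look closer, not as proof.

```python
import re

# Rough screen over HEX() output for double encoding: two or more
# consecutive Cx yy pairs, where yy is a continuation byte in 80-BF.
double_enc = re.compile(r'(?:C[0-9A-F][89AB][0-9A-F]){2,}', re.I)

print(bool(double_enc.search("C383C2A9")))   # True  -- doubly encoded é
print(bool(double_enc.search("C3A9")))       # False -- plain utf8 é
```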
The manual
charset-connection.html
is the manual page that explains the interactions of the vital variables:
⚈ character_set_client -- the encoding in the client.
⚈ character_set_connection -- what to re-code characters into for client-to-server transfer
⚈ character_set_results -- what to re-code characters into for server-to-client transfer
The page also discusses collation_connection and collation_database, which have less impact.
Between that manual page and this document, I hope that you can understand things as they relate to your situation.
To see your current settings:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
Viewing text
Alas, CMD (Windows) and perhaps other terminal emulators cannot (or will not by default) display utf8 characters. This makes it difficult to debug. And it is confusing to see incorrect characters when everything else is working correctly.
Viewing text in a browser can be misleading in a different way. Browsers may "fix" things for you. For example, 'double encoded' characters may appear to be correct in the browser!
Bottom line: SELECT HEX() is about the only thing to trust.
Fixes for various Cases
If you are running MySQL before 5.5.3, you have only 'utf8'. If you need Emoji or Chinese, then you need 'utf8mb4'. The advice in this section assumes you will be going to utf8mb4. If you are going only to utf8, adjust accordingly.
Table is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
Column is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4 without touching other columns:
ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4;
(Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset:
First, lets assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
CHARACTER SET utf8mb4 with double-encoding:
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);
Another ref
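What that UPDATE does, byte for byte, expressed in Python (treating MySQL's latin1 as Python's latin-1; MySQL's latin1 is really cp1252-like, but for this example plain latin-1 suffices):

```python
# Stored (double-encoded) utf8 bytes for what should be a single é:
stored = bytes.fromhex("c383c2a9")

# CONVERT(col USING latin1): utf8 'Ã©' -> latin1 bytes C3A9
# BINARY(...):               keep those bytes as-is
# CONVERT(... USING utf8mb4): reinterpret C3A9 as utf8 -> 'é'
fixed = stored.decode("utf-8").encode("latin-1").decode("utf-8")
print(fixed)   # é
```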
CHARACTER SET latin1 with double-encoding: Do the 2-step ALTER, then fix the double-encoding.
Question marks / Truncation: The data is lost.
Stored Routines: Stored Procedures, etc, may need to be recreated if, for example, they were built with utf8 but need to be utf8mb4. StackOverflow
See also Fixing column encoding
Entering accents in CMD
If isn’t displays as isn’t, you forgot SET NAMES utf8. If isn’t displays (in the mysql commandline tool in Windows) as isn?t, see the above info.
As noted above, CMD in Windows is limited in its character set handling.
The following is a clumsy way to enter Western European characters. It depends on your knowing the hex or decimal values for the characters you want. Sorry, this won't handle Hebrew, Cyrillic, Chinese, etc.
Press and hold ALT, then type
⚈ the decimal digits for the code-page value, or
⚈ 0, then the decimal digits of the Unicode value
to get the extra characters in latin1. When you let go, the accented character will appear. This implies that you need SET NAMES latin1, not utf8, since you are creating latin1 bytes, not utf8 bytes, in your 'client'. For the encodings, see
⚈ shortcuts.
More
info on how to enter Unicode, and about "code pages".
The command "chcp" controls the "code page". chcp 65001 provides utf8, but it needs a special charset installed, too. some code pages
To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console
5.6 ref manual
Note: There were bugs in 5.5's use of utf8 in the commandline 'mysql'.
GNU Unifont
Example of Double-encoding
Desired string: სახლი არის
Hex in table:
C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1
Correct utf8 hex:
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
Analysis:
C3A1C692C2A1 -- should be E183A1 -- Georgian SAN (U+10E1) "ს"
C3A1C692C290
C3A1C692C2AE
C3A1C692C5A1
C3A1C692CB9C
20 -- space (same in utf8 and latin1)
C3A1C692C290
C3A1C692C2A0
C3A1C692CB9C
C3A1C692C2A1
Georgian, when properly encoded in utf8, should be 3 bytes per character:
E182xx
E183xx
E2B4xx
Taking the first "character" (C3A1C692C2A1, in hex):
⚈ latin1 E1 = utf8 C3A1
⚈ latin1 83 = utf8 C692
⚈ latin1 A1 = utf8 C2A1
Notice how most of the letters look very similar: C3A1C692xxyy, when they should start with E183xx.
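The same analysis in Python. Note that cp1252 (not strict ISO latin-1) is needed here, because the middle pair C692 decodes to 'ƒ' (U+0192), which exists in cp1252 but not in ISO-8859-1 -- and MySQL's 'latin1' is in fact cp1252-like:

```python
# First doubly-encoded "character" from the table:
chunk = bytes.fromhex("C3A1C692C2A1")
s = chunk.decode("utf-8")          # 'áƒ¡' -- three cp1252 characters
fixed = s.encode("cp1252")         # back to the 3 bytes E1 83 A1
print(fixed.hex())                 # e183a1
print(fixed.decode("utf-8"))       # ს -- Georgian SAN (U+10E1)
```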
One pass over the data to convert each two bytes (4 hex) from utf8 to latin1:
SELECT
HEX(CONVERT(CONVERT(UNHEX(
'C3A1C692C2A1C3A1C692C290C3A1C692C2AEC3A1C692C5A1C3A1C692CB9C20C3A1C692C290C3A1C692C2A0C3A1C692CB9CC3A1C692C2A1'
) USING utf8) USING latin1))
⇒
E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1
Now to display as utf8:
SELECT UNHEX('E183A1E18390E183AEE1839AE1839820E18390E183A0E18398E183A1')
⇒
სახლი არის
(Ref: forum thread #418885.)
Another example of double-encoding: 'C398C2AF' should have been 'D8AF'
SELECT HEX(CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1));
⇒ D8AF, which is the utf8 for Arabic DAL.
SELECT CONVERT(CONVERT(UNHEX( 'C398C2AF') USING utf8) USING latin1);
⇒ that character.
(Ref: forum thread #565669.)
Another... from StackOverflow
SELECT hex('☼STICKY☼'); --> C3A2CB9CC2BC535449434B59C3A2CB9CC2BC
SELECT unhex(hex(convert(convert(unhex('C3A2CB9CC2BC535449434B59C3A2CB9CC2BC') using utf8) using latin1))) -->
☼STICKY☼ |
A blog on doubly-encoded text.
This even tackles the case of a column with both doubly-encoded and OK text.
2-step ALTER
The "2-step ALTER" applies when you wrote utf8 bytes with SET NAMES latin1 to a CHARACTER SET latin1 column. The symptom is often (but not always) that your 'é' is showing as 'Ã©'.
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
Similarly, TEXT -> BLOB -> TEXT.
If col is in any indexes, you might want to DROP INDEX in the first ALTER and ADD INDEX in the second. (This is for efficiency and possibly to avoid index limitations.)
Two examples follow:
Fixing utf8 bytes in latin1 column
(The "2-step ALTER", example 1)
CAVEAT! Test before using on Production!
# Set up bad case:
mysql> create table latin1 (c varchar(11) charset latin1);
mysql> insert into latin1 (c) values ('x'), (unhex('c2a3'));
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| Â£ | C2A3 | 2 | 2 |
+------+--------+-----------+----------------+
# Step 1 of 2 of fix:
mysql> alter table latin1 modify c tinyblob;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| Â£ | C2A3 | 2 | 2 |
+------+--------+-----------+----------------+
# Step 2 of 2 of fix:
mysql> alter table latin1 modify c varchar(11) charset utf8;
mysql> select c, hex(c), length(c), char_length(c) from latin1;
+------+--------+-----------+----------------+
| c | hex(c) | length(c) | char_length(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| £ | C2A3 | 2 | 1 |
+------+--------+-----------+----------------+
That is, ALTER to BLOB (of appropriate size), then ALTER to VARCHAR (of original size) and CHARSET utf8.
This technique should work for other cases of the right encoding sitting in a 5.1 table with the wrong CHARSET declaration.
One more note: Suggest you include the COLLATE clause on the final ALTER that converts to VARCHAR.
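The reason the 2-step ALTER works: the bytes in the table never change; only the label on them does. A Python sketch of the same two bytes under the two labels:

```python
# Same two bytes, two interpretations:
raw = bytes.fromhex("c2a3")
print(raw.decode("latin-1"))   # 'Â£' -- 2 chars under the wrong (latin1) label
print(raw.decode("utf-8"))     # '£'  -- 1 char once relabeled as utf8
```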
Fixing mismatch between CHARSET and data encoding
(The "2-step ALTER", example 2)
The problem: cp850 encoding was inserted into a latin1 field.
The following will build an example, then walk through two solutions. The solutions should work for any mis-coded field where bytes for one charset had been stuffed into a column with a different charset declaration (usually latin1). This particular example stems from Forum thread 616834
See also: cp850 encodings
latin1 and utf8 encodings
-- First, let's establish cp850 everywhere:
-- In Windows, do "chcp 850" before entering mysql.
-- Then get the mysql commandline to agree:
SET NAMES cp850; -- Now CodePage and the following match
SHOW VARIABLES LIKE 'char%'; -- note the 3 cp850 entries
Solution A: Change CHARACTER SET.
Given: cp850 bytes are stored in a latin1 field. Goal: Leave the bytes alone, but change the column to be CHARACTER SET cp850.
-- Build sample table
CREATE TABLE f616834A (
str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834A\G
-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ');
-- Expect D490D2D3DED6D7D8 (cp850 encoding)
-- Don't expect C8C9CACBCCCDCECF (latin1)
-- Don't expect C388C389C38AC38BC38CC38DC38EC38F (utf8)
SELECT UNHEX('D490D2D3DED6D7D8'); -- expect 'ÈÉÊËÌÍÎÏ'
INSERT INTO f616834A (str) VALUES (UNHEX('D490D2D3DED6D7D8')) ;
SELECT str, HEX(str) FROM f616834A;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8
Notice how it looks wrong but it is encoded correctly. This is because the server/client, in doing the SELECT, is transcoding from latin1 (which it isn't) to cp850. (Some combination of "chcp latin1" and "SET NAMES latin1" would prevent the gibberish, but we don't want to go that direction since it would only mask the problem.)
The "?" in the gibberish is not really a question mark, but rather the result of an unconvertable character.
The ALTER must be done in two steps:
ALTER TABLE f616834A MODIFY str VARBINARY(111);
ALTER TABLE f616834A MODIFY str VARCHAR(111) CHARACTER SET cp850;
SELECT str, HEX(str) FROM f616834A; -- Expect: ÈÉÊËÌÍÎÏ and D490D2D3DED6D7D8 (cp850)
The CHARACTER SET changed, but the data was not.
Solution B: Change Data Encoding
Given: cp850 bytes are stored in a latin1 field. Goal: Leave the column's CHARACTER SET alone, but change the bytes in the column.
SHOW VARIABLES LIKE 'char%'; -- cp850 and "chcp 850" must match at this point.
-- Build a test table:
CREATE TABLE f616834B (
str VARCHAR(111) CHARACTER SET latin1
);
SHOW CREATE TABLE f616834B\G
-- Populate with accented E's and I's:
SELECT @str := HEX('ÈÉÊËÌÍÎÏ'); -- Expect D490D2D3DED6D7D8 (cp850 encoding)
SELECT UNHEX('D490D2D3DED6D7D8'); -- Expect 'ÈÉÊËÌÍÎÏ'
INSERT INTO f616834B (str) VALUES (UNHEX('D490D2D3DED6D7D8')) ;
SELECT str, HEX(str) FROM f616834B;
-- Expect gibberish (Ô?ÒÓÞÖ×Ø) and D490D2D3DED6D7D8
UPDATE f616834B SET str = CONVERT(BINARY(str) USING cp850);
SELECT str, HEX(str) FROM f616834B; -- Expect ÈÉÊËÌÍÎÏ and C8C9CACBCCCDCECF (latin1)
(A side note: CONVERT(UNHEX(HEX(str)) USING cp850) would also work.)
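Solution B's CONVERT(BINARY(str) USING cp850) can be mimicked in Python, which ships a cp850 codec (illustration of the byte transformation only):

```python
# The cp850 bytes that were mis-stored in the latin1 column:
raw = bytes.fromhex("D490D2D3DED6D7D8")
s = raw.decode("cp850")            # 'ÈÉÊËÌÍÎÏ'
print(s)
print(s.encode("latin-1").hex())   # c8c9cacbcccdcecf -- proper latin1
```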
Fixing mix of charsets in a column
You let the users enter stuff in a UI form, and you stored it into a latin1 field. The stuff entered was in latin1, big5, utf8, symbols, etc.
You have a mess. Without deducing, row by row, what the intent was, you cannot fix it.
Fixing MicroSoft thingies
Perhaps best is to use the REPLACE() function.
⚈ replace()
Fixing "double encoding"
CAVEAT! Test before using on Production!
*** Plan A: Use UPDATE to fix in-place:
UPDATE Tbl SET col =
IFNULL(CONVERT(CONVERT(CONVERT(col USING latin1)
USING binary)
USING utf8),
col )
Caution: I have only briefly tested this.
The IFNULL has the advantage of leaving intact most cells that are not double-encoded. (Maybe -- need further testing)
How does it work?
1. The first (innermost) CONVERT takes the 4-byte double encoding and converts it to 2 bytes.
2. Converting to binary and then to something else is a way to change the type without changing the bytes.
3. The final CONVERT to utf8 prevents the assignment from further messing things up.
4. The IFNULL undoes the operation for any cell where the result was not valid utf8 (the CONVERT returned NULL).
*** Plan B: Use ALTER:
mysql> # Create sample:
mysql> CREATE TABLE dbl (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO dbl (c) VALUES ('x'), (UNHEX('C382C2A4'));
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x | 78 | 1 | 1 |
| Ã‚¤ | C382C2A4 | 4 | 4 |
+------+----------+-----------+----------------+
mysql> # Step 1 of 5:
Note: Steps 1 and 2 are needed only if you start with latin1;
If your column is already utf8, skip to step 3.
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x | 78 | 1 | 1 |
| A,A☼ | C382C2A4 | 4 | 4 |
+------+----------+-----------+----------------+
mysql> # Step 2 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+----------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+----------+-----------+----------------+
| x | 78 | 1 | 1 |
| A☼ | C382C2A4 | 4 | 2 |
+------+----------+-----------+----------------+
mysql> # Step 3 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET latin1;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| A☼ | C2A4 | 2 | 2 |
+------+--------+-----------+----------------+
mysql> # Step 4 of 5:
mysql> ALTER TABLE dbl MODIFY c TINYBLOB;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| A☼ | C2A4 | 2 | 2 |
+------+--------+-----------+----------------+
mysql> # Step 5 of 5:
mysql> ALTER TABLE dbl MODIFY c VARCHAR(11) CHARSET utf8;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM dbl;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| ☼ | C2A4 | 2 | 1 |
+------+--------+-----------+----------------+
mysql> # Check schema:
mysql> SHOW CREATE TABLE dbl \G
CREATE TABLE `dbl` (
`c` varchar(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
One more note: Suggest you include the COLLATE clause on the final ALTER that converts to VARCHAR.
*** Plan C: Using mysqldump
(This has had only limited testing.)
1. mysqldump --default-character-set=latin1 .... my_database > my_database_latin1.sql
2. Edit my_database_latin1.sql to say SET NAMES utf8 at the top.
3. mysql ... < my_database_latin1.sql
Thanks to Peter Barry
*** Plan D: A function
DELIMITER $$
CREATE FUNCTION maybe_utf8_decode(str text charset utf8)
RETURNS text CHARSET utf8 DETERMINISTIC
BEGIN
declare str_converted text charset utf8;
declare max_error_count int default @@max_error_count;
set @@max_error_count = 0;
set str_converted = convert(binary convert(str using latin1) using utf8);
set @@max_error_count = max_error_count;
if @@warning_count > 0 then
return str;
else
return str_converted;
end if;
END$$
DELIMITER ;
Then do this for each column in each table that is likely to have doubly-encoded text:
UPDATE tbl SET col = maybe_utf8_decode(col);
Thanks to Joni Salonen for Plan D (fixing doubly-UTF-8-encoded text in MySQL)
Fix definition and data, IF correctly encoded
You have a latin1 table containing latin1 encodings; you want utf8. (This is not the general thrust of this document, but it is included for completeness, and because the command looks so tempting.)
mysql> # Create sample:
mysql> CREATE TABLE trycvt (c VARCHAR(11) CHARSET latin1);
mysql> INSERT INTO trycvt (c) VALUES ('x'), (UNHEX('BB'));
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
`c` varchar(11) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| » | BB | 1 | 1 |
+------+--------+-----------+----------------+
mysql> # Step 1 of 1
mysql> ALTER TABLE trycvt CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
mysql> # Diagnose:
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) FROM trycvt;
+------+--------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------+--------+-----------+----------------+
| x | 78 | 1 | 1 |
| » | C2BB | 2 | 1 |
+------+--------+-----------+----------------+
mysql> # Note schema changed:
mysql> SHOW CREATE TABLE trycvt \G
CREATE TABLE `trycvt` (
`c` varchar(11) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Note: TEXT will become MEDIUMTEXT, etc. This is because transcoding latin1 to utf8 usually lengthens the data. You may want to follow with an ALTER TABLE ... MODIFY c TEXT ... to put it back.
Note about ALTER: Generally it is possible and faster to do all ALTERs of a table in a single ALTER;
Haidong Ji Blog
performanceblog
Testing an in-place fix
Some of the above fixes require live changes. Here is a way that might build confidence that it works:
⚈ find some rows with accents (etc) in them
⚈ create a table with the _same_ schema; copy the rows into it (INSERT ... SELECT)
⚈ Try one of the conversions on that table
⚈ Check the results -- LENGTH and CHAR_LENGTH should now be different, and should differ by the appropriate amount. Fetch the rows in your application; see if they look right.
Fixing while moving data with mysqldump
You are copying the data from, say, 4.x server to a new, say, 5.1 server. It may be easiest to convert the data as you copy. You can test the results. If necessary, you can adjust and reapply the conversion.
Overview:
mysqldump -h old_host --no-data >schema.sql
# now edit schema.sql to fix charset, collation, etc
# edit in/out SET NAMES
mysql -h new_host <schema.sql
mysqldump -h old_host --no-create-info --opt >data.sql
# edit in/out SET NAMES
mysql -h new_host <data.sql
Notes (and things that may vary with your situation)
⚈ May need --default-character-set=utf8
⚈ Other issues?
⚈ (aside) Do not dump the database information_schema
⚈ (aside) Do not dump and reload mysql if you are moving between versions; there are usually schema changes that will cause trouble.
Fixing while moving data with LOAD DATA
⚈ SELECT INTO OUTFILE may create something useful for LOAD DATA.
⚈ Perhaps use mysqldump suggestions for creating schema
⚈ LOAD DATA ... CHARACTER SET utf8 ...; (as of 5.0.38 / 5.1.17)
⚈ Sorry, no experience in this approach.
Conversion Gotchas
⚈ DROP and re-add FULLTEXT indexes around changes.
⚈ DROP and re-add all(?) FOREIGN KEY CONSTRAINTs; perhaps disabling will suffice
⚈ Collation changes could lead to duplicated keys
⚈ Collation changes could lead to broken VARCHAR FOREIGN KEY CONSTRAINTs
4-byte utf8
Inserting a 4-byte utf8 character into a column with CHARACTER SET utf8 will cause the string to be truncated starting with the 4-byte character.
MySQL versions 5.1 and before do not handle utf8 characters longer than 3 bytes. Version 5.5.3 (March, 2010) added a new CHARACTER SET, utf8mb4, for utf8 encodings up to 4 bytes.
⚈ Introducing utf8mb4
⚈ Upgrading, including sample ALTERs
⚈ 'f0 a2 88 a2' showing up as 'd848 de22'
⚈ Graphemica
⚈ Winking Face
utf8mb4 is necessary for a small number of Chinese characters.
The MySQL changelog for 5.5.3 says "The Unicode implementation has been extended to provide support for supplementary characters that lie outside the Basic Multilingual Plane (BMP)."
If your product is going to store characters from Hong Kong, utf8 is not a good option since many characters in HKSCS are 4-byte utf8 characters. utf8mb4 would be better.
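A quick way to see which characters need utf8mb4 (a Python sketch): any character whose UTF-8 encoding is 4 bytes long lies outside the BMP and cannot be stored in MySQL's 3-byte utf8.

```python
# Sample characters; the first three fit in MySQL's 3-byte utf8,
# the emoji does not:
for ch in ['é', '€', '中', '😁']:
    n = len(ch.encode('utf-8'))
    verdict = 'utf8mb4 required' if n == 4 else 'fits in utf8'
    print(f'U+{ord(ch):04X} {ch!r}: {n} bytes -> {verdict}')
```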
A good article covering a variety of related issues (such as SET NAMES, utf8mb4, collations, converting from utf8 to utf8mb4, index lengths, character-set-client-handshake, init-connect, and Java) is by Mathias Bynens
A lot of the meat is in the Comment section.
Much of this blog mentions utf8. Unless otherwise specified, utf8mb4 applies, too. Occasionally, "utf8" may refer to the standard, not specifically to MySQL's 3-byte subset.
(For example, the emoji 😁 is hex F09F9881 in utf8mb4.)
VARCHAR(191) -- Indexes are limited to 767 bytes, which is enough to hold 255 utf8 characters (at up to 3 bytes each). However, a utf8mb4 character can take 4 bytes, so the limit for indexing a utf8mb4 column is 191 characters. So, if you are picking 255 for no good reason, and 191 will do, use 191. To check the max length currently in a column:
SELECT MAX(CHAR_LENGTH(col)) FROM tbl;
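The 191 figure is just arithmetic over the 767-byte index-prefix limit and the worst-case bytes per character (a sketch; the constant name is mine):

```python
INDEX_BYTE_LIMIT = 767  # InnoDB index-prefix limit (older row formats)

def max_indexable_chars(bytes_per_char: int) -> int:
    # Worst case: every character uses the charset's maximum byte count.
    return INDEX_BYTE_LIMIT // bytes_per_char

print(max_indexable_chars(3))  # utf8 (3 bytes/char max) -> 255
print(max_indexable_chars(4))  # utf8mb4 (4 bytes/char max) -> 191
```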
How to upgrade utf8 to utf8mb4? Probably this will do:
ALTER TABLE tbl MODIFY COLUMN col ... CHARACTER SET utf8mb4 ...;
Emoji (such as 😀 and 😁) are all treated as equal under utf8mb4_unicode_ci, but are distinct with COLLATE utf8mb4_unicode_520_ci. Stackoverflow
If you currently have utf8 on all the columns in a table, and want to convert to utf8mb4,
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8mb4;
However, this may fail for index issues. If you have an indexed VARCHAR(255), that turns out to exceed the 767-byte limit. Change it to VARCHAR(191) if the column does not really need 255. If you really need more than 191, things get more complicated. StackExchange thread
Repeat the CONVERT on the other tables.
Functions to know about
Since you may be using some of these against the source (old) database, be cautious of whether the function exists.
String functions
⚈ CHAR_LENGTH(field) -- num chars in field
⚈ LENGTH(field) -- num bytes in field
⚈ CHAR(123 USING latin1)
⚈ CHARSET(expression)
⚈ UPPER(_binary'Abc') = 'Abc' -- a subtle surprise (other string functions, too)
⚈ ORD(char) = ASCII(char) = number -- for the _char_, not the _byte_
⚈ HEX(string) -- use to see actual encoding
⚈ UNHEX('AABBCC') -- does not exist before 4.1.2
Convert functions
⚈ CONVERT('...' USING utf8)
⚈ CONVERT(blob_col USING utf8) -- no "USING" before 4.1.16
⚈ HEX(CAST(_ucs 0x00a0 as CHAR CHARACTER SET utf8)) -- show that Unicode 00a0 is utf8 C2A0
Literal introducer
⚈ _utf8'...'
PHP
⚈ mysql: mysql_set_charset('utf8');
⚈ mysqli: $mysqli_obj->set_charset('utf8');
⚈ PDO: $db = new PDO('mysql:host=...;dbname=...;charset=utf8mb4', $user, $pwd); or by executing SET NAMES utf8mb4.
Note: the use of mysqli_set_charset / set_charset should obviate the need for SET NAMES.
If you change the collation of a field, ORDER BY cannot use any INDEX; hence it could be surprisingly inefficient.
BOM * Byte-Order-Mark
Wikipedia
says
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters  for this.
If, instead, you see this hex: C3AFC2BBC2BF, then you are also suffering from double encoding.
If you encounter this (presumably at the start of a file), it implies that your editor is adding it, but the reader of the file (eg, mysqldump) does not know what to do with it. Check the editor's settings, or consider using a different editor.
Error #1064 SQL syntax at line 1 * strange characters
more on BOM
If you see ÿþ (hex FFFE) at the beginning of a 'text' file, that is a UTF-16 Byte-Order-Mark.
Python issue with BOM
ERROR 1366 (HY000) at line ...: Incorrect string value: '\xEF\xBB\xBFC# BOM
open(filename, 'r', encoding='utf-8-sig')
German "sharp-s"
Version 5.1.24 introduced a change in the collation of the German "sharp-s". bug 27877
(Apr. 2007) Starting with 5.1.24, 's' = 'ß' for utf8_general_ci. 'ss' = 'ß' for utf8_unicode_ci, which is 'correct' for German.
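Python's case folding shows the same 'ß' ≈ 'ss' behavior that utf8_unicode_ci adopts (a sketch, not MySQL code):

```python
s = 'ß'
print(s.lower())     # ß  (lowercasing leaves it alone)
print(s.upper())     # SS (the uppercase of sharp-s is 'SS')
print(s.casefold())  # ss -- the folding that utf8_unicode_ci's 'ss' = 'ß' mirrors
print('straße'.casefold() == 'strasse'.casefold())  # True
```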
A nasty effect of this collation change occurs when a user upgrades past 5.1.24. If there were sharp-s characters in PRIMARY (or UNIQUE) KEYs, the reloading of the data may cause "duplicate key" errors.
To 'fix' the incompatibility, the collation utf8_general_mysql500_ci was added in MySQL 5.1.62 / 5.5.21 / 5.6.5. See ref page
Graphemica
More on what to do.
(Copied here:)
New utf8_general_mysql500_ci and ucs2_general_mysql500_ci collations have been added that preserve the behavior of utf8_general_ci and ucs2_general_ci from versions of MySQL previous to 5.1.24. CHECK TABLE produces this error:
Table upgrade required. Please do "REPAIR TABLE t" or dump/reload to fix it!
Unfortunately, REPAIR TABLE could not fix the problem. The new collations permit older tables created before MySQL 5.1.24 to be upgraded to current versions of MySQL.
To convert an affected table after a binary upgrade that leaves the table files in place, alter the table to use the new collation. Suppose that the table t1 contains one or more problematic utf8 columns. To convert the table at the table level, use a statement like this:
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;
To apply the change on a column-specific basis, use a statement like this (be sure to repeat the column definition as originally specified except for the COLLATE clause):
ALTER TABLE t1 MODIFY c1 CHAR(N) CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci;
To upgrade the table using a dump and reload procedure, dump the table using mysqldump, modify the CREATE TABLE statement in the dump file to use the new collation, and reload the table.
After making the appropriate changes, CHECK TABLE should report no error.
For more information, see "Checking Whether Tables or Indexes Must Be Rebuilt" and "Rebuilding or Repairing Tables or Indexes" in the manual. (Bug #43593, Bug #11752408)
Where does Ö sort?
⚈ utf8_estonian_ci: Between 'W' and 'X'
⚈ danish, icelandic, swedish: After 'Z'
⚈ utf8_german2_ci: As if the two letters 'oe'
⚈ hungarian and turkish: Between 'O' and 'P' (that is, after 'oz')
⚈ Other collations (including unicode, unicode_520, 0900): As if the letter 'O'
These apply to both utf8 and utf8mb4, MySQL 8.0 and before.
Programatically generated collation tester
Scandanavian collations
Among utf8_danish_ci, utf8_icelandic_ci, and utf8_swedish_ci (and utf8mb4 variant):
icelandic differs due to: AZ < Á < B, E < É, IZ < Í < J, UZ < Ü < V, YZ < Ý < Z. Also
danish: zz < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Aa=Å=å < Þ=þ
icelandic: zz < Þ=þ < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Å=å
swedish: zz < Å=å < Ä=Æ=ä=æ < Ö=Ø=ö=ø < Þ=þ
(This may not have all the differences.) Programatically generated collation tester
Circumflex
There are many versions of "circumflex":
unicode: 005E = (^) utf8: 5E = CIRCUMFLEX ACCENT = SPACING CIRCUMFLEX
unicode: 02C6 = (ˆ) utf8: CB86 = MODIFIER LETTER CIRCUMFLEX ACCENT = MODIFIER LETTER CIRCUMFLEX
unicode: 0302 = utf8: CC82 = COMBINING CIRCUMFLEX ACCENT = NON-SPACING CIRCUMFLEX
unicode: 032D = utf8: CCAD = COMBINING CIRCUMFLEX ACCENT BELOW = NON-SPACING CIRCUMFLEX BELOW
unicode: 1DCD = utf8: E1B78D = COMBINING DOUBLE CIRCUMFLEX ABOVE
unicode: A788 = (ꞈ) utf8: EA9E88 = MODIFIER LETTER LOW CIRCUMFLEX ACCENT
unicode: FF3E = (^) utf8: EFBCBE = FULLWIDTH CIRCUMFLEX ACCENT = FULLWIDTH SPACING CIRCUMFLEX
unicode: E005E = utf8: F3A0819E = TAG CIRCUMFLEX ACCENT
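The unicodedata module can confirm these codepoint/byte pairs (a sketch checking a few rows of the list above):

```python
import unicodedata

for cp in (0x005E, 0x02C6, 0x0302, 0xFF3E):
    ch = chr(cp)
    utf8_hex = ch.encode('utf-8').hex().upper()
    print(f'U+{cp:04X} utf8={utf8_hex} {unicodedata.name(ch)}')
```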
This debates what latin1's 0x88 (ˆ) maps to: forum discussion
Graphemica
Quotes
+----------+---------+------+----------------------------------------------------+
| utf8mb4 | Unicode | c | Name |
+----------+---------+------+----------------------------------------------------+
| 22 | 34 | " | QUOTATION MARK |
| C2AB | 171 | « | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
| C2BB | 187 | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
| E28098 | 8216 | ‘ | LEFT SINGLE QUOTATION MARK | 91
| E28099 | 8217 | ’ | RIGHT SINGLE QUOTATION MARK | 92
| E2809A | 8218 | ‚ | SINGLE LOW-9 QUOTATION MARK |
| E2809B | 8219 | ‛ | SINGLE HIGH-REVERSED-9 QUOTATION MARK |
| E2809C | 8220 | “ | LEFT DOUBLE QUOTATION MARK | 93
| E2809D | 8221 | ” | RIGHT DOUBLE QUOTATION MARK | 94
| E2809E | 8222 | „ | DOUBLE LOW-9 QUOTATION MARK |
| E2809F | 8223 | ‟ | DOUBLE HIGH-REVERSED-9 QUOTATION MARK |
| E280B9 | 8249 | ‹ | SINGLE LEFT-POINTING ANGLE QUOTATION MARK |
| E280BA | 8250 | › | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK |
| E28D98 | 9048 | ⍘ | APL FUNCTIONAL SYMBOL QUOTE UNDERBAR |
| E28D9E | 9054 | ⍞ | APL FUNCTIONAL SYMBOL QUOTE QUAD |
| E29D9B | 10075 | ❛ | HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT |
| E29D9C | 10076 | ❜ | HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT |
| E29D9D | 10077 | ❝ | HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT |
| E29D9E | 10078 | ❞ | HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT |
| E29DAE | 10094 | ❮ | HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT |
| E29DAF | 10095 | ❯ | HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT |
| E3809D | 12317 | 〝 | REVERSED DOUBLE PRIME QUOTATION MARK |
| E3809E | 12318 | 〞 | DOUBLE PRIME QUOTATION MARK |
| E3809F | 12319 | 〟 | LOW DOUBLE PRIME QUOTATION MARK |
| EA9084 | 41988 | ꐄ | YI SYLLABLE QUOT |
| EFBC82 | 65282 | " | FULLWIDTH QUOTATION MARK |
| F3A080A2 | 917538 | ó ¢ | TAG QUOTATION MARK |
+----------+---------+------+----------------------------------------------------+
⚈ 91..94 are the hex for latin1 if coming from Microsoft Word.
⚈ F3A080A2 requires utf8mb4; the rest work in both utf8 and utf8mb4 (graphic probably not rendered correctly)
When is ö not equal to ö?
First, let's discuss when 'ö' is not equal to 'o'. (I will focus on utf8/utf8mb4 here.) 'o' != 'ö' for utf8_bin collation and 7 others (danish, icelandic, swedish, estonian, turkish, hungarian, and german2); see utf8collations
For MySQL 8.0 and utf8mb4, there are 12: utf8mb4collations
How could 'ö' not be equal to 'ö'? When one is a single character (hex C3B6) and the other is an 'o' followed by a 'combining diaeresis' -- hex 6FCC88.
Further discussion: Searching with German Umlauts
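The two spellings of 'ö' are easy to produce in Python; NFC normalization maps the combining form onto the single codepoint (a sketch):

```python
import unicodedata

single   = '\u00F6'    # ö as one codepoint (utf8 C3B6)
combined = 'o\u0308'   # 'o' + COMBINING DIAERESIS (utf8 6FCC88)

print(single, combined)                        # both display as ö
print(single == combined)                      # False -- different codepoints
print(single.encode('utf-8').hex().upper())    # C3B6
print(combined.encode('utf-8').hex().upper())  # 6FCC88
print(unicodedata.normalize('NFC', combined) == single)  # True after normalizing
```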
"Combining" (non-spacing) characters
These can be used for constructing accented letters (etc) from the code plus the letter. But keep in mind that (at least) utf8_general_ci will not treat such 2-letter codes as equal to the apparently same 1-letter equivalent.
+----------+----------------------------------------------------+---------+
| hex_utf8 | name | Unicode |
+----------+----------------------------------------------------+---------+
| CC80 | COMBINING GRAVE ACCENT | U+0300 |
| CC81 | COMBINING ACUTE ACCENT | U+0301 |
| CC82 | COMBINING CIRCUMFLEX ACCENT | U+0302 |
| CC83 | COMBINING TILDE | U+0303 |
| CC84 | COMBINING MACRON | U+0304 |
| CC85 | COMBINING OVERLINE | U+0305 |
| CC86 | COMBINING BREVE | U+0306 |
| CC87 | COMBINING DOT ABOVE | U+0307 |
| CC88 | COMBINING DIAERESIS | U+0308 |
| CC89 | COMBINING HOOK ABOVE | U+0309 |
| CC8A | COMBINING RING ABOVE | U+030A |
| CC8B | COMBINING DOUBLE ACUTE ACCENT | U+030B |
| CC8C | COMBINING CARON | U+030C |
| CC8D | COMBINING VERTICAL LINE ABOVE | U+030D |
| CC8E | COMBINING DOUBLE VERTICAL LINE ABOVE | U+030E |
| CC8F | COMBINING DOUBLE GRAVE ACCENT | U+030F |
| CC90 | COMBINING CANDRABINDU | U+0310 |
| CC91 | COMBINING INVERTED BREVE | U+0311 |
| CC92 | COMBINING TURNED COMMA ABOVE | U+0312 |
| CC93 | COMBINING COMMA ABOVE | U+0313 |
| CC94 | COMBINING REVERSED COMMA ABOVE | U+0314 |
| CC95 | COMBINING COMMA ABOVE RIGHT | U+0315 |
| CC96 | COMBINING GRAVE ACCENT BELOW | U+0316 |
| CC97 | COMBINING ACUTE ACCENT BELOW | U+0317 |
| CC98 | COMBINING LEFT TACK BELOW | U+0318 |
| CC99 | COMBINING RIGHT TACK BELOW | U+0319 |
| CC9A | COMBINING LEFT ANGLE ABOVE | U+031A |
| CC9B | COMBINING HORN | U+031B |
| CC9C | COMBINING LEFT HALF RING BELOW | U+031C |
| CC9D | COMBINING UP TACK BELOW | U+031D |
| CC9E | COMBINING DOWN TACK BELOW | U+031E |
| CC9F | COMBINING PLUS SIGN BELOW | U+031F |
| CCA0 | COMBINING MINUS SIGN BELOW | U+0320 |
| CCA1 | COMBINING PALATALIZED HOOK BELOW | U+0321 |
| CCA2 | COMBINING RETROFLEX HOOK BELOW | U+0322 |
| CCA3 | COMBINING DOT BELOW | U+0323 |
| CCA4 | COMBINING DIAERESIS BELOW | U+0324 |
| CCA5 | COMBINING RING BELOW | U+0325 |
| CCA6 | COMBINING COMMA BELOW | U+0326 |
| CCA7 | COMBINING CEDILLA | U+0327 |
| CCA8 | COMBINING OGONEK | U+0328 |
| CCA9 | COMBINING VERTICAL LINE BELOW | U+0329 |
| CCAA | COMBINING BRIDGE BELOW | U+032A |
| CCAB | COMBINING INVERTED DOUBLE ARCH BELOW | U+032B |
| CCAC | COMBINING CARON BELOW | U+032C |
| CCAD | COMBINING CIRCUMFLEX ACCENT BELOW | U+032D |
| CCAE | COMBINING BREVE BELOW | U+032E |
| CCAF | COMBINING INVERTED BREVE BELOW | U+032F |
| CCB0 | COMBINING TILDE BELOW | U+0330 |
| CCB1 | COMBINING MACRON BELOW | U+0331 |
| CCB2 | COMBINING LOW LINE | U+0332 |
| CCB3 | COMBINING DOUBLE LOW LINE | U+0333 |
| CCB4 | COMBINING TILDE OVERLAY | U+0334 |
| CCB5 | COMBINING SHORT STROKE OVERLAY | U+0335 |
| CCB6 | COMBINING LONG STROKE OVERLAY | U+0336 |
| CCB7 | COMBINING SHORT SOLIDUS OVERLAY | U+0337 |
| CCB8 | COMBINING LONG SOLIDUS OVERLAY | U+0338 |
| CCB9 | COMBINING RIGHT HALF RING BELOW | U+0339 |
| CCBA | COMBINING INVERTED BRIDGE BELOW | U+033A |
| CCBB | COMBINING SQUARE BELOW | U+033B |
| CCBC | COMBINING SEAGULL BELOW | U+033C |
| CCBD | COMBINING X ABOVE | U+033D |
| CCBE | COMBINING VERTICAL TILDE | U+033E |
| CCBF | COMBINING DOUBLE OVERLINE | U+033F |
| CD80 | COMBINING GRAVE TONE MARK | U+0340 |
| CD81 | COMBINING ACUTE TONE MARK | U+0341 |
| CD84 | COMBINING GREEK DIALYTIKA TONOS | U+0344 |
| CD85 | COMBINING GREEK YPOGEGRAMMENI | U+0345 |
| D283 | COMBINING CYRILLIC TITLO | U+0483 |
| D284 | COMBINING CYRILLIC PALATALIZATION | U+0484 |
| D285 | COMBINING CYRILLIC DASIA PNEUMATA | U+0485 |
| D286 | COMBINING CYRILLIC PSILI PNEUMATA | U+0486 |
| E28390 | COMBINING LEFT HARPOON ABOVE | U+20D0 |
| E28391 | COMBINING RIGHT HARPOON ABOVE | U+20D1 |
| E28392 | COMBINING LONG VERTICAL LINE OVERLAY | U+20D2 |
| E28393 | COMBINING SHORT VERTICAL LINE OVERLAY | U+20D3 |
| E28394 | COMBINING ANTICLOCKWISE ARROW ABOVE | U+20D4 |
| E28395 | COMBINING CLOCKWISE ARROW ABOVE | U+20D5 |
| E28396 | COMBINING LEFT ARROW ABOVE | U+20D6 |
| E28397 | COMBINING RIGHT ARROW ABOVE | U+20D7 |
| E28398 | COMBINING RING OVERLAY | U+20D8 |
| E28399 | COMBINING CLOCKWISE RING OVERLAY | U+20D9 |
| E2839A | COMBINING ANTICLOCKWISE RING OVERLAY | U+20DA |
| E2839B | COMBINING THREE DOTS ABOVE | U+20DB |
| E2839C | COMBINING FOUR DOTS ABOVE | U+20DC |
| E283A1 | COMBINING LEFT RIGHT ARROW ABOVE | U+20E1 |
| E38299 | COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK | U+3099 |
| E3829A | COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK | U+309A |
+----------+----------------------------------------------------+---------+
88 rows in set (0.01 sec)
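Rows like the ones above can be regenerated from unicodedata (a sketch over the first few codepoints):

```python
import unicodedata

# Print the utf8 hex, name, and codepoint for a few combining characters:
for cp in range(0x0300, 0x0308):
    ch = chr(cp)
    utf8_hex = ch.encode('utf-8').hex().upper()
    print(f'| {utf8_hex} | {unicodedata.name(ch):<30} | U+{cp:04X} |')
```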
Cautions
If you have FOREIGN KEY CONSTRAINTs on fields you are 'fixing', you will probably mess up the constraints. It might work to disable FK constraints, fix both tables, then re-enable them.
If you have FOREIGN KEY CONSTRAINTs on fields other than the ones you are 'fixing', it is probably best to disable/reenable them.
If your table is huge, and/or you have lots of indexes, the ALTERs can be very time consuming. It may be better to disable/drop the indexes during the 'fix', then reapply them at the end. Do not drop an AUTO_INCREMENT index. In the case of InnoDB, do not drop the PRIMARY KEY.
Introspection
To see what charsets and collations exist in your version of MySQL:
SHOW CHARSET; -- only latin1 and utf8/utf8mb4 are likely to be interesting
SHOW COLLATION; -- list the possible values.
SHOW COLLATION LIKE 'utf8%';
SHOW COLLATION LIKE '%_cs';
SHOW SESSION VARIABLES LIKE 'character_set%';
SHOW GLOBAL VARIABLES LIKE 'character_set%';
SHOW CREATE TABLE tbl \G
Collation name suffixes
⚈ _ci -- case insensitive; this includes ignoring accents
⚈ _cs -- case sensitive (and accent sensitive)
⚈ _bin -- just compare the bits
my.cnf
The following lines in your my.cnf initialize various settings in various contexts. (This is untested; note the SUPER-user caveat on init-connect below.)
[client]
default-character-set = cp1252
[mysqld]
skip-character-set-client-handshake
# (deprecated in mysqld) default-character-set = cp1252
character-set-server = cp1252
collation-server = latin1_general_cs
init-connect = SET NAMES cp1252
# (init-connect is _not_ seen by SUPER users, such as "root".)
[mysqldump]
default-character-set = cp1252
[mysql]
default-character-set = cp1252
Caution: Setting default-character-set that way for the mysql client cannot be overridden by SET NAMES.
In my.ini, in the [mysqld] group:
skip-character-set-client-handshake
collation_server = utf8_unicode_ci
character_set_server = utf8
This may speed up connections, and avoid the need for SET NAMES.
Stored Procedures/Functions, Triggers
ALTER DATABASE dbname CHARACTER SET ... can impact already compiled Stored Procedures. Recompile them.
There may be other charset issues. In particular, a routine seems to have a charset, but that declaration seems to have no effect.
JSON
If your JSON strings are showing things like "\uff62\u30c9\u30e9\u30b4\u30f3\u30b3\u30a4", you have "escaped unicode". (Those are unicode characters for "「ドラゴンコ".)
If you would prefer utf8, then this should work in PHP (since version 5.4):
$t = json_encode($s, JSON_UNESCAPED_UNICODE);
Before PHP 5.4:
function json_encode_unicode($input) {
    return preg_replace_callback(
        '/\\\\u([0-9a-zA-Z]{4})/',
        function ($matches) {
            return mb_convert_encoding(pack('H*', $matches[1]), 'UTF-8', 'UTF-16');
        },
        json_encode($input)
    );
}
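Python's json module has the same knob: ensure_ascii (default True) produces \uXXXX escapes, and turning it off emits raw UTF-8, analogous to PHP's JSON_UNESCAPED_UNICODE (a sketch using the sample string from above):

```python
import json

s = '\uff62\u30c9\u30e9\u30b4\u30f3\u30b3\u30a4'  # the escaped-unicode example above
print(json.dumps(s))                      # "\uff62\u30c9\u30e9\u30b4\u30f3\u30b3\u30a4"
print(json.dumps(s, ensure_ascii=False))  # the raw characters, quoted
```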
utf8 Collations
Which to pick? Why?
⚈ utf8_bin -- just looks at bits; no case folding, etc
⚈ utf8_general_ci -- case and accent folding.
⚈ utf8_unicode_ci -- Perhaps best. Diphthongs, combining accents, etc are handled
⚈ utf8_..._ci -- If your application is entirely (or mostly) in one language, this may be better for you. See the chart below.
Notes on utf8_general_ci vs utf8_unicode_ci
⚈ utf8_general_ci is a very simple collation. All it does is remove all accents, then convert to upper case, and compare the results
⚈ utf8_unicode_ci supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss"; letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
⚈ utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.
⚈ utf8_unicode_ci is generally more accurate for all scripts.
⚈ For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci, however, is fine only for Russian and Bulgarian; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian do not sort well.
⚈ The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
⚈ from StackOverflow
utf8 Collations Examples
What pairs of things will be treated as equal? What order does a collation impose? How do I override a column's collation? (Caution: doing so usually prevents efficient use of an INDEX.) Examples below show altering an ORDER BY clause and an "=" operator.
Here is test code for the results below:
DROP TABLE IF EXISTS charutf8;
-- Create and populate a table with a "utf8_bin" column:
CREATE TABLE `charutf8` (
`a` varchar(9) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO charutf8 (a) VALUES
('abc'), ('def'), ('ABC'), ('ábc'), ('́abc'), ('Abç'), ('ss'), ('ß');
-- Dump of characters, and lengths:
SELECT a, CHAR_LENGTH(a), LENGTH(a), HEX(a)
FROM charutf8
ORDER BY a;
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci
Unicode canonical equivalence
-- Show that the ordering varies with collation:
SELECT GROUP_CONCAT(a ORDER BY a) AS 'bin order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_unicode_ci) AS 'unicode_ci order' FROM charutf8;
-- Self-JOIN to see what values are the same; collation-dependent:
SELECT CHAR_LENGTH(x.a) AS clen, LENGTH(x.a), x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_bin'
FROM charutf8 x
JOIN charutf8 y ON x.a = y.a
GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_general_ci'
FROM charutf8 x
JOIN charutf8 y ON x.a = y.a COLLATE utf8_general_ci
GROUP BY x.a;
SELECT x.a, GROUP_CONCAT(y.a) AS 'Equal with utf8_unicode_ci'
FROM charutf8 x
JOIN charutf8 y ON x.a = y.a COLLATE utf8_unicode_ci
GROUP BY x.a;
Results and conclusions:
-- Dump of characters, and lengths:
+-------+----------------+-----------+------------+
| a | CHAR_LENGTH(a) | LENGTH(a) | HEX(a) |
+-------+----------------+-----------+------------+
| ABC | 3 | 3 | 414243 |
| Abç | 3 | 4 | 4162C3A7 |
| abc | 3 | 3 | 616263 |
| def | 3 | 3 | 646566 |
| ss | 2 | 2 | 7373 |
| ß | 1 | 2 | C39F |
| ábc | 3 | 4 | C3A16263 |
| ́abc | 4 | 5 | CC81616263 | -- the misalignment is due to the output not handling "combining" chars correctly
+-------+----------------+-----------+------------+
8 rows in set (0.00 sec)
-- CC81 is "U+0301: COMBINING ACUTE ACCENT" which,
-- together with 'a' is equal to 'á' in utf8_unicode_ci
-- Show that the ordering varies with collation:
+-----------------------------------+
| bin order |
+-----------------------------------+
| ABC,Abç,abc,def,ss,ß,ábc,́abc | -- Notice how accents follow all letters
+-----------------------------------+
mysql> SELECT GROUP_CONCAT(a ORDER BY a COLLATE utf8_general_ci) AS 'general_ci order' FROM charutf8;
+-----------------------------------+
| general_ci order |
+-----------------------------------+
| ábc,abc,Abç,ABC,def,ß,ss,́abc | -- Regular accents mixed in; the combining accent is not
+-----------------------------------+
+-----------------------------------+
| unicode_ci order |
+-----------------------------------+
| ́abc,ábc,abc,Abç,ABC,def,ß,ss | -- Combining accent sorted with its mates
+-----------------------------------+
-- Self-JOIN to see what values are the same; collation-dependent:
+-------+---------------------+
| a | Equal with utf8_bin |
+-------+---------------------+
| ABC | ABC | -- exact match for case and accents
| Abç | Abç |
| abc | abc |
| def | def |
| ss | ss |
| ß | ß | -- ss and ß are NOT equal
| ábc | ábc |
| ́abc | ́abc |
+-------+---------------------+
+-------+----------------------------+
| a | Equal with utf8_general_ci |
+-------+----------------------------+
| ABC | ábc,ABC,Abç,abc | -- case folding and accent stripping
| Abç | ábc,ABC,Abç,abc |
| abc | ábc,ABC,Abç,abc |
| def | def |
| ss | ss | -- ss and ß are NOT equal
| ß | ß |
| ábc | ABC,Abç,abc,ábc |
| ́abc | ́abc | -- 'combining' NOT equal
+-------+----------------------------+
+-------+----------------------------+
| a | Equal with utf8_unicode_ci |
+-------+----------------------------+
| ABC | ábc,ABC,́abc,Abç,abc | -- case folding and accent stripping
| Abç | abc,ábc,ABC,́abc,Abç |
| abc | Abç,abc,ábc,ABC,́abc |
| def | def |
| ss | ss,ß | -- ss and ß are EQUAL
| ß | ß,ss |
| ábc | ábc,ABC,́abc,Abç,abc |
| ́abc | abc,ábc,ABC,́abc,Abç | -- 'combining' EQUAL
+-------+----------------------------+
Changing the collation in a SELECT
Suppose you have
city VARCHAR(66) COLLATE utf8_bin
and you need to find "San José" when you do
WHERE city = "San Jose"
This applies a different collation for the comparison:
WHERE city COLLATE utf8_general_ci = 'San Jose'
Caveat: An INDEX(city) cannot be used when you override the collation, so the latter WHERE will be slower.
ERROR 1253 (42000): COLLATION '...' is not valid for CHARACTER SET '...'
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = 'rené' collate utf8_general_ci;
Empty set (0.00 sec)
Alternatively, you can explicitly set the character set using a 'character set introducer':
mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from User where username = _utf8'rené' collate utf8_general_ci;
Empty set (0.00 sec)
StackOverflow reference
utf8 Collation Variations:
See also a more complete table for utf8 and utf8mb4 in MySQL 8.0.
Here's a comparison of the collation of the alphabet characters for the utf8 collations you might consider.
⚈ spanish2 puts ch after cz and ll after lz and Ñ,ñ after nz
⚈ utf8_general_ci treats sharp-s as 's'; the rest treat it as 'ss'.
⚈ Ligature (Æ, etc) collation varies.
⚈ Eth (Ð) is treated like a letter between D and E.
⚈ utf8_estonian_ci does some major shuffling (Z between S and T, etc).
⚈ Danish sorts Ä=Æ=ä=æ Ö=Ø=ö=ø Aa=Å=å Þ=þ after z.
utf8 : utf8_general_ci A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az
utf8 : utf8_general_mysql500_ci A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az
utf8 : utf8_danish_ci A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą ae az
utf8 : utf8_swedish_ci A=a=ª=À=Á=Â=Ã=à=á=â=ã=Ā=ā=Ą=ą Aa ae az
utf8 : utf8_estonian_ci A=a=ª=À=Á=Â=Ã=Å=à=á=â=ã=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_lithuanian_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_latvian_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_unicode_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_polish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā Aa ae az Ą=ą Æ=æ
utf8 : utf8_spanish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_spanish2_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_turkish_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8 : utf8_general_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz
utf8 : utf8_general_mysql500_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz
utf8 : utf8_danish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_swedish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_estonian_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_lithuanian_ci B=b C=c=ch=Ç=ç cz Č=č D=d dz Ð=ð
utf8 : utf8_latvian_ci B=b C=c=Ç=ç ch cz Č=č D=d dz Ð=ð
utf8 : utf8_unicode_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_polish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_spanish_ci B=b C=c=Ç=ç=Č=č ch cz D=d dz Ð=ð
utf8 : utf8_spanish2_ci B=b C=c=Ç=ç=Č=č cz ch D=d dz Ð=ð
utf8 : utf8_turkish_ci B=b C=c=Č=č ch cz Ç=ç D=d dz Ð=ð
utf8 : utf8_general_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_general_mysql500_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_danish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_swedish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_estonian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_lithuanian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_latvian_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_unicode_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_polish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ě=ě Ę=ę
utf8 : utf8_spanish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_spanish2_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_turkish_ci E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě
utf8 : utf8_general_ci F=f fz G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_general_mysql500_ci F=f fz G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_danish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_swedish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_estonian_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_lithuanian_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_latvian_ci F=f fz ƒ G=g=ğ Ģ=ģ H=h hz
utf8 : utf8_unicode_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_polish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_spanish_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_spanish2_ci F=f fz ƒ G=g=ğ=Ģ=ģ H=h hz
utf8 : utf8_turkish_ci F=f fz ƒ G=g=Ģ=ģ H=h hz
utf8 : utf8_general_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı ij iz
utf8 : utf8_general_mysql500_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ=ı ij iz
utf8 : utf8_danish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_swedish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_estonian_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_lithuanian_ci I=Y=i=y=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz=yz
utf8 : utf8_latvian_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz Y=y yz
utf8 : utf8_unicode_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_polish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_spanish_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_spanish2_ci I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_turkish_ci I=ı i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz
utf8 : utf8_general_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj ll lz M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_general_mysql500_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj ll lz M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_danish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_swedish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_estonian_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_lithuanian_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_latvian_ci J=j K=k Ķ=ķ L=l lj=LJ=Lj=lj ll lz Ļ=ļ Ł=ł M=m N=n=Ñ=ñ nz Ņ=ņ
utf8 : utf8_unicode_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_polish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_spanish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ņ=ņ nz Ñ=ñ
utf8 : utf8_spanish2_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj lz ll Ł=ł M=m N=n=Ņ=ņ nz Ñ=ñ
utf8 : utf8_turkish_ci J=j K=k=Ķ=ķ L=l=Ļ=ļ lj=LJ=Lj=lj ll lz Ł=ł M=m N=n=Ñ=ñ=Ņ=ņ nz
utf8 : utf8_general_ci O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe oz
utf8 : utf8_general_mysql500_ci O=o=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe oz
utf8 : utf8_danish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz
utf8 : utf8_swedish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz
utf8 : utf8_estonian_ci O=o=º=Ò=Ó=Ô=ò=ó=ô oe=Œ=œ oz Ø=ø
utf8 : utf8_lithuanian_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_latvian_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_unicode_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_polish_ci O=o=º=Ò=Ô=Õ=Ö=ò=ô=õ=ö oe=Œ=œ oz Ó=ó Ø=ø
utf8 : utf8_spanish_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_spanish2_ci O=o=º=Ò=Ó=Ô=Õ=Ö=ò=ó=ô=õ=ö oe=Œ=œ oz Ø=ø
utf8 : utf8_turkish_ci O=o=º=Ò=Ó=Ô=Õ=ò=ó=ô=õ oe=Œ=œ oz Ö=ö Ø=ø
utf8 : utf8_general_ci P=p Q=q R=r S=s=ß=ş=Š=Š=š=š sh ss sz
utf8 : utf8_general_mysql500_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss sz
utf8 : utf8_danish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_swedish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_estonian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š Z=z zh zz Ž=Ž=ž=ž
utf8 : utf8_lithuanian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š
utf8 : utf8_latvian_ci P=p Q=q R=r S=s=ş sh ss=ß sz Š=Š=š=š
utf8 : utf8_unicode_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_polish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_spanish_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_spanish2_ci P=p Q=q R=r S=s=ş=Š=Š=š=š sh ss=ß sz
utf8 : utf8_turkish_ci P=p Q=q R=r S=s=Š=Š=š=š sh ss=ß sz
utf8 : utf8_general_ci T=t TM=tm tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_general_mysql500_ci T=t TM=tm tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_danish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz
utf8 : utf8_swedish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz
utf8 : utf8_estonian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz
utf8 : utf8_lithuanian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_latvian_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_unicode_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_polish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_spanish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_spanish2_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ų=ų ue uz
utf8 : utf8_turkish_ci T=t TM=tm=™ tz U=u=Ù=Ú=Û=ù=ú=û=Ū=ū=Ų=ų ue uz Ü=ü
utf8 : utf8_general_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_general_mysql500_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_danish_ci V=v W=w X=x Y=y=Ü=Ý=ü=ý=ÿ=Ÿ yz
utf8 : utf8_swedish_ci V=v W=w X=x Y=y=Ü=Ý=ü=ý=ÿ=Ÿ yz
utf8 : utf8_estonian_ci V=v W=w Õ=õ Ä=ä Ö=ö Ü=ü X=x Y=y=Ý=ý=ÿ=Ÿ yz Þ=þ
utf8 : utf8_lithuanian_ci V=v W=w X=x Ý=ý=ÿ=Ÿ
utf8 : utf8_latvian_ci V=v W=w X=x Ý=ý=ÿ=Ÿ
utf8 : utf8_unicode_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_polish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_spanish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_spanish2_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_turkish_ci V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz
utf8 : utf8_general_ci Z=z=Ž=Ž=ž=ž zh zz Æ=æ Ð=ð Ø=ø Þ=þ Ł=ł Œ=œ ƒ LJ=Lj=lj
utf8 : utf8_general_mysql500_ci Z=z=Ž=Ž=ž=ž zh zz Æ=æ Ð=ð Ø=ø Þ=þ ß Ł=ł Œ=œ ƒ LJ=Lj=lj
utf8 : utf8_danish_ci Z=z=Ž=Ž=ž=ž zh zz Ä=Æ=ä=æ Ö=Ø=ö=ø Aa=Å=å Þ=þ
utf8 : utf8_swedish_ci Z=z=Ž=Ž=ž=ž zh zz Å=å Ä=Æ=ä=æ Ö=Ø=ö=ø Þ=þ
utf8 : utf8_estonian_ci
utf8 : utf8_lithuanian_ci Z=z zh zz Ž=Ž=ž=ž Þ=þ
utf8 : utf8_latvian_ci Z=z zh zz Ž=Ž=ž=ž Þ=þ
utf8 : utf8_unicode_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_polish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_spanish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_spanish2_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8 : utf8_turkish_ci Z=z=Ž=Ž=ž=ž zh zz Þ=þ
utf8mb4_% mostly works the same as utf8.
Unfortunately, it is not easy to separate case folding from accent stripping. Here is what someone has created: utf8_distinct_ci
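Python's string methods draw the same distinction for sharp-s that separates utf8_general_ci from utf8_unicode_ci: simple lowercasing leaves ß alone, while full case folding expands it to "ss". A small illustration (pure Python, no MySQL involved):

```python
# Simple folding (like utf8_general_ci): ß stays ß.
print('ß'.lower())     # ß
# Full case folding (like utf8_unicode_ci): ß expands to ss,
# which is why 'ss' = 'ß' under utf8_unicode_ci.
print('ß'.casefold())  # ss
```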
8-bit Encodings
The columns are decimal, latin1 hex, utf8 hex, char, htmlentity.
128 80 E282AC € €
129 81 C281 
130 82 E2809A ‚ ‚
131 83 C692 ƒ ƒ
132 84 E2809E „ „
133 85 E280A6 … …
134 86 E280A0 † †
135 87 E280A1 ‡ ‡
136 88 CB86 ˆ ˆ
137 89 E280B0 ‰ ‰
138 8A C5A0 Š Š
139 8B E280B9 ‹ ‹
140 8C C592 Œ Œ
141 8D C28D 
142 8E C5BD Ž Ž
143 8F C28F 
144 90 C290 
145 91 E28098 ‘ ‘
146 92 E28099 ’ ’
147 93 E2809C “ “
148 94 E2809D ” ”
149 95 E280A2 • •
150 96 E28093 – –
151 97 E28094 — —
152 98 CB9C ˜ ˜
153 99 E284A2 ™ ™
154 9A C5A1 š š
155 9B E280BA › ›
156 9C C593 œ œ
157 9D C29D 
158 9E C5BE ž ž
159 9F C5B8 Ÿ Ÿ
160 A0 C2A0
161 A1 C2A1 ¡ ¡
162 A2 C2A2 ¢ ¢
163 A3 C2A3 £ £
164 A4 C2A4 ¤ ¤
165 A5 C2A5 ¥ ¥
166 A6 C2A6 ¦ ¦
167 A7 C2A7 § §
168 A8 C2A8 ¨ ¨
169 A9 C2A9 © ©
170 AA C2AA ª ª
171 AB C2AB « «
172 AC C2AC ¬ ¬
173 AD C2AD ­
174 AE C2AE ® ®
175 AF C2AF ¯ ¯
176 B0 C2B0 ° °
177 B1 C2B1 ± ±
178 B2 C2B2 ² ²
179 B3 C2B3 ³ ³
180 B4 C2B4 ´ ´
181 B5 C2B5 µ µ
182 B6 C2B6 ¶ ¶
183 B7 C2B7 · ·
184 B8 C2B8 ¸ ¸
185 B9 C2B9 ¹ ¹
186 BA C2BA º º
187 BB C2BB » »
188 BC C2BC ¼ ¼
189 BD C2BD ½ ½
190 BE C2BE ¾ ¾
191 BF C2BF ¿ ¿
192 C0 C380 À À
193 C1 C381 Á Á
194 C2 C382 Â Â
195 C3 C383 Ã Ã
196 C4 C384 Ä Ä
197 C5 C385 Å Å
198 C6 C386 Æ Æ
199 C7 C387 Ç Ç
200 C8 C388 È È
201 C9 C389 É É
202 CA C38A Ê Ê
203 CB C38B Ë Ë
204 CC C38C Ì Ì
205 CD C38D Í Í
206 CE C38E Î Î
207 CF C38F Ï Ï
208 D0 C390 Ð Ð
209 D1 C391 Ñ Ñ
210 D2 C392 Ò Ò
211 D3 C393 Ó Ó
212 D4 C394 Ô Ô
213 D5 C395 Õ Õ
214 D6 C396 Ö Ö
215 D7 C397 × ×
216 D8 C398 Ø Ø
217 D9 C399 Ù Ù
218 DA C39A Ú Ú
219 DB C39B Û Û
220 DC C39C Ü Ü
221 DD C39D Ý Ý
222 DE C39E Þ Þ
223 DF C39F ß ß
224 E0 C3A0 à à
225 E1 C3A1 á á
226 E2 C3A2 â â
227 E3 C3A3 ã ã
228 E4 C3A4 ä ä
229 E5 C3A5 å å
230 E6 C3A6 æ æ
231 E7 C3A7 ç ç
232 E8 C3A8 è è
233 E9 C3A9 é é
234 EA C3AA ê ê
235 EB C3AB ë ë
236 EC C3AC ì ì
237 ED C3AD í í
238 EE C3AE î î
239 EF C3AF ï ï
240 F0 C3B0 ð ð
241 F1 C3B1 ñ ñ
242 F2 C3B2 ò ò
243 F3 C3B3 ó ó
244 F4 C3B4 ô ô
245 F5 C3B5 õ õ
246 F6 C3B6 ö ö
247 F7 C3B7 ÷ ÷
248 F8 C3B8 ø ø
249 F9 C3B9 ù ù
250 FA C3BA ú ú
251 FB C3BB û û
252 FC C3BC ü ü
253 FD C3BD ý ý
254 FE C3BE þ þ
255 FF C3BF ÿ ÿ
The 'missing' characters in the above table:
⚈ 129
⚈ 141 reverse line feed
⚈ 143
⚈ 144
⚈ 157
⚈ 173 Soft Hyphen -- shows only when a word breaks across lines
Misinterpreting the utf8 bytes as latin1. The last column is what you may be seeing.
128 80 E282AC € €
129 81 C281 Â
130 82 E2809A ‚ ‚
131 83 C692 ƒ Æ’
132 84 E2809E „ „
133 85 E280A6 … …
134 86 E280A0 † â€
135 87 E280A1 ‡ ‡
136 88 CB86 ˆ ˆ
137 89 E280B0 ‰ ‰
138 8A C5A0 Š Å
139 8B E280B9 ‹ ‹
140 8C C592 Œ Å’
141 8D C28D Â
142 8E C5BD Ž Ž
143 8F C28F Â
144 90 C290 Â
145 91 E28098 ‘ ‘
146 92 E28099 ’ ’
147 93 E2809C “ “
148 94 E2809D ” â€
149 95 E280A2 • •
150 96 E28093 – –
151 97 E28094 — —
152 98 CB9C ˜ Ëœ
153 99 E284A2 ™ â„¢
154 9A C5A1 š Å¡
155 9B E280BA › ›
156 9C C593 œ Å“
157 9D C29D Â
158 9E C5BE ž ž
159 9F C5B8 Ÿ Ÿ
160 A0 C2A0 Â
161 A1 C2A1 ¡ ¡
162 A2 C2A2 ¢ ¢
163 A3 C2A3 £ £
164 A4 C2A4 ¤ ¤
165 A5 C2A5 ¥ Â¥
166 A6 C2A6 ¦ ¦
167 A7 C2A7 § §
168 A8 C2A8 ¨ ¨
169 A9 C2A9 © ©
170 AA C2AA ª ª
171 AB C2AB « «
172 AC C2AC ¬ ¬
173 AD C2AD Â
174 AE C2AE ® ®
175 AF C2AF ¯ ¯
176 B0 C2B0 ° °
177 B1 C2B1 ± ±
178 B2 C2B2 ² ²
179 B3 C2B3 ³ ³
180 B4 C2B4 ´ ´
181 B5 C2B5 µ µ
182 B6 C2B6 ¶ ¶
183 B7 C2B7 · ·
184 B8 C2B8 ¸ ¸
185 B9 C2B9 ¹ ¹
186 BA C2BA º º
187 BB C2BB » »
188 BC C2BC ¼ ¼
189 BD C2BD ½ ½
190 BE C2BE ¾ ¾
191 BF C2BF ¿ ¿
192 C0 C380 À À
193 C1 C381 Á Ã
194 C2 C382  Â
195 C3 C383 à Ã
196 C4 C384 Ä Ã„
197 C5 C385 Å Ã…
198 C6 C386 Æ Ã†
199 C7 C387 Ç Ã‡
200 C8 C388 È Ãˆ
201 C9 C389 É Ã‰
202 CA C38A Ê ÃŠ
203 CB C38B Ë Ã‹
204 CC C38C Ì ÃŒ
205 CD C38D Í Ã
206 CE C38E Î ÃŽ
207 CF C38F Ï Ã
208 D0 C390 Ð Ã
209 D1 C391 Ñ Ã‘
210 D2 C392 Ò Ã’
211 D3 C393 Ó Ã“
212 D4 C394 Ô Ã”
213 D5 C395 Õ Ã•
214 D6 C396 Ö Ã–
215 D7 C397 × Ã—
216 D8 C398 Ø Ã˜
217 D9 C399 ٠Ù
218 DA C39A Ú Ãš
219 DB C39B Û Ã›
220 DC C39C Ü Ãœ
221 DD C39D Ý Ã
222 DE C39E Þ Ãž
223 DF C39F ß ÃŸ
224 E0 C3A0 à Ã
225 E1 C3A1 á á
226 E2 C3A2 â â
227 E3 C3A3 ã ã
228 E4 C3A4 ä ä
229 E5 C3A5 å Ã¥
230 E6 C3A6 æ æ
231 E7 C3A7 ç ç
232 E8 C3A8 è è
233 E9 C3A9 é é
234 EA C3AA ê ê
235 EB C3AB ë ë
236 EC C3AC ì ì
237 ED C3AD í Ã
238 EE C3AE î î
239 EF C3AF ï ï
240 F0 C3B0 ð ð
241 F1 C3B1 ñ ñ
242 F2 C3B2 ò ò
243 F3 C3B3 ó ó
244 F4 C3B4 ô ô
245 F5 C3B5 õ õ
246 F6 C3B6 ö ö
247 F7 C3B7 ÷ ÷
248 F8 C3B8 ø ø
249 F9 C3B9 ù ù
250 FA C3BA ú ú
251 FB C3BB û û
252 FC C3BC ü ü
253 FD C3BD ý ý
254 FE C3BE þ þ
255 FF C3BF ÿ ÿ
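The last column of the table above can be reproduced in Python by encoding a character as UTF-8 and then misreading those bytes as latin1; doing it twice gives "double encoding". A quick illustration (the exact rendering of the control-range bytes varies by terminal):

```python
s = 'é'                                    # utf8 bytes C3 A9
mojibake = s.encode('utf-8').decode('latin-1')
print(mojibake)                            # Ã© -- as in the table above
# "Double encoding": treat the mojibake as real text and store it as utf8,
# then misread it as latin1 again:
double = mojibake.encode('utf-8').decode('latin-1')
print(double)
```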
More references
Various discussions on the MySQL forums:
Please note that any PHP code using the mysql_* API should be replaced by the mysqli_* API; the former is deprecated.
PHP/html/mysql usage
Migrating from latin1 to utf-8
Convert database from utf8_general_ci to uft8_unicode_ci
Latin1 to unicode character convert -- data are display with ?????
Problem with accents
help in relating tables and normailsation
Select statement with town Polish characters
connection collation
cannot change character set
joined query problem
Slow retreieve under MyIsam
UTF-8 problem
trying to move a database from mysql 4.0.18 to 5.0.77
Server just hanging - Copying to tmp table
Encoding problem...
Unicode collation, utf8 data fetch/manip problem
What storage requirement has FLOAT(24) ? Doc is ambigous
what is the step to convert a db from latin1 to UTF8
Special characters and database
Require suggestion for changing character set in Master master replication setup
Character encodings issues with php, mysql, apache
load data character set question
Storing UTF-8 strings in MySQL Database
utf8 inputs store as html entities, how to retrievesubstr of utf8 inputs?
Is STR_TO_DATE function doesn't work with UCS2 charset?
Miscellany:
Converting through BLOB
Wikipedia shows 0xD5 0xF5 iso upper/lower
shows utf8 c590 / c591
details several representations for 0150
discussion about how charsets are used
MySQL 5.5 & 4-byte utf8
MSDN discussion of Unicode
unicode.org -- The definitive word, but heavy going
mysqldump issues
Example of double-encoding and mysqldump
Windows, Unicode, LOAD DATA, little-endian, etc
Charbase - A visual unicode database
Adding a custom collation
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Unicode.scarfboy
UTF8 all the way through
latin1 ~= cp1252 ~= ISO 8859-1
latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d. 8859 and some comparisons
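The cp1252-versus-ISO-8859-1 difference is easy to see in Python, whose codecs implement both (note that Python's cp1252 codec rejects the five undefined code points, whereas MySQL maps them to the corresponding control characters):

```python
# 0x80 is the Euro sign in cp1252 (MySQL's "latin1") ...
print(b'\x80'.decode('cp1252'))    # €
# ... but in strict ISO 8859-1 the 0x80-0x9F range is control characters:
print(repr(b'\x80'.decode('latin-1')))   # '\x80'
# 0x81 is one of the five code points cp1252 leaves undefined;
# Python's codec raises, while MySQL quietly maps it to U+0081:
try:
    b'\x81'.decode('cp1252')
except UnicodeDecodeError:
    print('0x81 is undefined in cp1252')
```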
Combining Diacriticals
Some characters can be encoded either as a single character or as a letter plus a "combining" accent. For example, ä can be represented in utf8 as either c3a4 or 61cc88. The single-'character' form ä is c3a4. The other encoding is 61, the hex code for 'a', plus cc88, the utf-8 code for the 'combining diaeresis' (umlaut, double dot, etc). Wikipedia: Combining Diacritical Marks
If the collation is utf8_general_ci, the two encodings are treated as unequal. With utf8_unicode_ci, they are equal. Searching database for strings with Swedish characters
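The two encodings of ä described above, and the normalization that folds them together, can be demonstrated with Python's unicodedata module (an illustration of the encodings; the MySQL collations themselves do the equivalent comparison server-side):

```python
import unicodedata

precomposed = '\u00e4'    # ä as a single code point
combining = 'a\u0308'     # 'a' + COMBINING DIAERESIS
print(precomposed.encode('utf-8').hex())   # c3a4
print(combining.encode('utf-8').hex())     # 61cc88
print(precomposed == combining)            # False: different code points
# NFC normalization composes the pair into the single character,
# much as utf8_unicode_ci treats the two forms as equal:
print(unicodedata.normalize('NFC', combining) == precomposed)  # True
```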
Programmatic Collation
Suppose you need a collation that no existing collation can provide. Create two columns instead of one; let's call them exact and canonical. exact contains the correctly spelled (caps, accents, etc) value, and it has a suitable collation. canonical is a programmatically-altered version of exact. This may include
⚈ stripping accents
⚈ folding to lowercase
⚈ removing redundant whitespace (blanks, newlines, tabs)
⚈ removing punctuation
⚈ removing markup (such as HTML tags)
⚈ it could even include providing alternate spellings or synonyms
canonical would also have an index, possibly FULLTEXT and/or utf8mb4_bin. canonical may have duplicates even though exact probably does not.
Then, when the 'user' searches, apply the same folding to the search string before searching canonical. Display from exact.
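A minimal sketch of such a canonicalizing function in Python (the particular folding steps, and the name canonical, are illustrative; pick the steps your application needs):

```python
import re
import unicodedata

def canonical(exact: str) -> str:
    """Fold 'exact' into its canonical form: strip accents,
    fold case, drop punctuation, squeeze whitespace."""
    s = unicodedata.normalize('NFD', exact)              # split off combining accents
    s = ''.join(c for c in s if not unicodedata.combining(c))
    s = s.casefold()                                     # aggressive lowercasing
    s = re.sub(r'[^\w\s]', '', s)                        # remove punctuation
    return ' '.join(s.split())                           # squeeze whitespace

print(canonical('  San José,  Calif.'))   # san jose calif
```

The same function is applied to the user's search string, so "san jose" finds the row whose exact column holds "San José".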
LOAD DATA INFILE
In LOAD DATA INFILE, the manual points out:
"The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies 'no conversion'."
Example:
LOAD DATA LOCAL INFILE '...'
INTO TABLE tbl
CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '\\'
IGNORE 1 ROWS;
Characters in Western Europe
Taken from Wikipedia
å, ä and ö in Finnish and Swedish
à, ç, è, é, ï, í, ò, ó, ú, ü in Catalan
å, æ and ø in Norwegian and Danish
á, é, ó, ý, è, ë, ï in Dutch
ä, ö, ü and ß in German
á, ð, í, ó, ú, ý, æ and ø in Faroese
á, ð, é, í, ó, ú, ý, þ, æ and ö in Icelandic
à, â, ç, è, é, ë, ê, ï, î, ö, ô, ù, û, ÿ, æ, œ in French
à, è, é, ì, ò, ù in Italian
á, é, í, ñ, ó, ú, ï, ü, ¡, ¿ in Spanish
à, á, â, ã, ç, é, ê, í, ó, ô, õ, ú in Portuguese (ü no longer used)
á, é, í, ó, ú in Irish
£ in British English
Turkish
The columns: utf8, unicode codepoint decimal, hex, HTML entity, 'name':
C387 0199 00C7 Ç latin capital letter c with cedilla
C396 0214 00D6 Ö latin capital letter o with diaeresis
C39C 0220 00DC Ü latin capital letter u with diaeresis
C3A7 0231 00E7 ç latin small letter c with cedilla
C3B6 0246 00F6 ö latin small letter o with diaeresis
C3BC 0252 00FC ü latin small letter u with diaeresis
C49E 0286 011E Ğ latin capital letter g with breve
C49F 0287 011F ğ latin small letter g with breve
C4B0 0304 0130 İ latin capital letter i with dot above
C4B1 0305 0131 ı latin small letter dotless i
C59E 0350 015E Ş latin capital letter s with cedilla
C59F 0351 015F ş latin small letter s with cedilla
E282A4 8356 20A4 ₤ lira sign
E282BA 8378 20BA ₺ Turkish lira sign
Don't expect latin1 to work well:
CREATE TABLE `se96265` (
`content` mediumtext NOT NULL,
FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
SET NAMES utf8;
-- Turkish word for Croatia, second char is \xC4\xB1
INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
SELECT content, HEX(content), HEX('Hırvatistan') FROM se96265\G
*************************** 1. row ***************************
content: H?rvatistan
HEX(content): 483F72766174697374616E
HEX('Hırvatistan'): 48C4B172766174697374616E
Note how the 'C4B1' turned into '?'
To get an error:
mysql> SET SESSION sql_mode = 'STRICT_ALL_TABLES'; -- So the INSERT will cause an error:
mysql> INSERT INTO `se96265` (`content`) VALUES ('Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
(Note: This technique for getting an error message is not specific to Turkish.)
ISO/IEC 8859-9 = latin5 = Turkish
If you have Turkish text in a latin1 column/table, you should convert to latin5 in order to correctly handle the 6 characters that are different (latin1: ÐÝÞðýþ --> latin5 ĞİŞğış). See Stackoverflow
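The latin1-versus-latin5 problem is easy to reproduce in Python (iso8859_9 is Python's name for latin5): latin1 simply has no dotless ı, while latin5 reassigns six latin1 code points to the Turkish letters.

```python
word = 'Hırvatistan'            # second char is U+0131, dotless i
try:
    word.encode('latin-1')      # latin1 cannot represent ı
except UnicodeEncodeError:
    print('ı does not exist in latin1')
print(word.encode('iso8859_9').hex())  # 48fd... : ı lands on 0xFD (latin1's ý)
print(word.encode('utf-8').hex())      # 48c4b1... : matches the HEX() shown above
```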
Arabic/Farsi
Arabic characters are 2 bytes in utf8: Dxyy. For example, Arabic Letter 'HAH' is D8AD.
To get diacritic insensitive comparisons, use utf8_unicode_ci:
SELECT 'بِسْمِ' = 'بسم' COLLATE utf8_unicode_ci;
The result I got back was 1 (true), meaning they are considered equal. With utf8_general_ci it came back with 0, meaning not equal. (Of course, you should declare your column(s) as utf8_unicode_ci rather than using the COLLATE clause.)
Stored Farsi, got '???'
Urdu case folding
Hebrew
To search for a particular character in a string, LIKE or REGEXP will work.
To search for "a cell being entirely Hebrew" is not easy. Essentially, you look for the utf8 encoding, then build a REGEXP to accommodate it. For example, to see which rows have a Hebrew name (in utf8):
WHERE HEX(name) REGEXP '^(D7..)+$'
See forum discussion
This assumes (not quite perfectly) that all Hebrew encodings are D7xx. This can be adapted to other languages if the encoding is not too complex. Chinese, for example, is complex:
WHERE HEX(name) REGEXP '^(E.....)+$'
⚈ Chinese characters are 3 bytes (6 hex) long
⚈ This fails to handle the 4-byte Chinese characters
⚈ This includes Japanese, Cherokee, etc.
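The HEX()-plus-REGEXP idea can be checked outside MySQL; here is the same all-Hebrew test in Python, run against the string's utf8 hex (hex_utf8 is a hypothetical helper, shown only to mirror HEX()):

```python
import re

def hex_utf8(s: str) -> str:
    """Mirror of MySQL's HEX(): uppercase hex of the utf8 bytes."""
    return s.encode('utf-8').hex().upper()

# Same pattern as  WHERE HEX(name) REGEXP '^(D7..)+$'
hebrew = re.compile(r'^(D7..)+$')
print(bool(hebrew.match(hex_utf8('שלום'))))    # True: every char encodes as D7xx
print(bool(hebrew.match(hex_utf8('shalom'))))  # False: plain ascii
```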
If you have the "hebrew" encoding in a latin1 column, you may need a 3-step ALTER. See stackoverflow
Storing Hebrew strings Ok, but gibberish when pulling data on another server
Cyrillic, Macedonian, etc
(Random notes)
Cyrillic in utf8 is encoded D0yy-D4yy and EA99yy-EA9Ayy, but perhaps only D0yy is in common use.
For Cyrillic: utf8_unicode_ci is fine for these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci, however, is fine only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian do not sort well.
Ќ is hex D08C in utf8 or utf8mb4. Cyrillic is completely covered by either CHARACTER SET. К is D09A. Hmmm, this is quite interesting:
mysql> SELECT 'К' = 'Ќ' COLLATE utf8_bin AS bin,
'К' = 'Ќ' COLLATE utf8_general_ci AS general,
'К' = 'Ќ' COLLATE utf8_unicode_ci AS unicode;
+-----+---------+---------+
| bin | general | unicode |
+-----+---------+---------+
| 0 | 1 | 0 |
+-----+---------+---------+
mysql> SELECT 'Г' = 'Ѓ' COLLATE utf8_bin AS bin,
'Г' = 'Ѓ' COLLATE utf8_general_ci AS general,
'Г' = 'Ѓ' COLLATE utf8_unicode_ci AS unicode;
+-----+---------+---------+
| bin | general | unicode |
+-----+---------+---------+
| 0 | 1 | 0 |
+-----+---------+---------+
And if it helps, here's the method from the alphabet helper (a Ruby snippet quoted from the forum):
def alphabet
  cyrillic_alphabet = ["А", "Б", "В", "Г", "Д", "Ѓ", "Е", "Ж", "З", "Ѕ", "И", "Ј", "К", "Л", "Њ", "М", "Н", "Љ", "О", "П", "Р", "С", "Т", "Ќ", "У", "Ф", "Х", "Ц", "Ч", "Џ", "Ш"]
end
These are letters in the Macedonian alphabet, not on the Ubuntu Cyrillic keyboard.
⚈ D086 1030=x0406 [І] L CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
⚈ D087 1031=x0407 [Ї] L CYRILLIC CAPITAL LETTER YI
⚈ D08B 1035=x040B [Ћ] L CYRILLIC CAPITAL LETTER TSHE
⚈ D08E 1038=x040E [Ў] L CYRILLIC CAPITAL LETTER SHORT U
The bit patterns are different, so utf8_bin collates them as unequal. But usually, whenever general_ci says two strings are equal, so does unicode_ci. Cyrillic/Macedonian
Cyrillic example:
title: Добро пожаловать в Диксон,
hex: D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182D18C20D0B220D094D0B8D0BAD181D0BED0BD,
length: 47,
char length: 25
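The byte-length versus character-length figures above check out; here is the same example verified in Python (LENGTH() counts bytes, CHAR_LENGTH() counts characters, since each Cyrillic letter is 2 bytes in utf8):

```python
title = 'Добро пожаловать в Диксон'
print(len(title))                  # 25 characters (CHAR_LENGTH)
print(len(title.encode('utf-8')))  # 47 bytes (LENGTH): 22 Cyrillic x 2 + 3 spaces
hexstr = ('D094D0BED0B1D180D0BE20D0BFD0BED0B6D0B0D0BBD0BED0B2D0B0D182'
          'D18C20D0B220D094D0B8D0BAD181D0BED0BD')
print(bytes.fromhex(hexstr).decode('utf-8') == title)  # True
```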
Data entry of Cyrillic in Workbench yields "????"
Other Natural Languages
This is a list of hits about natural languages. Most have not been verified.
Lithuanian ąĄ čČ ęĘ ėĖ įĮ ųŲ ūŪ are saved as ?, however, šŠ žŽ do get saved. The reason: Column declared latin1
Vietnamese collations were added in MySQL 5.6.
em dash, dagger, and other special characters
mysql> SELECT ';' semicolon, UNHEX('cdbe') 'Greek question mark';
+-----------+---------------------+
| semicolon | Greek question mark |
+-----------+---------------------+
| ; | ; |
+-----------+---------------------+
cp1250 - Central Europe
For a Devanagari character range:
SELECT 'क़ज़फ़' REGEXP '^[क़-फ़]+$'; --> true -- But that is mere coincidence!
Stackoverflow thread
UTF8 Chinese String Comparison
REGEXP / RLIKE
Do not attempt to use most of the features of REGEXP to test for non-ascii characters; it will generally not work in utf8. MySQL's REGEXP only works with bytes, not utf8 characters.
Greek alpha (α) is hex CEB1; omega (ω) is hex CF89
⚈ Character sets (classes) don't work right: [αω] checks for 4 bytes: CE, B1, CF, 89, which has very little to do with checking for those two letters.
⚈ Ranges also don't work right: [α-ω] checks for byte CE, the range B1-CF, and byte 89.
⚈ . matches one byte, not one character.
⚈ etc.
A few tricks will work:
⚈ WHERE HEX(str) REGEXP '(..)*CEB1CF89' -- will check for 'αω' in the column; but you may as well do WHERE str LIKE '%αω%'.
⚈ WHERE HEX(str) REGEXP '(..)*F' -- will check for whether the UTF-8 string needs utf8mb4. (But how did you get str stored in the first place??)
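The byte-wise failure, and the HEX() workaround, can both be reproduced in Python by compiling the pattern over bytes instead of str (hex_re is a hypothetical helper mirroring HEX(str) REGEXP, with an explicit ^ anchor so the match stays character-aligned):

```python
import re

# A character class compiled over bytes degenerates into single bytes:
pat = re.compile('[αω]'.encode('utf-8'))  # really the byte class [\xCE\xB1\xCF\x89]
# ν encodes as CE BD; its CE byte is in the class, so this "matches":
print(bool(pat.search('ν'.encode('utf-8'))))   # True, though ν is neither α nor ω

def hex_re(s: str, pattern: str) -> bool:
    """Mirror of  HEX(str) REGEXP pattern  on the utf8 hex of s."""
    return re.search(pattern, s.encode('utf-8').hex().upper()) is not None

print(hex_re('χαίρετε αω', '^(..)*CEB1CF89'))  # True: contains 'αω'
print(hex_re('χαίρετε', '^(..)*CEB1CF89'))     # False
```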
Python
1st or 2nd line in source code (to have literals in the code utf8-encoded): # -*- coding: utf-8 -*-
Python code for dumping hex (etc) for string 'u':
import unicodedata
for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), unicodedata.name(c))
Miscellany notes on coding for utf8:
⚈ db = MySQLdb.connect(host=DB_HOST, user=DB_USER, passwd=DB_PASS, db=DB_NAME, charset="utf8", use_unicode=True)
⚈ conn = MySQLdb.connect(host="localhost", user='root', password='', db='', charset='utf8')
⚈ config = 'OPTIONS': { ..., 'charset': 'utf8mb4', ... } -- in "settings.py"?
⚈ cursor.execute("SET NAMES utf8mb4;") -- not as good as using `charset'
⚈ db.set_character_set('utf8'), implies use_unicode=True
⚈ Literals should be u'...'
⚈ MySQL-python 1.2.4 fixes a bug wherein varchar(255) CHARACTER SET utf8 COLLATE utf8_bin is treated like a BLOB.
Checklist:
⚈ # -*- coding: utf-8 -*- -- (for literals in code) (This tells the interpreter which encoding to use when parsing the file; it affects only files that contain non-ascii literals. It does not affect the code in any other way.)
⚈ charset='utf8' in connect() call -- Is that buried in bottle_mysql.Plugin? (Note: Try 'utf-8' and 'utf8')
⚈ Text encoded in utf8.
⚈ No need for encode() or decode() if you are willing to accept utf8 everywhere.
⚈ u'...' for literals
⚈ <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> near start of html page
⚈ <meta charset="utf-8" /> is a shorter version (if using only HTML5).
⚈ Content-Type: text/html; charset=UTF-8 (in HTTP response header)
⚈ header('Content-Type: text/html; charset=UTF-8'); (in PHP to get that response header)
⚈ CHARACTER SET utf8 COLLATE utf8_general_ci on column (or table) definition in MySQL.
⚈ utf8 all the way through
⚈ Use MySQL Connector/Python instead of pyodbc and MySQL Connector/ODBC
Side note: Python defaults to autocommit=off; this can be a rude surprise when your INSERTs vanish.
Showing hex: var.encode().hex()
References:
⚈ https://docs.python.org/2/howto/unicode ... icode-type
⚈ https://stackoverflow.com/questions/915 ... ding-mysql
⚈ https://dev.mysql.com/doc/connector-pyt ... targs.html
⚈ https://stackoverflow.com/questions/340 ... e-bytecode
How to store arabic text in mysql database using python?
The Python language environment officially only uses UCS-2 internally since version 2.0, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Since Python 2.2, "wide" builds of Unicode are supported which use UTF-32 instead;[16] these are primarily used on Linux. Python 3.3 no longer ever uses UTF-16, instead strings are stored in one of ASCII/Latin-1, UCS-2, or UTF-32, depending on which code points are in the string, with a UTF-8 version also included so that repeated conversions to UTF-8 are fast.
Python check for utf8 (versus utf8mb4)
The python package "MySQL-python" version needs to be at least 1.2.5 in order to handle utf8mb4.
import sys
reload(sys)  # Python 2 only: reload() re-exposes setdefaultencoding()
sys.setdefaultencoding('UTF8')  # a last-resort hack; fixing the real encoding is better
Python 2.7 issues; improvements in Python 3
json.dumps(mydict, ensure_ascii=False) avoids "\u...." strings.
If you have \u00ef type text, consider this for fixing it: codecs.decode(body, 'unicode-escape')
utf-8-sig
From codecs — Codec registry and base classes
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
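The behavior described above is easy to verify: utf-8-sig prepends the BOM bytes EF BB BF on encoding and strips them on decoding, while plain utf-8 decoding leaves the BOM in the string as U+FEFF.

```python
data = 'Ö sorts'.encode('utf-8-sig')
print(data[:3].hex())              # efbbbf -- the utf8-encoded BOM
print(data.decode('utf-8-sig'))    # Ö sorts (BOM stripped)
# Plain utf-8 decoding keeps the BOM as a leading U+FEFF character:
print(data.decode('utf-8')[0] == '\ufeff')   # True
```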
PHP
⚈ If using mysqli, do $mysqli_obj->set_charset('utf8mb4');
⚈ If using PDO, do something like $db = new PDO('mysql:host=host;dbname=db;charset=utf8mb4', $user, $pwd);
⚈ Alternatively, execute SET NAMES utf8mb4
Any of these will say that the bytes in the client are UTF-8 encoded. Conversion, if necessary, will occur between the client and the database if the column definition is something other than utf8mb4.
Be sure to have php-mbstring installed (see php.ini)
Random notes about PHP. (See also Best Practice, above.)
In php.ini: default_charset = "UTF-8" (the default changed to UTF-8 in PHP 5.6).
To see what PHP has:
$hex = unpack('H*', $text);
echo implode('', $hex);
or simply echo bin2hex($text);
To get "U+" (unicode) notation, first fetch the data, then use PHP code trim(str_replace('\u', ' U+', json_encode('كتب عربية')), '" '); This will result in U+0643 U+062a U+0628 U+0639 U+0631 U+0628 U+064a U+0629.
PHP: Convert \unnnn to utf8:
$myr = '\u101c\u1031\u1011'; // Given this
$x = json_decode('"'.$myr.'"'); // Add quotes and decode
// $x is now a string, like လေထ
If you use header(), do: header('Content-type: text/plain; charset=utf-8');
PHP: undo_magic_quotes_gpc($_POST);
BjyProfiler needs charset=utf8; it apparently overrides other PHP settings.
If using mysqli, character_set_name() (formerly spelled client_encoding()) does NOT mirror SET NAMES.
$doc->loadHTML(''. $content);
Other Computer Languages
This is a list of hints on how to establish utf8 for various computer languages. Most have not been verified.
ADO.NET "charset=utf8"
AJAX
encodeURIComponent() may be useful.
Apache servers are (were?) configured to serve files in ISO-8859-1 by default, so you need to add the following line to your .htaccess file:
AddDefaultCharset UTF-8
ASP, ASP.Net
<%Response.charset="utf-8"%>
Mathias Bynens
Various tips
including:
Apache Config (/etc/httpd/conf/httpd.conf) :
AddDefaultCharset UTF-8
Apache .htaccess file :
AddCharset UTF-8 .htm
AddCharset UTF-8 .html
AddCharset UTF-8 .php
Connector/C
mysql_set_character_set('utf8mb4');
C#: C# code to undo (while reading?) double-encoded strings
Similar code in SQL:
CONVERT(UNHEX(HEX(CONVERT(CONVERT(col USING utf8) USING latin1))), CHAR)
Put this in the connection string:
id=my_user;password=my_password;database=some_db123;charset=utf8;
Lao and C#
Console.OutputEncoding = System.Text.Encoding.UTF8;
Connector/Net
Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;CharSet=utf8;
Note! Use the lower-case value utf8, not upper-case UTF8; upper case will fail.
codeigniter:
application/config/config.php - Note the dash :
$config['charset'] = 'UTF-8';
application/config/database.php :
$db['default'] = array(
...
'char_set' => 'utf8mb4',
'dbcollat' => 'utf8mb4_unicode_520_ci',
Note: Those are aimed at MySQL. dbcollat is used only for PHP's mysqli.
httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs, HTTP.UTF_8));
echo meta('Content-type', 'text/html; charset=utf-8', 'equiv');
In database.php, set char_set to utf8mb4 and dbcollat to utf8mb4_unicode_ci or utf8mb4_general_ci. Codeigniter 3.0 notes
ColdFusion
<cfprocessingdirective pageEncoding="utf-8">
<cffile
action="read"
file="#settings.csvfile#"
variable="autodata"
charset="utf-8">
Django
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.mysql',
...
'OPTIONS': {
'charset': 'utf8mb4',
'use_unicode': True, },
},
}
my.cnf:
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
[client]
default-character-set=utf8mb4
StackOverflow
client_encoding: 'UTF8' Django utf8 usage
The Django container (we use docker) was missing the LANG env var. Setting LANG to en_US.UTF-8 and restarting the container fixed the problem.
See also the Python notes, above.
Drupal, WordPress, and any other product that uses PHP: Turn off mbstring.encoding_translation. Probably everything to do with mbstring should be avoided.
Amend your settings.php database connection by adding:
$databases['default']['default'] = array(
'driver' => 'mysql',
'database' => 'databasename',
'username' => 'username',
'password' => 'password',
'host' => 'localhost',
'charset' => 'utf8mb4',
'collation' => 'utf8mb4_general_ci',
);
More discussion of utf8mb4
go-mysql
[username[:password]@][protocol[(address)]]/dbname?collation=utf8mb4_unicode_ci https://github.com/go-sql-driver/mysql#charset
Grails Grails and utf8mb4
Remove the "characterEncoding=UTF" part in the dataSource url. Otherwise grails always forces MySQL to use UTF-8 instead of utf8mb4. E.g.
dataSource:
dbCreate: "update"
url: "jdbc:mysql://localhost:8889/mydbname?useUnicode=true&zeroDateTimeBehavior=convertToNull&autoReconnect=true"
Heidi
When importing, don't use "auto-detect", explicitly request "UTF-8" if appropriate. Heidi import
Java
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0. ucs2 versus utf16
⚈ What if you put a 5.1 slave off a 4.x master? (The German sharp-s could cause trouble.)
⚈ Replication type (SBR vs RBR)?
Add ?useUnicode=yes&characterEncoding=UTF-8 to the JDBC URL (or maybe it is =true and =utf8)
<%@ page language="java" pageEncoding="UTF-8"%>
<%@ page contentType="text/html;charset=UTF-8" %>
compileJava.options.encoding = 'UTF-8'
<form method="post" action="/your/url/" accept-charset="UTF-8">
To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.
utf8mb4 needs special handling in Connector/J
"For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3."
Jetty
jetty
Java Servlet
resource.setContentType ("text/html;charset=utf-8");
JSP
<%@ page contentType="text/html; charset=UTF-8" %>
Laravel
If you have trouble with the 191-character (767-byte) index limit in Laravel, you have these choices:
⚈ 255 -> 191 (but make sure you don't currently have longer addresses)
⚈ utf8mb4 -> utf8 (Caution: This disallows Emoji and some Chinese characters) See config/database.php
⚈ INDEX(email(20)) (If it is UNIQUE or PRIMARY KEY, do not pick this option)
⚈ It is possible to reconfigure (if after 5.6.3) the server and table to allow bigger indexes; this is not the 'default' until 5.7.7.
Stackoverflow
In the file config/database.php: 'mysql' => [..., 'charset' => 'utf8mb4', 'collation' => 'utf8mb4_unicode_ci', ...]
If you would like to log the MySQL of the query you can use this: \DB::enableQueryLog(); before the query, and this: logger(\DB::getQueryLog()); afterwards. This will save the query to your laravel log file.
Navicat
Old versions of Navicat have an option for utf8, but not for utf8mb4. So, if you need utf8mb4, in the encoding options for navicat connections pick "use mysql encoding", not "utf8". Forum
mysql commandline tool: mysql --default-character-set=utf8mb4 and (or)
[mysql]
default-character-set = utf8mb4
StackOverflow
.NET
Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;CharSet=utf8;
node.js var connection = mysql.createConnection({ ... , charset : 'utf8mb4'}); stackoverflow
utf8 package
OrmLite ConnectionReset=False;
Powershell ISE and Powershell Console: $OutputEncoding = New-Object -typename System.Text.UTF8Encoding
library(RODBC)
imp <- odbcConnect("SERVER", uid = "***", pwd="***",DBMSencoding="utf8")
ODBC 3.5.1 probably won't handle utf8; either of these may be correct:
ODBC;DRIVER={MySQL ODBC 5.1 Driver};UID=...;PWD=...;PORT=3306;DATABASE=...;SERVER=127.0.0.1;CHARSET=UTF8;
strConnection="driver={MySQL ODBC 5.1 Driver}; server=...;uid=...;pwd=...;database=...;stmt=SET NAMES utf8;"
Unicode is not supported in ADO applications (including ASP) with the Connector/ODBC 3.51 driver. Please consider testing MySQL Connector/ODBC 5.1 which includes Unicode support for ASP.
Panda:
Stackoverflow on Panda
See also notes on Python.
df.to_csv('file.csv',encoding='utf-8-sig')
Panda Release 0.17.1 (Nov 13, 2015) fixes an issue with non-ascii characters in column (etc) names.
Perl:
use utf8;
use open ':std', ':encoding(UTF-8)';
my $dbh = DBI->connect("dbi:mysql:".$dsn, $user, $password, {
PrintError => 0,
RaiseError => 1,
mysql_enable_utf8 => 1, # Switch to UTF-8 for communication and decode.
});
# or {mysql_enable_utf8mb4 => 1} if using utf8mb4
Stackoverflow - double encoding and JSON
Perl and UTF-8
A workaround for an issue in Perl 5: binmode(STDOUT, ':unix:encoding(utf8):crlf');
PHP see above
pymysql:
import pymysql
con = pymysql.connect(host='127.0.0.1', port=3306,
user='root', passwd='******',
charset="utf8mb4")
Python see above
R / RStudio
Tools -> Global Options -> Code -> Saving, and set the default text encoding to UTF-8
rs <- dbSendQuery(con, 'set character set "utf8"')
rs <- dbSendQuery(con, 'SET NAMES utf8')
options(encoding = "UTF-8") at the top of my main script from which I call my package seems to fix the issue with having non-ascii characters in my package code.
read_chunk(lines = readLines("TestSpanishText.R", encoding = "UTF-8")) (also file())
Character Encoding: https://support.rstudio.com/hc/en-us/ar ... r-Encoding
Rails: Rails and utf8mb4
Ruby on Rails:
"\xD0\xA9\xD0\xB0\xD1\x81\xD0\xBB".force_encoding("UTF-8") => "Щасл"
Watch out for rake db:reset . Consider, instead, rake db:drop && rake db:create && rake db:migrate . The reset just created the tables from the already stored schema which had the wrong charset and collation. But re-creating them from the migrations created them with the new charset and collation.
RoR: application_controller.rb
def configure_charsets
response.headers["Content-Type"] = "text/html; charset=utf-8"
suppress(ActiveRecord::StatementInvalid) do
ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
end
Edit your database.yml to include the line encoding: utf8
RocksDB, MyRocks not yet answered
ServiceStack ConnectionReset=False;
Spring/Hibernate: (See also Best Practice, above.)
Hibernate XML:
<property name="hibernate.connection.CharSet">utf8mb4</property>
<property name="hibernate.connection.characterEncoding">UTF-8</property>
<property name="hibernate.connection.useUnicode">true</property>
Connection url:
db.url=jdbc:mysql://localhost:3306/db_name?useUnicode=true&character_set_server=utf8mb4
The above changes were enough for me to upgrade from utf8 to utf8mb4 charset scheme.
As a side note, one clarification: UTF-8 is the character encoding, while utf8mb4 is a character set name that MySQL supports. MySQL's utf8mb4 is a superset of MySQL's utf8.
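The superset relationship is easy to see in the byte lengths; a minimal Python sketch (MySQL's old "utf8"/utf8mb3 stores at most 3 bytes per character, utf8mb4 allows the full 4):

```python
# How many UTF-8 bytes does each character need?
def utf8_len(ch):
    return len(ch.encode("utf-8"))

print(utf8_len("é"))    # 2 -- fits in MySQL's utf8 (utf8mb3)
print(utf8_len("中"))   # 3 -- fits in utf8mb3
print(utf8_len("🙂"))   # 4 -- requires utf8mb4
```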
Spring/Hibernate filter:
<form accept-charset="UTF-8">
Spring/Hibernate: <property name="url" value="jdbc:mysql://localhost:3306/miniprojetjee?useUnicode=true&connectionCollation=utf8_general_ci&characterSetResults=utf8&characterEncoding=utf-8"/> (or maybe it is =yes)
"Spring": @RequestMapping(value = "/getRegion2", produces={"application/json; charset=UTF-8"},method = RequestMethod.GET)
Hibernate manual
Spring MVC UTF-8 Encoding
<filter>
<filter-name>encoding-filter</filter-name>
<filter-class>
org.springframework.web.filter.CharacterEncodingFilter
</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>UTF-8</param-value>
</init-param>
<init-param>
<param-name>forceEncoding</param-name>
<param-value>true</param-value>
</init-param>
</filter>
Spring Boot
Make sure you register Spring's CharacterEncodingFilter in your web.xml
Load the lib hibernate-spatial in pom.xml, and set the config spring.jpa.properties.hibernate.dialect: org.hibernate.dialect.MySQL5InnoDBDialect in application.yml Reference
sqlalchemy:
db_url = sqlalchemy.engine.url.URL(drivername='mysql', host=foo.db_host,
database=db_schema,
query={ 'read_default_file' : foo.db_config, 'charset': 'utf8mb4' })
MySQL-Unicode in sqlalchemy
engine = create_engine('mysql://root:@localhost/testdb?charset=utf8', encoding = 'utf-8')
sqoop
sqoop import --connect jdbc:mysql://server.foo.com/db --table bar --direct -- --default-character-set=latin1 (Probably the latin1 should be replaced by the encoding you have.)
Tomcat
tomcat
Wildfly: Configuration -> Subsystems -> Undertow -> Servlet/JSP -> View -> Set "Use listener encoding" to true.
WordPress: in wp-config.php, define('DB_CHARSET', 'utf8mb4'); and define('DB_COLLATE', ''); WordPress & utf8mb4
Tomcat/servlet/JSP/etc settings
vb.net / VBScript:
Dim MyConn As New MySqlConnection("Server=localhost;User Id=u;Password=x;Database=d;Charset=utf8")
<%@language="VBSCRIPT" CODEPAGE="65001" LCID=1033%>
Console.OutputEncoding = System.Text.Encoding.UTF8
web2py
db = DAL('mysql://username:password@localhost/test?set_encoding=utf8mb4')
web.xml
web.xml
Hex 80-9F of cp1252 (Windows-1252 vs ISO-8859-15) latin1 codes
MySQL 8.0
(Source)
This is out of date
Collations to add:
We are going to add following collations. The collation name contains:
a. character set name: "utf8mb4"
b. language's iso code: for example, "cs" for Czech
c. UCA version: "800"
d. accent / case insensitive: "ai_ci"
The complete list is:
Collation name language
------------------------------------------------------------
utf8mb4_cs_800_ai_ci Czech
utf8mb4_da_800_ai_ci Danish
utf8mb4_de_phonebook_800_ai_ci German (phonebook order)
utf8mb4_eo_800_ai_ci Esperanto
utf8mb4_es_800_ai_ci Spanish
utf8mb4_es_traditional_800_ai_ci Spanish (traditional)
utf8mb4_et_800_ai_ci Estonian
utf8mb4_hr_800_ai_ci Croatian
utf8mb4_hu_800_ai_ci Hungarian
utf8mb4_is_800_ai_ci Icelandic
utf8mb4_la_800_ai_ci Roman (classical Latin)
utf8mb4_lt_800_ai_ci Lithuanian
utf8mb4_lv_800_ai_ci Latvian
utf8mb4_pl_800_ai_ci Polish
utf8mb4_ro_800_ai_ci Romanian
utf8mb4_sk_800_ai_ci Slovak
utf8mb4_sl_800_ai_ci Slovenian
utf8mb4_sv_800_ai_ci Swedish
utf8mb4_tr_800_ai_ci Turkish
utf8mb4_vi_800_ai_ci Vietnamese
Instead, see (as of 8.0.0) utf8mb4 collations
Random notes
UTF-8 all the way through
Scan for 8-bit codes, find and decipher double-encoding
A table declared to be latin1, and containing latin1 bytes can be converted to utf8 via
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
Example from forums
Using Replication to assist in charset conversion?
utf8_general_ci will treat Résumé the same as resume: MySQL doesn't 'see' accents
Exact match when _ci collation is on column - even using index
WHERE keyword = CONVERT('cho' USING binary)
Re: constants in SQL:
EXPLAIN EXTENDED SELECT ...;
SHOW WARNINGS;
sometimes gives extra clues into what charset/collation issues are being faced during a SELECT. It will probably contain 'introducers'; example: _utf8'abc'.
To ponder: convert(convert((binary col1) using latin1) using utf8)
phpMyAdmin might be connecting as "root", thereby bypassing init_connect = "SET NAMES utf8". This could defeat your attempts to be fully utf8. Possible workaround: Use a login that avoids SUPER priv. (This suggestion is good advice for all applications -- for security reasons.)
# Find rows with any 8-bit chars:
SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].';
# Remove all occurrences of one 8-bit code (hex 92, in this example):
UPDATE tbl SET colname = REPLACE(colname, UNHEX('92'), '');
Reference: how to replace hex in varchar field
Converting latin1 hex to utf8 hex: HEX(CONVERT(CONVERT(UNHEX('F6') USING LATIN1) USING UTF8)) ⇒ C3B6
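The same conversion can be checked outside MySQL; a minimal Python sketch (codec names are Python's, not MySQL's):

```python
# latin1 byte F6 (ö) becomes the UTF-8 byte pair C3 B6, matching the
# CONVERT(...USING...) result above.
latin1_bytes = bytes.fromhex("F6")
utf8_bytes = latin1_bytes.decode("latin1").encode("utf-8")
print(utf8_bytes.hex().upper())   # C3B6
```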
(Note: The doc refers to version 5.5, but probably applies to all versions from 4.1 on.)
Connector/.NET possibly defaults to latin1. Consider changing 'connection string' if you need utf8.
Suggested for my.cnf by stackoverflow
stackoverflow
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
character-set-server = utf8
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
Caution: Setting default-character-set that way for the mysql client cannot be overridden by SET NAMES.
my.cnf: skip-character-set-client-handshake to pretend to be 4.0?
A possible double-encoding fix
$string = htmlspecialchars($string, ENT_QUOTES, "UTF-8");
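Double encoding can also be undone outside the database; a minimal Python sketch, assuming the common case of UTF-8 bytes that were misread as latin1 and then re-encoded:

```python
# "ö" stored correctly is bytes C3 B6; read back as latin1 it displays as
# the Mojibake "Ã¶". Reversing the misread recovers the original text.
mojibake = "Ã¶"
fixed = mojibake.encode("latin1").decode("utf-8")
print(fixed)   # ö
```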
my.cnf needs these to force the server and the client to use UTF8:
skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8
However, "The server character set and collation are used as default values if the database character set and collation are not specified in CREATE DATABASE statements. They have no other purpose." manual on charset-server
It is not clear whether skip-character-set-client-handshake is really needed.
For a description of the character_set_* settings and SET NAMES: manual on Connection Character Sets
In 5.6, this helps get Unicode rendering for mysql commandline tool for Windows: mysql --default-character-set=utf8 Windows Unicode Support
Also suggested (in my.ini):
[mysql]
default-character-set=utf8
<?xml version="1.0" encoding="utf-8"?>
REGEXP (RLIKE) with character class is likely to work wrong: Bug report from 2007
Manual Warning
<form method="post" action="/your/url/" accept-charset="UTF-8">
Changing default charset
Possible values for <meta http-equiv="Content-Type" content="text/html; charset=..." /> can be found in iana charsets
For example, use 'ISO-8859-9' for MySQL's 'latin5'.
How to fix a table that has a mixture of latin1 and utf8
Test for Chinese characters: REGEXP '^(..)*(E[2-9F]|F0A)' -- Caveat: This may not be precise. StackOverflow discussion
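The regex above works because most CJK characters have UTF-8 lead bytes in the E4-E9 range; a small Python check of the hex the regex would see:

```python
# Inspect the UTF-8 hex of a few characters, as HEX(col) would show it.
for ch in "中文ab":
    print(ch, ch.encode("utf-8").hex().upper())
# 中 is E4B8AD and 文 is E69687 -- lead bytes E4/E6 fall in E4-E9;
# plain ASCII letters are single bytes (61, 62) and do not match.
```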
In my.cnf:
collation_server = utf8mb4_unicode_520_ci
sets
character_set_database | utf8mb4
character_set_server | utf8mb4
collation_database | utf8mb4_unicode_520_ci
collation_server | utf8mb4_unicode_520_ci
but not the things set by SET NAMES, nor collation_connection.
From language name to list of characters (dwell to see unicode, not utf8)
Illustrative examples of ALTERs
5.5.3 (10/2010) Changelog: The following obsolete constructs have been removed. The --default-character-set and --default-collation server options (use --character-set-server and --collation-server). Then...
5.7.6 (9/2015) Changelog: The following items are deprecated and will be removed in a future MySQL release. The global character_set_database and collation_database system variables are deprecated and will be removed in a future version of MySQL. Assigning a value to the session character_set_database and collation_database system variables is deprecated and assignments produce a warning. The session variables will become read only in a future version of MySQL and assignments will produce an error. It will remain possible to access the session variables to determine the database character set and collation for the default database.
LOAD DATA added syntax to include charset valid for the file to be loaded. In 4.1, LOAD DATA utilises database charset. Bug 10195
Check for capitalized Swedish words
shows the difficulty of using REGEXP, and provides a workaround.
When a Stored Routine is created, it has the current CHARACTER SET and COLLATION attached to it. Beware -- this may not be what you expect later. Use SHOW PROCEDURE name to see the setting.
"ch and ll are no longer letters since 1994" - This probably means that "spanish2" collations are not needed for current texts. Reference
(in Spanish).
LIKE != = -- (a LIKE b) is almost the same as (a = b) in that collations matter. However trailing spaces can lead to different actions. And 2-byte letters (ae) are treated as two characters in LIKE, but perhaps one character, such as when compared to (ä) with utf8_german2_ci. SO Discussion
Diacritics used in Europe
$str = mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES');
Changing case
Examples of mismatch of utf8 and utf8mb4
Search for emoji
Concerning LOAD DATA INFILE: The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies “no conversion.”
Transliterate accented and Cyrilic characters into latin
A messy case where double-encoding was visible
Stored function to decode unicode escapes
ALTER TABLE mytable DEFAULT CHARACTER SET utf8 can create double encoding
If having trouble importing utf8mb4 characters, give this a try:
USE my_example_db_name;
# Select the right charset
SET NAMES 'utf8mb4';
# Import from SQL file
SOURCE /somewhere/dump.sql;
# Disconnect from SQL server
EXIT
Footnote
If this solves your problem, please let me know. My ego needs a pat on the back. If it does not solve your problem, please contact me; maybe I can pull a special rabbit out of my hat, just for you.
Original writing -- Sep 2010, after struggling with hundreds of questions both in-house and from the forums. Alas, this does not achieve the ultimate goal of "Given this symptom, do that". (See 2015) I hope it helps most people.
Revision date: 2012-08-26 (This document continues to evolve as more problems and examples are discovered.)
Revision date: 2013-02-12 (Misc, plus added 8-bit encodings)
Revision date: 2015-March/April Finally getting close to "If you see this, this is what happened, and here's how to fix it". Add better summary of what errors occur and how to fix them. Add/fix several links.
Revision date: 2016-01-25 Shuffled language info.
Revision date: 2016-09-14 Beginnings of 8.0, including utf8mb4...unicode_800_ci
-- Rick James
MySQL Documents by Rick James
HowTo Techniques for Optimizing Tough Tasks:
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
Overview Summary Tables High speed ingestion
Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative
5 methods for 'Find Nearest'
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARTITION use)
Lat/Long representation choices
Z-Order 'find nearest'(under construction)
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Build and execute a "Pivot" SELECT (showing rows as columns)
Find largest row for each group ("groupwise max")
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8 If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
Python tips, PHP tips, other language tips
utf8 Collations utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table wp_postmeta UNION+OFFSET
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Request for tuning / slowlog info
Best of MySQL Forum -- index of lots of tips, discussions, etc
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS Analyze SlowLog
My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request: View Rick James's profile on LinkedIn
Did my articles help you out? Like what you see? Consider donating:
Buy me a Banana Latte ($4) There is no obligation but it would put a utf8mb4 smiley on my face, instead of the Mojibake "ðŸ™‚"