Mysql Does Not Load Accents Despite Utf8

April 3, 2014

Have you ever worked with a database that had data with accents? You’ve exhausted every option from the file encoding to the database collation to anything else yet you still get that pesky black diamond with the blasted question mark. What if I told you there was one more way?

That way is the default character_set_server value. It’s most likely not utf8mb4 and most likely latin1.

Run this query in the database:

mysql> show variables like "%character%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Notice something fishy? Yep. It’s that latin1. If the data coming from the database is utf8, how will mysqld handle it? latin style but since it’s not the same – error. Let’s change that to utf8mb4.

Wait a minute, utf8mb4?

Since MySQL 5.5.3 you should use utf8mb4 rather than utf8. While they both refer to the UTF-8 encoding, the older utf8 had a MySQL-specific limitation preventing use of characters numbered above 0xFFFD whereas utf8mb4 was the refactored version.

##Fixing it

1. Set default character set

Run this SQL query to set utf8mb4 as your default database character set:

ALTER DATABASE db_name CHARACTER SET utf8mb4;
ALTER DATABASE db_name COLLATE 'utf8mb4_general_ci';

If that didn’t work – keep reading.

On a Windows machine?

When you run these queries on a database hosted in a Windows machine, you may run into trouble:

ALTER DATABASE db_name CHARACTER SET utf8mb4;
ALTER DATABASE db_name COLLATE 'utf8mb4_general_ci';

This is because of the way Windows locks their files. Nonetheless, we have one more way here:

  1. Dump that database (structure and content, no CREATE info)
  2. Drop that database
  3. Create a database with the utf8mb4 character set and collation
  4. Import the dump into it

Tiresome, I know. Did that work? No? Let’s move on.

2. Alter the my.cnf config file

You can append this to the configuration file that mysql uses. It should be in the /etc/mysql directory or somewhere there:

[mysqld]
init_connect='SET collation_connection = utf8mb4_general_ci'
init_connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci
skip-character-set-client-handshake

The most important line is skip-character-set-client-handshake. If the client requests a connection in utf8, it would still use utf8mb4. This is important because a lot of people still use utf8. If you want that fire emoji in your database, you need to have utf8mb4.

This configuration tells mysqld what queries to run and what values are to be set. Simple stuff.

Restart MySQL:

sudo /etc/init.d/mysql restart

and there we have it:

mysql> show variables like "%character%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.03 sec)

Your database should should showing that fire emoji and those accented characters!

Diagnostics

To see the individual character set of a database, run this:

mysql> show create database db_name;
+----------+---------------------------------------------------------------+
| Database | Create Database                                               |
+----------+---------------------------------------------------------------+
| fbdb     | CREATE DATABASE `fbdb` /*!40100 DEFAULT CHARACTER SET utf8 */ |
+----------+---------------------------------------------------------------+
1 row in set (0.00 sec)

This database would have been created with the default character set of utf8 due to your settings.

You can also see which character sets are available to you based on the version of MySQL you carry:

mysql> show collation where charset='utf8mb4';
+------------------------+---------+-----+---------+----------+---------+
| Collation              | Charset | Id  | Default | Compiled | Sortlen |
+------------------------+---------+-----+---------+----------+---------+
| utf8mb4_general_ci     | utf8mb4 |  45 | Yes     | Yes      |       1 |
| utf8mb4_bin            | utf8mb4 |  46 |         | Yes      |       1 |
| utf8mb4_unicode_ci     | utf8mb4 | 224 |         | Yes      |       8 |
| utf8mb4_icelandic_ci   | utf8mb4 | 225 |         | Yes      |       8 |
| utf8mb4_latvian_ci     | utf8mb4 | 226 |         | Yes      |       8 |
| utf8mb4_romanian_ci    | utf8mb4 | 227 |         | Yes      |       8 |
| utf8mb4_slovenian_ci   | utf8mb4 | 228 |         | Yes      |       8 |
| utf8mb4_polish_ci      | utf8mb4 | 229 |         | Yes      |       8 |
| utf8mb4_estonian_ci    | utf8mb4 | 230 |         | Yes      |       8 |
| utf8mb4_spanish_ci     | utf8mb4 | 231 |         | Yes      |       8 |
| utf8mb4_swedish_ci     | utf8mb4 | 232 |         | Yes      |       8 |
| utf8mb4_turkish_ci     | utf8mb4 | 233 |         | Yes      |       8 |
| utf8mb4_czech_ci       | utf8mb4 | 234 |         | Yes      |       8 |
| utf8mb4_danish_ci      | utf8mb4 | 235 |         | Yes      |       8 |
| utf8mb4_lithuanian_ci  | utf8mb4 | 236 |         | Yes      |       8 |
| utf8mb4_slovak_ci      | utf8mb4 | 237 |         | Yes      |       8 |
| utf8mb4_spanish2_ci    | utf8mb4 | 238 |         | Yes      |       8 |
| utf8mb4_roman_ci       | utf8mb4 | 239 |         | Yes      |       8 |
| utf8mb4_persian_ci     | utf8mb4 | 240 |         | Yes      |       8 |
| utf8mb4_esperanto_ci   | utf8mb4 | 241 |         | Yes      |       8 |
| utf8mb4_hungarian_ci   | utf8mb4 | 242 |         | Yes      |       8 |
| utf8mb4_sinhala_ci     | utf8mb4 | 243 |         | Yes      |       8 |
| utf8mb4_german2_ci     | utf8mb4 | 244 |         | Yes      |       8 |
| utf8mb4_croatian_ci    | utf8mb4 | 245 |         | Yes      |       8 |
| utf8mb4_unicode_520_ci | utf8mb4 | 246 |         | Yes      |       8 |
| utf8mb4_vietnamese_ci  | utf8mb4 | 247 |         | Yes      |       8 |
+------------------------+---------+-----+---------+----------+---------+
26 rows in set (0.00 sec)