2014-12-18

UTF8MB4 character set in Amazon RDS MySQL server and SQLAlchemy client

I've recently had to move a massive dataset that includes UTF-8 strings which contains extended set code points (i.e. planes other than the Basic Multilingual Plane, including Emoji) into a MySQL database hosted using Amazon's RDS.

Even though all of the databases and tables were configured to use purely utf8 character set and the unicode_ci collation, and though SQLAlchemy was also configured to use UTF8, I quickly ran into issues:

Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1 

The solution was:
  1. Read Mathias Bynens' awesome tutorial: The things we do to store U+01F4A9 PILE OF POO () correctly 
  2. After you created a backup of your current database, change the MySQL server settings via the Parameter Groups section of the RDS console:
    • Click "Create DB Parameter Group" 
    • Choose the correct Group Family (probably mysql5.6)
    • Input a group name and description (probably "mysql5.6-utf8mb4" and "MySQL 5.6 using UTF8MB4 charset by default"
    • Select this new Parameter Group in the console and click "Edit Parameters". Set the following parameter values:
      character_set_client: utf8mb4
      character_set_database: utf8mb4
      character_set_results: utf8mb4
      character_set_connection: utf8mb4
      collation_connection: utf8mb4_unicode_ci
      collation_server: utf8mb4_unicode_ci
      character_set_server: utf8mb4
      and click "Save Changes".
    • Go to the "Instances" dashboard, right click your RDS instance and "Modify" it, change the "Parameter Group" option to your newly created Parameter Group and click "Continue", "Modify DB Instance".
    • You can "Reboot" the instance if you want to be extra sure the new configuration was loaded.
  3. (Optional) Change the MySQL client settings. For the CLI mysql client, edit /etc/mysql/my.cnf and under [client] add:
    [client]
    default-character-set = utf8mb4
    This is to allow proper viewing of data using the mysql tool.
  4. Modify your existing databases, tables and columns to use UTF8MB4, as explained in the tutorial in part 1.
  5. Modify your SQLAlchemy connection string from:
    mysql+mysqldb://user:password@host:port/dbname
    to:
    mysql+mysqldb://user:password@host:port/dbname?charset=utf8mb4
    (whether or not to add use_unicode=0 is left for the programmer's discretion.)
Enjoy.

5 comments:

  1. Nice post, I bookmark your blog because I found very good information on your blog, Thanks for sharing more information. Regards aws jobs in hyderabad.

    ReplyDelete
  2. Your website is really cool and this is a great inspiring article.
    Amazon

    ReplyDelete
  3. Hi Alon,

    Can you please publish the link for part 1 of this blog ?

    ReplyDelete
  4. Tried this, somhow doesn't seem to change the character_set_client, character_set_connection, character_set_results values. The other variables got changed, just not these. Any idea why?

    ReplyDelete
  5. Nice work, your blog is concept-oriented, kindly share more blogs like this
    AWS Online Course

    ReplyDelete