Reputation: 37
What I'm doing
I'm working on a chat application (written in PHP), which allows users to create their own chat rooms. A user may name a chat room anything they like and this name is passed on to the MySQL database in a prepared statement as the table name for that respective chat room.
It is understood that there is no log in / security measure for this application and the table holding the chat log is composed of records with simply the user submitted text and timestamp (2 columns, without counting an AUTO_INCREMENT
primary key).
What I'm facing
Given the simple nature of this application, I don't have the intention of changing the structure of the database, but I'm now running into the issue when a user enters emoji (or other supplementary characters) as the name for their own chat room. Passing such information on to the database as is will convert the characters into question marks, due to the way MySQL works internally (https://dev.mysql.com/doc/refman/5.7/en/identifiers.html):
Identifiers are converted to Unicode internally. [..] ASCII NUL (U+0000) and supplementary characters (U+10000 and higher) are not permitted in quoted or unquoted identifiers.
What should / can I do to avoid this problem? Is there a best practice for "escaping" / "sanitizing" user input in a situation like this? I put the respective words in quotation marks because I know it is not the proper / typical way of handling user input in a database.
What I'm trying
An idea I had was using rawurlencode()
to literally break down the supplementary characters into unique sequences that I can pass on to the database and still be sure that a chat room with the name ðŸ¤
is not confused with 🤔
. However, I have the impression based on this answer that this is not good practice: https://stackoverflow.com/a/8700296/1564356.
Tackling this issue another way, I thought of base64_encode()
, but again based on this answer it is not an ideal approach: https://stackoverflow.com/a/24175941/1564356. I'm wondering however, if in this case it would still be an acceptable one.
A third option would be to construct the database in a different way by issuing unique identifiers as the table names for each respective chat room and storing the utf8mb4
compatible string in a column. Then a second table with the actual chat log can be linked with a foreign key. This however complicates the structure of the database and doubles the amount of tables required. I'm not a fan of this approach.
Any ideas? Thanks!
Upvotes: 0
Views: 249
Reputation: 142528
CHARACTER SET utf8mb4
is needed end-to-end for MySQL in order to store Emoji and some Chinese characters.
In this you will find more on "best practice" and debugging tips when you fail to follow the best practice. It's not just the column charset, it is also the client's charset.
Do not use any encode/decode routines; it only makes the mess worse.
It is best to put the actual characters in MySQL tables, not Unicode strings like U+1F914
or \u1F914
, etc.
🤔
is 4 bytes of hex F09FA494
when encoded in UTF-8 (aka MySQL's utf8mb4).
And, I agree with IMSoP; don't dynamically create table.
SQL Injection should be countered with mysqli_real_escape_string (or equivalent, depending on the API), not urlencode or base64.
Upvotes: 0
Reputation: 97996
Dynamically created tables, regardless of their naming scheme, are very rarely a sensible design choice. They make every single query you write more complicated, and eliminate a large part of the usefulness of SQL as a language and relational databases as a concept.
Furthermore, allowing users to directly choose table names sounds like a security disaster waiting to happen. Prepared statements will not save you in any way, because the table name is considered part of the query, not part of the data.
Unless you have a very compelling reason for such an unusual design, I would strongly recommend changing to have a single table of chat_logs
, with a column of chat_room_id
which references a chat_rooms
table. The chat_rooms
table can then contain the name, which can contain any characters the user wants, along with additional data about the room - creation date, description, extra features, etc. This approach requires exactly 2 tables, however many chat rooms are created.
If you really think you need the separate table for each chat room, because you're trying to do some clever partitioning / sharding, I would still recommend having a chat_rooms
table, and then you can simply name the tables after the chat_room_id
, e.g. chat_logs_1
, chat_logs_2
, etc. This approach requires exactly one more table than your current approach, i.e. num_tables = num_chat_rooms + 1.
Upvotes: 2