Ruslan
Ruslan

Reputation: 51

Mysql string check on equals is false for the same values

I have a problem with MySql I have a table with parsed informations from websites. A strange string interpretation appear:

the query

select id, address from pagesjaunes_test where address = substr(address,1,length(address)-1)

return a set of values instead of none

at beginning I executed functions as:

address = replace(address, '\n', '')
address = replace(address, '\t', '')
address = replace(address, '\r', '')
address = replace(address, '\r\n', '')
address = trim(address)

but the problem still persist.

Values of field 'address' have some french chars , but the query returned also values that contains only alfanumeric english chars.

Another test: I tried to check the length of strings and ... the strlen() from PHP and LENGTH() from MYSQL display different results! Somewhere difference is by 2 chars, somewhere by 1 character without a specific "rule".

Visual I can't see any space or tabs or something else.

After I modified an address manualy(I deleted all string and I wrote it again), the problem is solved, but I have ~ 6000 values, so this is not a solution :)

What can be the problem?

I suppose that strings can have something as an "empty char", but how to detect and remove it?

Thanks

P.S. the problem is not just length. I need to join this table with other one and using a condition that check if values from fields 'address' are equals. Even if the fields have the same collation and tables have the same collation, query returns that no addresses match

E.g.

For query:

SELECT p.address,char_length(p.address) , r.address, char_length(r.address) 
FROM `pagesjaunes_test` p
LEFT JOIN restaurants r on p.name=r.name
WHERE  
p.postal_code=r.postal_code 
and p.address!=r.address
and p.phone='' 
and p.cuisines=''
LIMIT 10

So: p.address!=r.address

The result is:

+-------------------------------------+------------------------+--------------------------+------------------------+
| address                             | char_length(p.address) | address                  | char_length(r.address) |
+-------------------------------------+------------------------+--------------------------+------------------------+
| Dupin Marc13 quai Grands Augustins  |                     34 | 13 quai Grands Augustins |                     24 |
| 39 r Montpensier                    |                     16 | 39 r Montpensier         |                     16 |
| 8 r Lord Byron                      |                     14 | 3 r Balzac               |                     10 |
| 162 r Vaugirard                     |                     15 | 162 r Vaugirard          |                     15 |
| 32 r Goutte d'Or                    |                     16 | 32 r Goutte d'Or         |                     16 |
| 2 r Casimir Périer                  |                     18 | 2 r Casimir Périer       |                     18 |
| 20 r Saussier Leroy                 |                     19 | 20 r Saussier Leroy      |                     19 |
| Senes Douglas22 r Greneta           |                     25 | 22 r Greneta             |                     12 |
| Ngov Ly Mey44 r Tolbiac             |                     23 | 44 r Tolbiac             |                     12 |
| 33 r N-D de Nazareth                |                     20 | 33 r N-D de Nazareth     |                     20 |
+-------------------------------------+------------------------+--------------------------+------------------------+

As you see, "162 r Vaugirard", "20 r Saussier Leroy" contains only ASCII chars, have the same length but aren't equals!

Upvotes: 0

Views: 802

Answers (3)

Ruslan
Ruslan

Reputation: 51

Finally, I found the problem. After changed collation to ascii_general_ci all non-ascii chars was transformed to "?". Some spaces also was replaced with "?". After check initial values, function ORD() from MySQL returned 160 (instead of 32) for these spaces. So,

UPDATE pagesjaunes_test SET address = TRIM(REPLACE(REPLACE(address, CHAR(160), ' '), '  ',' ')

resolved my question.

Upvotes: 0

Mosty Mostacho
Mosty Mostacho

Reputation: 43474

The official documentation says:

Returns the length of the string str, measured in bytes. A multi-byte character counts as multiple bytes. This means that for a string containing five two-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.

So, use CHAR_LENGTH instead :)

select id, address from pagesjaunes_test
where address = substr(address, 1, char_length(address) - 1)

Upvotes: 1

Johannes Jarolim
Johannes Jarolim

Reputation: 35

Maybe have a look at the encoding of the mysql text fields - UTF8 encodes most of its characters with 2 bytes - only a small subset of UTF8 (ASCII characters for example) get encoded with one byte.

MySQL knows UTF8 and counts right. PHP text functions aren't UTF8 aware and count the bytes itself.

So if PHP counts more than MYSQL, this is probably the cause and you could have a look at utf8decode.

br from Salzburg!

Upvotes: 3

Related Questions