Brandon - Free Palestine

Reputation: 16666

What are some valid and invalid UTF-8 strings I can use for my unit tests?

I wrote two functions in PHP, str_to_utf8() and seems_utf8() (well, they are assembled from parts I borrowed from other code). Now I'm writing unit tests for them and want to make sure the tests are sound. So far I have taken these from Facebook:

public function test_str_to_utf8()
{
    // Make sure ASCII characters are left unchanged
    $this->assertEquals( "this\x01 is a \x7f test string", str_to_utf8( "this\x01 is a \x7f test string" ) );

    // Make sure valid UTF-8 characters are left unchanged
    $this->assertEquals( "\xc3\x9c \xc3\xbc \xe6\x9d\xb1!", str_to_utf8( "\xc3\x9c \xc3\xbc \xe6\x9d\xb1!" ) );

    // Test long strings
    #str_to_utf8( str_repeat( 'x', 1024 * 1024 ) );
    $this->assertEquals( TRUE, TRUE );

    // Test some invalid UTF-8 to see if each bad sequence is replaced with U+FFFD
    $input = "\xc3 this has \xe6\x9d some invalid utf8 \xe6";
    $expect = "\xEF\xBF\xBD this has \xEF\xBF\xBD\xEF\xBF\xBD some invalid utf8 \xEF\xBF\xBD";
    $this->assertEquals( $expect, str_to_utf8( $input ) );
}

Are those valid test cases?

Upvotes: 1

Views: 742

Answers (1)

CAMason

Reputation: 1122

I find this resource useful when testing UTF-8.

If you use any of the non-Latin-1 text, you'll need to either ensure your PHP file is saved as UTF-8 or pre-escape the characters as byte sequences in a double-quoted string (for example, "\xe6\x9d\xb1" for 東).
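
As a starting point, here is a minimal sketch of fixture strings written as escaped byte sequences, so the file's own encoding does not matter. The array names and the is_well_formed_utf8() helper are hypothetical and not part of your code. The helper relies on preg_match() with the /u modifier failing when the subject is malformed UTF-8, and the sketch assumes PHP 7 or later.

```php
<?php
// Hypothetical fixtures: well-formed UTF-8, written as raw bytes.
$validUtf8 = [
    'ascii'            => "plain ASCII",
    '2-byte (U+00FC)'  => "\xc3\xbc",
    '3-byte (U+6771)'  => "\xe6\x9d\xb1",
    '4-byte (U+1F600)' => "\xf0\x9f\x98\x80",
    'near max (U+10FFFD)' => "\xf4\x8f\xbf\xbd",
];

// Hypothetical fixtures: byte sequences that are NOT well-formed UTF-8.
$invalidUtf8 = [
    'lone continuation byte'   => "\x80",
    'truncated 2-byte'         => "\xc3",
    'truncated 3-byte'         => "\xe6\x9d",
    'bad continuation byte'    => "\xc3\x28",
    'overlong "/" (2 bytes)'   => "\xc0\xaf",
    'overlong "/" (3 bytes)'   => "\xe0\x80\xaf",
    'UTF-16 surrogate U+D800'  => "\xed\xa0\x80",
    'above U+10FFFF'           => "\xf4\x90\x80\x80",
    'byte never used in UTF-8' => "\xff",
];

// Hypothetical reference check, used only to sanity-check the fixtures.
// preg_match() returns 1 when the empty pattern matches, and false when
// the /u modifier rejects the subject as malformed UTF-8.
function is_well_formed_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Report any fixture whose classification disagrees with the reference check.
foreach ($validUtf8 as $label => $bytes) {
    if (!is_well_formed_utf8($bytes)) {
        echo "unexpectedly invalid: $label\n";
    }
}
foreach ($invalidUtf8 as $label => $bytes) {
    if (is_well_formed_utf8($bytes)) {
        echo "unexpectedly valid: $label\n";
    }
}
```

In a unit test you could loop over both arrays and compare each entry against seems_utf8(), assuming it returns a boolean. The overlong, surrogate, and out-of-range cases are the ones hand-rolled validators most often get wrong, so they are worth keeping even if the list is trimmed.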

Upvotes: 1
