Reputation:
Given the following code that attempts to create 2 folders in the current MATLAB path:
%%
u_path1 = native2unicode([107, 97, 116, 111, 95, 111, 117, 116, 111, 117], 'UTF-8'); % 'kato_outou'
u_path2 = native2unicode([233 129 142, 230 184 161, 229 191 156, 231 173 148], 'UTF-8'); % '過渡応答'
mkdir(u_path1);
mkdir(u_path2);
the first mkdir call succeeds while the second fails, with the error message "The filename, directory name, or volume label syntax is incorrect". However, creating the folders manually in the "Current Folder" GUI panel ([right click]⇒New Folder⇒[paste name]) encounters no problem. This kind of glitch appears in most of MATLAB's low-level I/O functions (dir, fopen, copyfile, movefile, etc.), and I'd like to use all of these functions.
The environment is:
thus the filesystem supports Unicode chars in path, and MATLAB can store true Unicode strings (and not "fake" them).
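A quick way to check the "true Unicode strings" claim is to look at the code points MATLAB stores after decoding. The snippet below is only a sketch: it reuses the byte values from the question, and the variable names bytes and codePoints are introduced here for illustration.

```matlab
% Decode the UTF-8 bytes from the question and inspect the stored code points.
bytes = uint8([233 129 142, 230 184 161, 229 191 156, 231 173 148]);
u_path2 = native2unicode(bytes, 'UTF-8');

% If decoding worked, there is one char element per character (4 here),
% not one per byte (12).
codePoints = double(u_path2);
disp(numel(u_path2))   % 4
disp(codePoints)       % 36942 28193 24540 31572 (U+904E U+6E21 U+5FDC U+7B54)
```

If numel(u_path2) came back as 12, the bytes would have been stored undecoded, which would point to a decoding problem rather than a filesystem one.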
The official mkdir documentation elegantly{1} avoids the issue by stating that the correct syntax for calling the function is:
mkdir('folderName')
which suggests that the only officially supported call uses a string literal for the folder name argument, not a string variable. That would also suggest the eval way, which I'm testing as I write this post to see if it works.
I wonder if there is a way to circumvent these limitations. I would be interested in solutions that:
- don't rely on undocumented/unsupported MATLAB features;
- don't involve system-wide changes (e.g. changing the operating system's locale settings);
- may, if necessary, rely on non-native MATLAB libraries, as long as the resulting handles/objects can be converted to native MATLAB objects and manipulated as such;
- may, if necessary, rely on manipulations of the paths that would render them usable by the standard MATLAB functions, even if Windows-specific (e.g. short-name paths).
Later edit
What I'm looking for are implementations for the following functions, which will shadow the originals in the code that is already written:
function listing = dir(folder);
function [status,message,messageid] = mkdir(folder1,folder2);
function [status,message,messageid] = movefile(source,destination,flag);
function [status,message,messageid] = copyfile(source,destination,flag);
function [fileID, message] = fopen(filename, permission, machineformat, encoding);
function status = fclose(fileID);
function [A, count] = fread(fileID, sizeA, precision, skip, machineformat);
function count = fwrite(fileID, A, precision, skip, machineformat);
function status = feof(fileID);
function status = fseek(fileID, offset, origin);
function [C,position] = textscan(fileID, varargin); %'This one is going to be funny'
Not all the output types need to be interchangeable with those of the original MATLAB functions; however, they need to be consistent between function calls (e.g. fileID between fopen and fclose). I am going to update this declaration list with implementations as soon as I get/write them.
{1} for very loose meanings of the word "elegant".
Upvotes: 12
Views: 1555
Reputation: 24179
Some useful information on how MATLAB handles filenames (and characters in general) is available in the comments of this UndocumentedMatlab post (especially those by Steve Eddins, who works at MathWorks). In short:
"MathWorks began to convert all the string handling in the MATLAB code base to UTF-16 .... and we have approached it incrementally"
--Steve Eddins, December 2014.
This statement implies that the newer the MATLAB version, the more functions support UTF-16. So if upgrading your version of MATLAB is an option, it may be the easiest solution to your problem.
Below is a list of functions that were tested by users on different platforms, according to the functionality that was requested in the question:
The following command creates a directory with UTF-16 characters in its name ("תיקיה" in Hebrew, in this example) from within MATLAB:
java.io.File(fullfile(pwd,native2unicode(...
[255 254 234 5 217 5 231 5 217 5 212 5],'UTF-16'))).mkdir();
Tested on:
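The java.io.File call above could be wrapped in a function with a mkdir-like signature, along the lines the question asks for. This is a minimal sketch, not a tested drop-in replacement: the name mkdir_unicode is hypothetical (chosen so that it does not shadow the built-in), it only covers the single-argument form, and it assumes MATLAB was started with the JVM enabled.

```matlab
function [status, message] = mkdir_unicode(folder)
% MKDIR_UNICODE  Hypothetical helper (not a MATLAB built-in): creates a
% folder through java.io.File instead of the built-in mkdir.
    f = java.io.File(folder);
    if ~f.isAbsolute()
        % Java resolves relative paths against its own working directory,
        % which may differ from MATLAB's, so anchor them to pwd explicitly.
        f = java.io.File(fullfile(pwd, folder));
    end
    if f.isDirectory()
        status = true;
        message = 'Directory already exists.';
        return
    end
    status = f.mkdirs();   % also creates missing parent folders
    if status
        message = '';
    else
        message = 'java.io.File.mkdirs() reported failure.';
    end
end
```

A call such as [ok, msg] = mkdir_unicode(u_path2) would then replace mkdir(u_path2). Whether it succeeds still depends on the platform and MATLAB version, as with the one-liner above.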
The following commands also seem to create directories successfully:
mkdir(native2unicode([255 254 234 5 217 5 231 5 217 5 212 5],'utf-16'));
mkdir(native2unicode([215,170,215,153,215,167,215,153,215,148],'utf-8'));
Tested on:
The following commands successfully open a file having unicode characters both in its name and as its content:
fid = fopen([native2unicode([255,254,231,5,213,5,209,5,229,5],'utf-16') '.txt']);
txt = textscan(fid,'%s');
Tested on:
feature('DefaultCharacterSet') => windows-1255, by Dev-iL. Note: the scanned text appears correctly in the Variables view. The text file can be edited and saved from the MATLAB editor with UTF characters intact.
When feature('DefaultCharacterSet') is set to utf-8 before using textscan, the output of celldisp(txt) is displayed correctly. The same applies to the Variables view.
Upvotes: 3
Reputation: 489
Try to use UTF-16 if you are on Windows, because NTFS uses UTF-16 for filename encoding and Windows has two sets of APIs: the ones that work with so-called 'Windows Codepages' (1250, 1251, 1252 etc.) and use C's char data type, and the ones that use wchar_t. The latter type has a size of 2 bytes on Windows, which is enough to store UTF-16 code units.
The reason your first call worked is that the first 128 code points in the Unicode Standard are encoded in UTF-8 identically to the 128 ASCII characters (this was done on purpose for backwards compatibility). UTF-8 uses 1-byte code units (instead of the 2-byte code units of UTF-16), and software such as MATLAB usually does not process filenames, so it only needs to store byte sequences and pass them to the OS APIs. The second call probably failed because Windows rejects the UTF-8 byte sequences representing those code points, since some byte values are prohibited in filenames. On POSIX-conformant operating systems most APIs are byte-oriented, and the standard pretty much prevents you from using the existing multibyte encodings (e.g., UTF-16, UTF-32) in APIs, so you have to use char* APIs and encodings with 1-byte code units:
POSIX.1-2008 places only the following requirements on the encoded values of the characters in the portable character set:
...
- The encoded values associated with <period> and <slash> shall be invariant across all locales supported by the implementation.
- The encoded values associated with the members of the portable character set are each represented in a single byte. Moreover, if the value is stored in an object of C-language type char, it is guaranteed to be positive (except the NUL, which is always zero).
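The encoding claims made before the POSIX excerpt can be checked from within MATLAB with unicode2native. This is an illustrative sketch only: the sample strings are taken from the question, and the variable names are introduced here.

```matlab
% ASCII-only name: its UTF-8 bytes are identical to its ASCII codes.
ascii_name = 'kato_outou';
utf8_ascii = unicode2native(ascii_name, 'UTF-8');
disp(isequal(double(utf8_ascii), double(ascii_name)))   % 1

% One non-ASCII character (U+904E, the first character of the second name).
kanji = native2unicode(uint8([233 129 142]), 'UTF-8');
utf8_kanji  = unicode2native(kanji, 'UTF-8');      % three 1-byte code units
utf16_kanji = unicode2native(kanji, 'UTF-16LE');   % one 2-byte code unit
disp(utf8_kanji)    % 233 129 142
disp(utf16_kanji)   % 78 144  (0x4E 0x90, little-endian)
```

The same character therefore reaches a byte-oriented API as three bytes in UTF-8, but as a single 16-bit unit in the wchar_t-based Windows APIs.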
Not all POSIX-conformant operating systems validate filenames other than for period or slash, so you can pretty much store garbage in filenames. Mac OS X, as a POSIX system, uses byte-oriented (char*) APIs, but the underlying HFS+ uses UTF-16 in NFD (Normalization Form D), so some processing is done at the OS level before saving a filename.
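To see what Normalization Form D does to a name, the Java runtime bundled with MATLAB can be used. The snippet below is a hedged sketch: it assumes MATLAB is running with the JVM, it shows standard Unicode NFD (HFS+ is described above only as using NFD, and its exact rules are not verified here), and the variable names are introduced for illustration.

```matlab
% Precomposed character U+00E9 (e with acute accent) as a MATLAB char.
composed = char(233);

% Nested Java enums are reached via javaMethod and the '$' class name.
nfd = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFD');
decomposed = char(java.text.Normalizer.normalize(java.lang.String(composed), nfd));

disp(double(composed))     % 233
disp(double(decomposed))   % 101 769  (e followed by combining acute U+0301)
```

A filename typed in composed form may therefore come back from such a filesystem as a different sequence of code points, so comparing names with strcmp can fail unless both sides are normalized first.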
Windows does not perform any type of Unicode normalization and stores filenames in whatever form they are passed in UTF-16 (provided NTFS is used) or Windows Codepages (not sure how they handle this on the filesystem level - probably by conversion).
So, how does this relate to MATLAB? Well, it is cross-platform and has to deal with many issues because of that. One of them is that Windows has char APIs for Windows Codepages and certain forbidden characters in filenames, while other OSes do not. They could implement system-dependent checks, but that would be much harder to test and support (a lot of code churn, I guess).
My best advice is to use UTF-16 on Windows, implement platform-dependent checks, or use ASCII if you need portability.
Upvotes: 0