Manuel

Reputation: 13

MATLAB: Alphanumeric character string extraction

As a foreword, I have been searching for a solution to this and have tried many snippets, but none of them works for my specific case.

I have a variable that holds the registration numbers of different UK firms. The data was originally from Stata, and I had to use a script to import the non-numeric data into MATLAB. This variable (regno) is purely numeric up to roughly observation 18000. From then on, the registration numbers contain both letters and numbers.

I wrote a very crude loop that takes the initial variable (a cell array), strips the double quotes, and extracts the characters into another matrix (double). The code is:

regno2 = strrep(regno,'"','');
regno3 = cell2mat(regno2);
regno4 = zeros(size(regno3,1),1);
for i = 1:size(regno3,1)
    regno4(i,1) = str2double(regno3(i,1:8));
end

For the entries with both letters and numbers I get NaN. I need the values as doubles in order to use them as dummy indicator variables in MATLAB. Any ideas?

Thanks

Upvotes: 1

Views: 577

Answers (1)

Benoit_11

Reputation: 13945

OK, I'm not entirely sure whether you always need to keep the letters, but regular expressions would likely do what you want here.

Here is a simple example to help you get started; in this case I use regexp to locate the numbers in your entries.

clear

%// Create dummy entries
Case1 = 'NI000166';
Case2 = '12ABC345';

%// Put them in a cell array, like what you have.
CasesCell = {Case1;Case2};

%// Use regexp to locate the digits in each entry. This gives the indices
%// of the digits, i.e. their positions within each entry. Note that regexp
%// can operate on cell arrays, which is useful to us here.
NumberIndices = regexp(CasesCell,'\d');

%// Use cellfun to fetch the actual characters in each entry, based on the indices calculated above.
NumbersCell = cellfun(@(x,y) x(y),CasesCell,NumberIndices,'UniformOutput',false)

Now NumbersCell looks like this:

NumbersCell = 

    '000166'
    '12345'

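You can convert it to a number with str2double (or str2num) and you're good to go. str2double is usually the safer choice, since it accepts a cell array of strings directly and does not evaluate its input the way str2num does.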
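As a minimal sketch of that conversion step, reusing the two digit strings from the example above (NumbersCell is redefined here only so the snippet runs on its own):

```matlab
%// Digit-only strings, as produced by the regexp/cellfun step above
NumbersCell = {'000166'; '12345'};

%// str2double accepts a cell array of strings and returns a double array;
%// leading zeros are dropped in the conversion
Numbers = str2double(NumbersCell);
```

Here Numbers should come out as a 2-by-1 double column containing 166 and 12345.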

Note that for entries such as 00001234 and SC001234, the digit strings returned by regexp differ ('00001234' versus '001234'), but both become 1234 once converted to a number, so two distinct firms could end up with the same value. If that can happen in your data, you would need to add a bit of code with regexp to take the letters into account. Hope that helps! If you need clarification or if I misunderstood something, please tell me!
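Since the goal in the question is dummy indicator variables, one alternative that may be worth considering is to skip the numeric conversion and map each distinct registration string to an integer ID with unique. This is a rough sketch, not the method from the answer above. It assumes regno2 is a cell array of strings with the quotes already stripped, as in the question, and the sample values are made up for illustration:

```matlab
%// Hypothetical registration numbers (assumed sample data), quotes already stripped
regno2 = {'00001234'; 'SC001234'; 'NI000166'; '00001234'};

%// unique returns the distinct strings in sorted order; id(k) is the
%// position of regno2{k} in that list, so equal strings share an ID
[firms, ~, id] = unique(regno2);

%// One dummy column per distinct firm: D(k,j) is 1 when observation k belongs to firm j
D = double(bsxfun(@eq, id(:), 1:numel(firms)));
```

With this approach 00001234 and SC001234 keep separate IDs, because the letters are part of the string being compared. With many thousands of distinct firms a full dummy matrix gets large, so the id vector alone may be the more practical thing to carry around.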

Upvotes: 1
