Steve Scher

Reputation: 172

Removing non-numbers from a string in SPSS

Consider the following data:

[Image: sample data, numbers mixed with text]

As you can see, the values of the variable are inherently numeric, but some of them include text. I have tried every permutation of DO REPEAT...END REPEAT I could think of to remove the non-numeric characters and leave just the numbers, without success.

Is there some syntax that will do it? Is there a function that checks whether a substring contains any of a set of characters? Then I could create a set that represents all the digits, loop through each character in the string, and if it is not in the set, replace it with an empty string.

Upvotes: 1

Views: 1816

Answers (1)

horace_vr

Reputation: 3166

This IBM Support page addresses a somewhat similar question: https://www.ibm.com/support/pages/removing-unwanted-characters-strings

You will have many more characters to search for (all of a-z and A-Z, and probably some non-letter characters as well), but it should work. You might also want to use the newer CHAR.INDEX and CHAR.REPLACE functions if you are using SPSS 23 or newer; see the official IBM SPSS documentation on them: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/base/syn_transformation_expressions_string_functions.html

Later edit (after clarifications and suggestions from the OP):

You need to adjust two things in the IBM examples:

  1. hardcode the loop to run a fixed number of iterations, k, instead of exiting when #I=0 (that will stop at the first character it does not find). In the example below, k runs from 1 to CHAR.LENGTH(x), which is always enough because each pass removes at most one character.

  2. specify every character you want to remove: a to z, space, the single quote (written as two consecutive single quotes inside the string literal), and so on; anything you think you might need to clean out. Then this should work:

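    * Lower-case first so that only a-z needs to be listed.
    COMPUTE x=LOWER(x).
    * One pass per character is enough: each pass removes at most one.
    LOOP k=1 TO CHAR.LENGTH(x).
    * Divisor 1 makes CHAR.INDEX search for any single listed character.
    COMPUTE #I = CHAR.INDEX(x,'abcdefghijklmnopqrstuvwxyz+, ''',1).
    IF (#I > 0) x=CONCAT(CHAR.SUBSTR(x,1,#I-1), CHAR.SUBSTR(x,#I+1)).
    END LOOP.
    EXECUTE.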
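
To check the logic of the loop above without an SPSS session, here is a rough Python model of it. This is not SPSS syntax, only a sketch: `char_index_any` and `strip_unwanted` are hypothetical helper names, `char_index_any` imitates `CHAR.INDEX` with a divisor of 1 as I understand it, and the input strings are made up, since the sample data in the question is only available as an image.

```python
# Hypothetical Python model of the SPSS loop above; not SPSS syntax.
# Same removal set as the SPSS string literal: a-z, plus, comma, space, quote.
UNWANTED = "abcdefghijklmnopqrstuvwxyz+, '"


def char_index_any(haystack, chars):
    """Imitate CHAR.INDEX(haystack, chars, 1): return the 1-based position
    of the first character of haystack that occurs in chars, or 0 if none."""
    for pos, ch in enumerate(haystack, start=1):
        if ch in chars:
            return pos
    return 0


def strip_unwanted(value):
    x = value.lower()                    # COMPUTE x=LOWER(x).
    for _ in range(len(x)):              # LOOP k=1 TO CHAR.LENGTH(x).
        i = char_index_any(x, UNWANTED)  # COMPUTE #I = CHAR.INDEX(...).
        if i > 0:                        # IF (#I > 0) ...
            x = x[:i - 1] + x[i:]        # drop the character at position i
    return x


# Made-up inputs, for illustration only.
print(strip_unwanted("12 Years"))
print(strip_unwanted("n/a"))
```

Note that the second call leaves the slash behind, because `/` is not in the removal list. The same applies to the SPSS version: any character you do not list survives, so extend the list to cover whatever actually appears in your data.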

Upvotes: 2
