Johnzzz

Reputation: 119

Determine the "type of value" of a string in Python

I'm trying to write a function in Python that determines what type of value a string holds. For example:

if the string is 1, 0, True or False, the value is BIT

if the string consists only of digits (0-9), the value is INT

if the string is digits, a dot, then more digits (0-9+.0-9+), the value is FLOAT

if the string is anything else (text, etc.), the value is TEXT

So far I have something like this:

def dataType(string):

 odp=''
 patternBIT=re.compile('[01]')
 patternINT=re.compile('[0-9]+')
 patternFLOAT=re.compile('[0-9]+\.[0-9]+')
 patternTEXT=re.compile('[a-zA-Z0-9]+')
 if patternTEXT.match(string):
     odp= "text"
 if patternFLOAT.match(string):
     odp= "FLOAT"
 if patternINT.match(string):
     odp= "INT"
 if patternBIT.match(string):
     odp= "BIT"

 return odp 

But I'm not very skilled at using regexes in Python. Could you please tell me what I am doing wrong? For example, it doesn't work for 2010-00-10, which should be TEXT but comes out as INT, or for 20.90, which should be FLOAT but comes out as INT.

Upvotes: 6

Views: 10037

Answers (4)

the wolf

Reputation: 35542

Before you go too far down the regex route, have you considered using ast.literal_eval?

Examples:

In [35]: ast.literal_eval('1')
Out[35]: 1

In [36]: type(ast.literal_eval('1'))
Out[36]: int

In [38]: type(ast.literal_eval('1.0'))
Out[38]: float

In [40]: type(ast.literal_eval('[1,2,3]'))
Out[40]: list

May as well use Python to parse it for you!

OK, here is a bigger example:

import ast, re
def dataType(str):
    str=str.strip()
    if len(str) == 0: return 'BLANK'
    try:
        t=ast.literal_eval(str)

    except ValueError:
        return 'TEXT'
    except SyntaxError:
        return 'TEXT'

    else:
        if type(t) in [int, long, float, bool]:
            if t in set((True,False)):
                return 'BIT'
            if type(t) is int or type(t) is long:
                return 'INT'
            if type(t) is float:
                return 'FLOAT'
        else:
            return 'TEXT' 



testSet=['   1  ', ' 0 ', 'True', 'False',   #should all be BIT
         '12', '34l', '-3','03',              #should all be INT
         '1.2', '-20.4', '1e66', '35.','-   .2','-.2e6',      #should all be FLOAT
         '10-1', 'def', '10,2', '[1,2]','35.9.6','35..','.']

for t in testSet:
    print "{:10}:{}".format(t,dataType(t))

Output:

   1      :BIT
 0        :BIT
True      :BIT
False     :BIT
12        :INT
34l       :INT
-3        :INT
03        :INT
1.2       :FLOAT
-20.4     :FLOAT
1e66      :FLOAT
35.       :FLOAT
-   .2    :FLOAT
-.2e6     :FLOAT
10-1      :TEXT
def       :TEXT
10,2      :TEXT
[1,2]     :TEXT
35.9.6    :TEXT
35..      :TEXT
.         :TEXT

And if you positively MUST have a regex solution, which produces the same results, here it is:

def regDataType(str):
    str=str.strip()
    if len(str) == 0: return 'BLANK'

    if re.match(r'^True$|^False$|^0$|^1$', str):
        return 'BIT'
    if re.match(r'([-+]\s*)?\d+[lL]?$', str): 
        return 'INT'
    if re.match(r'([-+]\s*)?[1-9][0-9]*\.?[0-9]*([Ee][+-]?[0-9]+)?$', str): 
        return 'FLOAT'
    if re.match(r'([-+]\s*)?[0-9]*\.?[0-9][0-9]*([Ee][+-]?[0-9]+)?$', str): 
        return 'FLOAT'

    return 'TEXT' 

I cannot recommend the regex over the ast version however; just let Python do the interpretation of what it thinks these data types are rather than interpret them with a regex...
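The code above targets Python 2 (long, the print statement, literals such as 34l and 03). As a rough sketch of how the same idea might look on Python 3, assuming the function name data_type and the category labels are free to choose:

```python
import ast

def data_type(text):
    """Classify a string as BLANK, BIT, INT, FLOAT or TEXT (Python 3 sketch)."""
    text = text.strip()
    if not text:
        return 'BLANK'
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal at all, e.g. 'def' or '2010-00-10'
        return 'TEXT'
    # bool is a subclass of int, so it has to be tested first
    if isinstance(value, bool):
        return 'BIT'
    if isinstance(value, int):
        return 'BIT' if value in (0, 1) else 'INT'
    if isinstance(value, float):
        return 'FLOAT'
    # Lists, tuples, quoted strings and so on are treated as text
    return 'TEXT'
```

Note that this sketch does not reproduce the Python 2 output exactly: on Python 3, '34l' and '03' are no longer valid literals and come out as TEXT, and '1.0' is reported as FLOAT here rather than BIT.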

Upvotes: 25

San4ez

Reputation: 8241

In reply to

For example, it doesn't work for 2010-00-10, which should be TEXT but comes out as INT, or for 20.90, which should be FLOAT but comes out as INT.

>>> import re
>>> patternINT=re.compile('[0-9]+')
>>> print patternINT.match('2010-00-10')
<_sre.SRE_Match object at 0x7fa17bc69850>
>>> patternINT=re.compile('[0-9]+$')
>>> print patternINT.match('2010-00-10')
None
>>> print patternINT.match('2010')
<_sre.SRE_Match object at 0x7fa17bc69850>

Don't forget the $ to anchor the pattern at the end of the string; without it, match() accepts any string that merely starts with digits.

Upvotes: 1

wkl

Reputation: 80001

You said that you used these for input:

  • 2010-00-10 (was int, not text)
  • 20.90 (was int, not float)

Your original code:

def dataType(string):

 odp=''
 patternBIT=re.compile('[01]')
 patternINT=re.compile('[0-9]+')
 patternFLOAT=re.compile('[0-9]+\.[0-9]+')
 patternTEXT=re.compile('[a-zA-Z0-9]+')
 if patternTEXT.match(string):
     odp= "text"
 if patternFLOAT.match(string):
     odp= "FLOAT"
 if patternINT.match(string):
     odp= "INT"
 if patternBIT.match(string):
     odp= "BIT"

 return odp 

The Problem

Your if statements are independent of each other, so every one of them is executed in turn:

if patternTEXT.match(string):
    odp= "text"
if patternFLOAT.match(string):
    odp= "FLOAT"
if patternINT.match(string):
    odp= "INT"
if patternBIT.match(string):
    odp= "BIT"

"2010-00-10" matches your text pattern, but the code then goes on to test the float pattern (which fails, because the leading digits are not followed by a .) and then the int pattern, which succeeds because match() only needs the start of the string to match [0-9]+.

You should use:

if patternTEXT.match(string):
    odp = "text"
elif patternFLOAT.match(string):
    ...

Though for your situation, you probably want to go more specific to less specific, because as you've seen, stuff that is text might also be int (and vice versa). You would need to improve your regular expressions too, as your 'text' pattern only matches for alphanumeric input, but doesn't match against special symbols.
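A minimal sketch of that ordering, assuming Python 3.4+ (for fullmatch) and using the hypothetical names BIT_RE, INT_RE, FLOAT_RE and data_type. Each pattern has to match the whole string, and the checks run from most specific to least specific, with TEXT as the fallback instead of a pattern of its own:

```python
import re

# Patterns taken from the question; True/False added to BIT as the question describes
BIT_RE = re.compile(r'[01]|True|False')
INT_RE = re.compile(r'[0-9]+')
FLOAT_RE = re.compile(r'[0-9]+\.[0-9]+')

def data_type(string):
    # fullmatch() succeeds only if the entire string matches the pattern
    if BIT_RE.fullmatch(string):
        return 'BIT'
    elif INT_RE.fullmatch(string):
        return 'INT'
    elif FLOAT_RE.fullmatch(string):
        return 'FLOAT'
    # Anything that is not a bit, an int or a float is text
    return 'TEXT'
```

With this arrangement data_type('2010-00-10') falls through to 'TEXT' and data_type('20.90') is reported as 'FLOAT'.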

I will offer my own suggestion, though I do like the AST solution more:

def get_type(string):

    if len(string) == 1 and string in ['0', '1']:
        return "BIT"

    # int has to come before float, because any string that parses
    # as an integer also parses as a float.
    try:
        long(string)
        return "INT"
    except ValueError, ve:
        pass

    try:
        float(string)
        return "FLOAT"
    except ValueError, ve:
        pass

    return "TEXT"

Run example:

In [27]: get_type("034")
Out[27]: 'INT'

In [28]: get_type("3-4")
Out[28]: 'TEXT'


In [29]: get_type("20.90")
Out[29]: 'FLOAT'

In [30]: get_type("u09pweur909ru20")
Out[30]: 'TEXT'

Upvotes: 2

Joel Cornett

Reputation: 24788

You could also use json.

import json
converted_val = json.loads('32.45')
type(converted_val)

Outputs

<type 'float'>
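If you wanted to build the whole classification on json, one possible sketch (Python 3; the function name json_type and the labels are my own assumptions) would be:

```python
import json

def json_type(text):
    """Classify a string by what json.loads makes of it (sketch)."""
    try:
        value = json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return 'TEXT'
    # bool is a subclass of int, so test it first
    if isinstance(value, bool):
        return 'BIT'
    if isinstance(value, int):
        return 'INT'
    if isinstance(value, float):
        return 'FLOAT'
    # JSON strings, lists, objects and null are treated as text
    return 'TEXT'
```

Keep in mind that JSON spells booleans in lowercase, so 'true' is recognised while 'True' is rejected and ends up as TEXT; you would have to handle that case yourself.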

EDIT

To answer your question, however:

re.match() anchors only at the beginning of the string, so it also succeeds on a partial match (a matching prefix). Since you evaluate every pattern, the sequence for "2010-00-10" goes like this:

if patternTEXT.match(str_obj): #don't use 'string' as a variable name.

it matches, so odp is set to "text"

then, your script does:

if patternFLOAT.match(str_obj):

no match, odp still equals "text"

if patternINT.match(str_obj):

partial match odp is set to "INT"

Because match returns partial matches, multiple if statements are evaluated and the last one evaluated determines which string is returned in odp.

You can do one of several things:

  1. rearrange the order of your if statements so that the last one to match is the correct one.

  2. use if and elif for the rest of your if statements so that only the first statement to match is evaluated.

  3. check to make sure the match object is matching the entire string:

    ...
    match = patternINT.match(str_obj)
    if match:
        if match.end() == match.endpos:
            #do stuff
    ...
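Option 3 could be wrapped in a small helper. This is only a sketch; matches_whole and INT_RE are hypothetical names, not part of the re module:

```python
import re

INT_RE = re.compile(r'[0-9]+')

def matches_whole(pattern, string):
    """Return True only if pattern matches the entire string."""
    match = pattern.match(string)
    # endpos defaults to len(string), so this checks that the match
    # ran all the way to the end instead of stopping at a prefix
    return match is not None and match.end() == match.endpos
```

Here matches_whole(INT_RE, '2010') is True, while matches_whole(INT_RE, '2010-00-10') is False because the match stops after the first four characters.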
    

Upvotes: 5
