Reputation: 11
I am cleaning data in my pandas dataframe, and i hope there is a better way than mine, to do this. I have in the column["count"] in my pandas dateframe input like his:
~186-205
4 and 4
200
800-1000
550-550[2]
10, 20 or 50
5 (four score and bla bla)
38 or 30
88-80
If somebody could tell me how to add numbers together if they say "x and x" that would be great. However, my main goal is just to get the lowest number from each row and everything else gone.
I succeed almost entirely with my solution:
df['Count'] = df['Count'].str.replace(r"\(.*\)","") #all square brackets with content
df['Count'] = df['Count'].str.replace(r"\[.*\]","") #all square brackets with content
df['Count'] = df['Count'].str.replace("(−).*","") #For one type of hyphens
df['Count'] = df['Count'].str.replace("(-).*","") #for another type of hyphens
df['Count'] = df['Count'].str.replace("(—).*","") #for yet another type of hyphens
df['Count'] = df['Count'].str.replace("(\u2013).*","") #because of different formating for hyphens
df['Count'] = df['Count'].str.replace("(or).*","") #for other alternatives, remove
df['Count'] = df['Count'].str.replace("(,).*","") #everything after commas
df['Count'] = df['Count'].replace(r'\D+', "", regex=True) #everything but numbers
any suggestions to make this more elegant? either in a function, for loop or just something smarter...
Thank you for your time.
Upvotes: 0
Views: 150
Reputation: 1
About your solution for stripping out unneeded symbols from the values, you can use the the built-in re module to collect all numbers in the string and just get the lowest one from them:
import re
min(map(int, re.findall(r'[0-9]+', value)))
To support only python operations you might try the built-in eval function, but if you need to support different operations like 'and' to sum your numbers, you will probably need to write a parser for more customizations. This is a cool article you can check for parsers and what are their parts.
Edit:
To apply it to the whole column extract to function of smallest number and then apply that function.
import re
def get_min_number(value):
return min(map(int, re.findall(r'[0-9]+', value)))
df['Count'].apply(get_min_number)
Upvotes: 0