Reputation: 79
I have a column of strings. The data does not follow any particular format. I need to find all numbers that are separated by commas.
For example,
string = "There are 5 people in the class and their heights 3,9,6,7,4".
I want to extract only the numbers 3,9,6,7,4, without the number 5. Ultimately I want to prepend the word before the first number to each number, i.e. heights3,heights9,heights6,heights7,heights4.
import re
ExampleString = "There are 5 people in the class and their heights are 3,9,6,7,4"
temp = re.findall(r'\s\d+\b', ExampleString)
Here I get the number 5 as well.
Upvotes: 3
Views: 2014
Reputation: 14721
To extract sequences of comma-separated numbers from any string:
import re
# some random text just for testing
string = "azrazer 5,6,4 qsfdqdf 5,,1,2,!,88,9,44,aa,2"
# retrieve every sequence of numbers separated by ','
r = r'(?:\d+,)+\d+'
# retrieve every sequence of numbers separated by ',' except the last number
r2 = r'((?:\d+,)+)(?:\d+)'
# patterns from the other answers so far, for comparison
r3 = r'[\d,]+[,\d]+[^a-z]'
r4 = r'[\d,]+[,\d]'
print('findall r1:', re.findall(r, string))
print('findall r3:', re.findall(r3, string))
print('findall r4:', re.findall(r4, string))
print('-----------------------------------------')
print('findall r2:', re.findall(r2, string))
Output:
findall r1: ['5,6,4', '1,2', '88,9,44'] ---> correct
findall r3: ['5,6,4 ', '5,,1,2,!', ',88,9,44,'] --> wrong
findall r4: ['5,6,4', '5,,1,2,', ',88,9,44,', ',2'] --> wrong
-----------------------------------------
findall r2: ['5,6,', '1,', '88,9,'] --> correct, excludes the last element
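As a quick check against the string from the question, here is a minimal sketch that applies the pattern r above and converts the result to integers. The names SEQUENCE, sequences and numbers are just illustrative, not part of the original answer.

```python
import re

# Pattern r from above: numbers separated by single commas, at least two numbers.
SEQUENCE = re.compile(r'(?:\d+,)+\d+')

text = "There are 5 people in the class and their heights are 3,9,6,7,4"

# Each match is one whole comma-separated run; the standalone 5 is skipped
# because it is not followed by a comma.
sequences = SEQUENCE.findall(text)

# Split every run on ',' and convert the pieces to integers.
numbers = [int(n) for seq in sequences for n in seq.split(',')]

print(sequences)  # ['3,9,6,7,4']
print(numbers)    # [3, 9, 6, 7, 4]
```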
Upvotes: 0
Reputation: 4487
Regex is your friend. You can solve your problem with just one line of code:
[int(n) for n in sum([l.split(',') for l in re.findall(r'[\d,]+[,\d]', test_string)], []) if n.isdigit()]
OK, let's go through it step by step:
The following code produces the list of comma-delimited number strings:
import re
test_string = "There are 5 people in the class and their heights are 3,9,6,7,4 and this 55,66, 77"
list_of_comma = re.findall(r'[\d,]+[,\d]', test_string)
# output: ['3,9,6,7,4', '55,66,', '77']
Split each element of list_of_comma on commas to produce a list of lists of number strings:
list_of_list = [l.split(',') for l in list_of_comma]
# output: [['3', '9', '6', '7', '4'], ['55', '66', ''], ['77']]
I use a trick to flatten the list of lists:
lst = sum(list_of_list, [])
# output: ['3', '9', '6', '7', '4', '55', '66', '', '77']
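A side note on the flattening step: sum(list_of_list, []) builds a new list for every sublist it adds, which can get slow when there are many sublists. A possible alternative, sketched here with the same data, is itertools.chain.from_iterable from the standard library.

```python
from itertools import chain

# Same intermediate result as in the step above.
list_of_list = [['3', '9', '6', '7', '4'], ['55', '66', ''], ['77']]

# chain.from_iterable walks the sublists lazily; list() collects the items.
lst = list(chain.from_iterable(list_of_list))

print(lst)  # ['3', '9', '6', '7', '4', '55', '66', '', '77']
```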
Convert each element to an integer and exclude non-integers (such as the empty string):
int_list = [int(n) for n in lst if n.isdigit()]
# output: [3, 9, 6, 7, 4, 55, 66, 77]
EDIT: if you want to format the numeric list in the required format:
keyword = ',heights'
formatted_res = keyword[1:] + keyword.join(map(str, int_list))
# output: 'heights3,heights9,heights6,heights7,heights4,heights55,heights66,heights77'
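One caveat worth checking: the pattern [\d,]+[,\d] only needs two characters from the set of digits and commas, so it also matches a standalone number with two or more digits (the 77 above, or a count such as 15). The sketch below compares it with a stricter pattern that requires at least one comma between numbers. The stricter pattern and the sample string are assumptions for illustration, not part of the original answer.

```python
import re

loose = r'[\d,]+[,\d]'     # pattern used in this answer
strict = r'\d+(?:,\d+)+'   # assumption: at least two numbers joined by commas

# Hypothetical sample with a two-digit standalone number.
sample = "There are 15 people and their heights are 3,9,6"

print(re.findall(loose, sample))   # ['15', '3,9,6']
print(re.findall(strict, sample))  # ['3,9,6']
```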
Upvotes: 2
Reputation: 4219
As stated in the comments, the 4 isn't followed by a comma (so the first version leaves it out):
>>> t = "There are 5 people in the class and their heights are 3,9,6,7,4"
>>> 'heights'+'heights'.join(re.findall(r'\d+,', t)).rstrip(',')
'heights3,heights9,heights6,heights7'
And if you want to include it you can:
>>> 'heights'+'heights'.join(re.findall(r'\d+,|(?<=,)\d+', t))
'heights3,heights9,heights6,heights7,heights4'
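The snippets above hard-code 'heights'. If the prefix should be taken from the text itself, as the question describes, one option is to capture the word immediately before the sequence. This is a rough sketch: the pattern and the function name prefix_with_preceding_word are hypothetical, and it assumes the word and the sequence are separated only by whitespace.

```python
import re

# Group 1: the word right before the sequence.
# Group 2: two or more numbers separated by single commas.
PATTERN = re.compile(r'(\w+)\s+(\d+(?:,\d+)+)')

def prefix_with_preceding_word(text):
    """Return one 'wordN,wordN,...' string per comma-separated sequence."""
    results = []
    for word, seq in PATTERN.findall(text):
        results.append(','.join(word + n for n in seq.split(',')))
    return results

t = "There are 5 people in the class and their heights 3,9,6,7,4"
print(prefix_with_preceding_word(t))
# ['heights3,heights9,heights6,heights7,heights4']
```

Note that this takes the word literally: for "... their heights are 3,9,6,7,4" the captured word is "are", not "heights", so the hard-coded version may still be the better fit for that phrasing.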
Upvotes: 0
Reputation: 1413
This should work. \d is a digit (a character in the range 0-9), and + means one or more times, so \d+ matches each run of digits:
import re
test_string = "There are 2 apples for 4 persons 4 helasdf 4 23 "
print("The original string : " + test_string)
temp = re.findall(r'\d+', test_string)
res = list(map(int, temp))
print("The numbers list is : " + str(res))
Upvotes: 0