Gert Gottschalk

Reputation: 1716

Using re.findall to extract data from a line

I am trying (and failing so far) to extract a timestamp and two measurement values from a line of text (read from a file).

The lines have the following format:

"2013-08-07-21-25   26.0   1015.81"

I tried (among other things):

>>> re.findall(r"([0-9,-]+)|(\d+.\d+)", "2013-08-07-21-25   26.0   1015.81")
[('2013-08-07-21-25', ''), ('26', ''), ('0', ''), ('1015', ''), ('81', '')]

And only got entertaining (but not desired) results.

I would like to find a solution like this:

date, temp, press = re.findall(r"The_right_stuff", "2013-08-07-21-25   26.0   1015.81")
print date + '\n' + temp + '\n' + press + '\n'
2013-08-07-21-25
26.0
1015.81

Even better would be if the assignment could be wrapped in a test that checks whether the number of matches is correct:

if len(date, temp, press = re.findall(r"The_right_stuff", "2013-08-07-21-25   26.0   1015.81")) == 3:
    print 'Got good data.'
    print date + '\n' + temp + '\n' + press + '\n'

The lines have been transmitted over a serial connection and might have bad (i.e. unexpected) characters interspersed, so separating the fields by string index does not work.

See Prevent datetime.strptime from exit in case of format mismatch.


Edit @hjpotter92

As I mentioned, some lines were corrupted during the serial transmission. The split solution fails on lines like the ones below.

2013-08-1q-07-15   23.8   1014.92
2013-08-11-07-20   23.8   101$96
6113-p8-11-0-25   23.8   1015*04

Assigning the list of measurements into a numpy array failed.

>>> p_arr= np.asfarray(p_list, dtype='float')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 105, in asfarray
    return asarray(a, dtype=dtype)
  File "/usr/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: invalid literal for float(): 101$96

I put the set of data here.

Upvotes: 1

Views: 163

Answers (2)

vks

Reputation: 67968

print [i+j for i,j in re.findall(r"\b(\d+(?!\.)(?:[,-]\d+)*)\b|\b(\d+\.\d+)\b", "2013-08-07-21-25   26.0   1015.81")]

You have to prevent the first group from consuming text that is meant for the second group.

Output: ['2013-08-07-21-25', '26.0', '1015.81']
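
If the number of fields also needs to be checked, one possible variation is to anchor a single pattern to the whole line and unpack its three groups. This is only a sketch: the name `parse_line` and the exact field layout (a five-part date followed by two decimal numbers) are assumptions based on the sample line in the question, not something tested against the full data set.

```python
import re

# Assumed layout: YYYY-MM-DD-HH-MM, then two decimal numbers,
# separated by runs of whitespace. Anything else is rejected.
LINE_RE = re.compile(r"^\s*(\d{4}(?:-\d{2}){4})\s+(\d+\.\d+)\s+(\d+\.\d+)\s*$")

def parse_line(line):
    """Return (date, temp, press) as strings, or None if the line is malformed."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.groups()

fields = parse_line("2013-08-07-21-25   26.0   1015.81")
if fields is not None:
    date, temp, press = fields
    print(date)
    print(temp)
    print(press)
```

Because the pattern must match the entire line, a line with a stray character in any field yields None instead of a partial match, so no separate length check is required.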

Upvotes: 1

hjpotter92

Reputation: 80639

Use a re.split since the data is separated by horizontal-space characters:

date, temp, press = re.split(r'\s+', "2013-08-07-21-25   26.0   1015.81")

>>> import re
>>> date, temp, press = re.split(r'\s+', "2013-08-07-21-25   26.0   1015.81")
>>> print date
2013-08-07-21-25
>>> print temp
26.0
>>> print press
1015.81
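
For the corrupted lines mentioned in the edit, the split can be followed by a validation step. The following is a rough sketch rather than a tested solution: `split_line` is a hypothetical helper name, and the date format string `"%Y-%m-%d-%H-%M"` is my assumption from the sample line.

```python
from datetime import datetime

def split_line(line):
    """Return (date, temp, press) or None if the line looks corrupted.

    date stays a string; temp and press are converted to float.
    """
    parts = line.split()              # split on any run of whitespace
    if len(parts) != 3:               # wrong number of fields
        return None
    date, temp, press = parts
    try:
        # Assumed timestamp layout; raises ValueError on e.g. '2013-08-1q-07-15'
        datetime.strptime(date, "%Y-%m-%d-%H-%M")
        # float() raises ValueError on e.g. '101$96' or '1015*04'
        return date, float(temp), float(press)
    except ValueError:
        return None

good = []
for line in ["2013-08-11-07-15   23.8   1014.92",
             "2013-08-11-07-20   23.8   101$96"]:
    result = split_line(line)
    if result is not None:
        good.append(result)
print(len(good))
```

Only lines that survive both checks end up in `good`, so the pressure column can then be passed to numpy without hitting the ValueError.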

Upvotes: 2
