Reputation: 484
I have a DataFrame with two columns the second one has following format :
1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }
2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }
3 {{continuity 2.3250e-01} {x-velocity 1.6896e-03} {y-velocity 1.2536e-02} {z-velocity 9.8980e-03} {energy 3.4032e-06} }
4 {{continuity 8.0243e-02} {x-velocity 2.2180e-03} {y-velocity 1.3189e-02} {z-velocity 1.0225e-02} {energy 4.6336e-07} }
5 {{continuity 7.0923e-02} {x-velocity 2.2674e-03} {y-velocity 1.2308e-02}
And I'm trying to use regex to split it into columns, by getting the first number, then getting all the numbers in between the brackets "{}" and give them the following names:
names=['iter', 'x', 'x-vel', 'y-vel', 'z-vel', 'energy']
However I just don't seem to make the regular expression work, here's what i'm doing in a simple example:
Input
>>> a = "1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }"
>>> re.findall("(\d*) {*\{\D*(.*?)\}", a)
Result
>>> [('1', '1.0000e+00'), ('', '0.0000e+00'), ('', '4.4010e-02'), ('', '9.5681e-04'), ('', '1.1549e-07')]
As you can see my regex keeps looking for a number for every {} occurrence, but I don't want that to happen, how to do so?
Expected Behavior
>>> [('1'), ('1.0000e+00'), ('0.0000e+00'), ('4.4010e-02'), ('9.5681e-04'), ('1.1549e-07')]
When my regular expression works, I'm trying to assign all the columns with a line that would look something like this:
df[names] = df.first.str.extract(r'(\d*) {*\{\D*(.*?)\}', expand=True)
I'm really new to dataframes, is this the correct approach for this problem?
Any help would be much appreciated, thanks in advance!
Upvotes: 0
Views: 76
Reputation: 11651
First, let's make a series from some data in the question.
import pandas as pd
data = pd.Series('''\
1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }
2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }
3 {{continuity 2.3250e-01} {x-velocity 1.6896e-03} {y-velocity 1.2536e-02} {z-velocity 9.8980e-03} {energy 3.4032e-06} }
4 {{continuity 8.0243e-02} {x-velocity 2.2180e-03} {y-velocity 1.3189e-02} {z-velocity 1.0225e-02} {energy 4.6336e-07} }'''
.split('\n'))
print(data)
0 1 {{continuity 1.0000e+00} {x-velocity 0.0000e...
1 2 {{continuity 1.0000e+00} {x-velocity 7.8788e...
2 3 {{continuity 2.3250e-01} {x-velocity 1.6896e...
3 4 {{continuity 8.0243e-02} {x-velocity 2.2180e...
dtype: object
0
The first option is a simple regex to find all numbers in order. Use extractall
to find every match in each string. This may be good enough. You still have to name the columns, which isn't hard. This will have a MultiIndex
(which is a little more advanced), since each match could have multiple groups (but this regex only has one group), hence the need to .unstack()
it.
print(data.str.extractall(r'(\d[\d.e+-]*)').unstack())
match 0 1 2 3 4 5
0 1 1.0000e+00 0.0000e+00 4.4010e-02 9.5681e-04 1.1549e-07
1 2 1.0000e+00 7.8788e-04 1.2617e+01 9.0445e-04 4.5605e-06
2 3 2.3250e-01 1.6896e-03 1.2536e-02 9.8980e-03 3.4032e-06
3 4 8.0243e-02 2.2180e-03 1.3189e-02 1.0225e-02 4.6336e-07
Alternatively, you can use a named groups regex. This is a fairly straightforward template to build from one of the strings. This will put the names from the regex into the columns. But the names must be valid Python identifiers. (x_vel
, not x-vel
). But this is probably what you want anyway, since it lets you access the columns as attrs (like df.x_vel
instead of df['x-vel']
). This (?P<foo>...)
named group syntax is all explained in the re
module docs.
print(
data.str.extract(r'(?P<iter>\d+) {{continuity (?P<x>[^}]+)} {x-velocity (?P<x_vel>[^}]+)} {y-velocity (?P<y_vel>[^}]+)} {z-velocity (?P<z_vel>[^}]+)} {energy (?P<energy>[^}]+)} }',
expand=False))
iter x x_vel y_vel z_vel energy
0 1 1.0000e+00 0.0000e+00 4.4010e-02 9.5681e-04 1.1549e-07
1 2 1.0000e+00 7.8788e-04 1.2617e+01 9.0445e-04 4.5605e-06
2 3 2.3250e-01 1.6896e-03 1.2536e-02 9.8980e-03 3.4032e-06
3 4 8.0243e-02 2.2180e-03 1.3189e-02 1.0225e-02 4.6336e-07
Note that we're using extract
instead of extractall
here because there are multiple groups in the regex itself.
Upvotes: 2