Spliting a DataFrame column to columns with regex?

Question

I have a DataFrame with two columns the second one has following format :

1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }
2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }
3 {{continuity 2.3250e-01} {x-velocity 1.6896e-03} {y-velocity 1.2536e-02} {z-velocity 9.8980e-03} {energy 3.4032e-06} }
4 {{continuity 8.0243e-02} {x-velocity 2.2180e-03} {y-velocity 1.3189e-02} {z-velocity 1.0225e-02} {energy 4.6336e-07} }
5 {{continuity 7.0923e-02} {x-velocity 2.2674e-03} {y-velocity 1.2308e-02}

And I'm trying to use regex to split it into columns, by getting the first number, then getting all the numbers in between the brackets "{}" and give them the following names:

names=['iter', 'x', 'x-vel', 'y-vel', 'z-vel', 'energy']

However I just don't seem to make the regular expression work, here's what i'm doing in a simple example:

Input

>>> a = "1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }"
>>> re.findall("(\d*) {*\{\D*(.*?)\}", a)

Result

 >>> [('1', '1.0000e+00'), ('', '0.0000e+00'), ('', '4.4010e-02'), ('', '9.5681e-04'), ('', '1.1549e-07')]

As you can see my regex keeps looking for a number for every {} occurrence, but I don't want that to happen, how to do so?

Expected Behavior

 >>> [('1'), ('1.0000e+00'), ('0.0000e+00'), ('4.4010e-02'), ('9.5681e-04'), ('1.1549e-07')]

When my regular expression works, I'm trying to assign all the columns with a line that would look something like this:

df[names] = df.first.str.extract(r'(\d*) {*\{\D*(.*?)\}', expand=True)

I'm really new to dataframes, is this the correct approach for this problem?

Any help would be much appreciated, thanks in advance!

gilch · Accepted Answer

First, let's make a series from some data in the question.

import pandas as pd    

data = pd.Series('''\
1 {{continuity 1.0000e+00} {x-velocity 0.0000e+00} {y-velocity 4.4010e-02} {z-velocity 9.5681e-04} {energy 1.1549e-07} }
2 {{continuity 1.0000e+00} {x-velocity 7.8788e-04} {y-velocity 1.2617e+01} {z-velocity 9.0445e-04} {energy 4.5605e-06} }
3 {{continuity 2.3250e-01} {x-velocity 1.6896e-03} {y-velocity 1.2536e-02} {z-velocity 9.8980e-03} {energy 3.4032e-06} }
4 {{continuity 8.0243e-02} {x-velocity 2.2180e-03} {y-velocity 1.3189e-02} {z-velocity 1.0225e-02} {energy 4.6336e-07} }'''
          .split('
'))
print(data)

0    1 {{continuity 1.0000e+00} {x-velocity 0.0000e...
1    2 {{continuity 1.0000e+00} {x-velocity 7.8788e...
2    3 {{continuity 2.3250e-01} {x-velocity 1.6896e...
3    4 {{continuity 8.0243e-02} {x-velocity 2.2180e...
dtype: object
       0

The first option is a simple regex to find all numbers in order. Use extractall to find every match in each string. This may be good enough. You still have to name the columns, which isn't hard. This will have a MultiIndex (which is a little more advanced), since each match could have multiple groups (but this regex only has one group), hence the need to .unstack() it.

print(data.str.extractall(r'(\d[\d.e+-]*)').unstack())

match  0           1           2           3           4           5
0      1  1.0000e+00  0.0000e+00  4.4010e-02  9.5681e-04  1.1549e-07
1      2  1.0000e+00  7.8788e-04  1.2617e+01  9.0445e-04  4.5605e-06
2      3  2.3250e-01  1.6896e-03  1.2536e-02  9.8980e-03  3.4032e-06
3      4  8.0243e-02  2.2180e-03  1.3189e-02  1.0225e-02  4.6336e-07

Alternatively, you can use a named groups regex. This is a fairly straightforward template to build from one of the strings. This will put the names from the regex into the columns. But the names must be valid Python identifiers. (x_vel, not x-vel). But this is probably what you want anyway, since it lets you access the columns as attrs (like df.x_vel instead of df['x-vel']). This (?P...) named group syntax is all explained in the re module docs.

print(
    data.str.extract(r'(?P\d+) {{continuity (?P[^}]+)} {x-velocity (?P[^}]+)} {y-velocity (?P[^}]+)} {z-velocity (?P[^}]+)} {energy (?P[^}]+)} }',
                     expand=False))

  iter           x       x_vel       y_vel       z_vel      energy
0    1  1.0000e+00  0.0000e+00  4.4010e-02  9.5681e-04  1.1549e-07
1    2  1.0000e+00  7.8788e-04  1.2617e+01  9.0445e-04  4.5605e-06
2    3  2.3250e-01  1.6896e-03  1.2536e-02  9.8980e-03  3.4032e-06
3    4  8.0243e-02  2.2180e-03  1.3189e-02  1.0225e-02  4.6336e-07

Note that we're using extract instead of extractall here because there are multiple groups in the regex itself.

Spliting a DataFrame column to columns with regex?

Answers (1)

Related Questions