Reputation: 1679
I have a df that has 1 column where each row contains a string. It looks like this:
df
data
in 9.14 out 9.66 type 0.0
in 9.67 out 9.69 type 0.0
in 9.70 out 10.66 type 0.0
in 10.67 out 11.34 type 2.0
in 11.35 out 12.11 type 2.0
I want to split the text of this column into multiple columns. I want to use the words [in, out, type] as column headers, and the values following each word as the row values. The result will have 3 columns labeled in, out and type and will look like this:
df
in out type
9.14 9.66 0.0
9.67 9.69 0.0
9.70 10.66 0.0
10.67 11.34 2.0
11.35 12.11 2.0
Thanks!
Upvotes: 1
Views: 1019
Reputation: 25259
If your data is split evenly into name and value tokens by whitespace, as in your sample, you can use split and the str accessor with a stride to construct the desired output:
df1 = df['data'].str.split()
df_out = pd.DataFrame(df1.str[1::2].tolist(), columns=df1[0][0::2])
Out[1097]:
in out type
0 9.14 9.66 0.0
1 9.67 9.69 0.0
2 9.70 10.66 0.0
3 10.67 11.34 2.0
4 11.35 12.11 2.0
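For reference, here is a self-contained sketch of the same split-and-stride idea, rebuilt from the sample in the question. The names tokens, names and values are illustrative only, and the final astype(float) is an added assumption that every value is numeric; without it the columns hold strings.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"data": [
    "in 9.14 out 9.66 type 0.0",
    "in 9.67 out 9.69 type 0.0",
    "in 9.70 out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
]})

# Each row becomes a list such as ['in', '9.14', 'out', '9.66', 'type', '0.0'].
tokens = df["data"].str.split()

# Even positions hold the names; take them from the first row only
# (this assumes every row lists the same names in the same order).
names = tokens.iloc[0][0::2]

# Odd positions hold the values for each row.
values = tokens.str[1::2].tolist()

# Assumption: all values are numeric, so convert the strings to floats.
df_out = pd.DataFrame(values, columns=names, index=df.index).astype(float)
print(df_out)
```

Because the column names come from the first row alone, a row with a different token order or a missing token would be silently mislabeled or raise an error, so this approach fits best when the format is known to be uniform.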
Upvotes: 0
Reputation: 51155
If you know in advance what the words will be, and can also guarantee that there won't be any bad data, this is a simple str.extract problem: you can construct an explicit regular expression that captures each value, using named groups to create the DataFrame in a single pass. The regular expression for your sample data is given in Option 2.
However, for the sake of demonstration, it is better to assume that you might have bad data, and that you might not know in advance what your column names are. In that case, you can use str.extractall and some unstacking.
Option 1
extractall + set_index + unstack
generic_regex = r'([a-zA-Z]+)[^0-9]+([0-9\.]+)'
df['data'].str.extractall(generic_regex).set_index(0, append=True)[1].unstack([0, 1])
0 in out type
match 0 1 2
0 9.14 9.66 0.0
1 9.67 9.69 0.0
2 9.70 10.66 0.0
3 10.67 11.34 2.0
4 11.35 12.11 2.0
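The result above carries two column levels (the captured name and the match number). A possible way to get flat in/out/type columns is sketched below. It uses the same pattern but with named groups; the group names key and value, the variable names long_form and wide, and the float conversion are assumptions added for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"data": [
    "in 9.14 out 9.66 type 0.0",
    "in 9.67 out 9.69 type 0.0",
    "in 9.70 out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
]})

# Same idea as generic_regex, with named groups so the extracted
# columns are called 'key' and 'value' instead of 0 and 1.
generic_regex = r"(?P<key>[a-zA-Z]+)[^0-9]+(?P<value>[0-9.]+)"

# One row per (original row, match number) pair.
long_form = df["data"].str.extractall(generic_regex)

# Move 'key' into the index, then pivot 'key' and 'match' into columns.
wide = long_form.set_index("key", append=True)["value"].unstack(["key", "match"])

# Drop the match-number level so only the names remain as column labels.
wide.columns = wide.columns.droplevel("match")

# Assumption: all captured values are numeric.
wide = wide.astype(float)
print(wide)
```

Dropping the match level is only safe when each name appears at most once per row and always in the same position; otherwise the flattened labels would contain duplicates.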
Option 2
Define an explicit regex and use extract
rgx = r'in\s+(?P<in>[^\s]+)\s+out\s+(?P<out>[^\s]+)\s+type\s+(?P<type>[^\s]+)'
df['data'].str.extract(rgx)
in out type
0 9.14 9.66 0.0
1 9.67 9.69 0.0
2 9.70 10.66 0.0
3 10.67 11.34 2.0
4 11.35 12.11 2.0
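Note that str.extract returns the captured values as strings, and a row that does not match the pattern comes back as all NaN. A minimal sketch of converting the result to numbers follows; the extra "not a valid row" entry is a hypothetical bad record added only to show that behavior, and parsed is an illustrative name.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical malformed row.
df = pd.DataFrame({"data": [
    "in 9.14 out 9.66 type 0.0",
    "in 9.67 out 9.69 type 0.0",
    "in 9.70 out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
    "not a valid row",  # hypothetical bad data, not in the original question
]})

# Explicit pattern with one named group per output column.
rgx = r"in\s+(?P<in>\S+)\s+out\s+(?P<out>\S+)\s+type\s+(?P<type>\S+)"

# Rows that do not match yield NaN in every column.
parsed = df["data"].str.extract(rgx)

# Convert each column to numbers; anything unparseable becomes NaN.
parsed = parsed.apply(pd.to_numeric, errors="coerce")
print(parsed)
```

If bad rows should be discarded rather than kept as NaN, calling parsed.dropna() afterwards is one option.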
Upvotes: 1