Kenan
Kenan

Reputation: 14114

Pandas extract numbers from column into new columns

I currently have this df where the rect column is all strings. I need to extract the x, y, w and h from it into separate columns. The dataset is very large so I need an efficient approach

df['rect'].head()
0    <Rect (120,168),260 by 120>
1    <Rect (120,168),260 by 120>
2    <Rect (120,168),260 by 120>
3    <Rect (120,168),260 by 120>
4    <Rect (120,168),260 by 120>

So far this solution works however it's very messy as you can see

df[['x', 'y', 'w', 'h']] = df['rect'].str.replace('<Rect \(', '').str.replace('\),', ',').str.replace(' by ', ',').str.replace('>', '').str.split(',', n=3, expand=True)

Is there a better way? Possibly a regex approach

Upvotes: 5

Views: 3442

Answers (5)

piRSquared
piRSquared

Reputation: 294508

Inline

Produce a copy

df.assign(**dict(zip('xywh', df.rect.str.findall('\d+').str)))

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

Or just reassign to df

df = df.assign(**dict(zip('xywh', df.rect.str.findall('\d+').str)))

df

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

Inplace

Modify existing df

df[[*'xywh']] = pd.DataFrame(df.rect.str.findall('\d+').tolist())

df

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

Upvotes: 5

BENY
BENY

Reputation: 323366

Using extractall

df[['x', 'y', 'w', 'h']] = df['rect'].str.extractall('(\d+)').unstack().loc[:,0]
Out[267]: 
match    0    1    2    3
0      120  168  260  120
1      120  168  260  120
2      120  168  260  120
3      120  168  260  120
4      120  168  260  120

Upvotes: 6

Lev Zakharov
Lev Zakharov

Reputation: 2427

Use str.extract, which extracts groups from regex into columns:

df['rect'].str.extract(r'\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)', expand=True)

Result:

     x    y    w    h
0  120  168  260  120
1  120  168  260  120
2  120  168  260  120
3  120  168  260  120
4  120  168  260  120

Upvotes: 3

anthonyserious
anthonyserious

Reputation: 1946

This is one of those cases where it makes sense to "optimize" the data itself instead of trying to morph it into what a consumer wants. It's much easier to change clean data into a specialized format than it is to change a specialized format into something portable.

That said, if you really have to parse this, you can do something like

>>> import re
>>> re.findall(r'\d+', '<Rect (120,168),260 by 120>')
['120', '168', '260', '120']
>>>

Upvotes: 0

akuiper
akuiper

Reputation: 215107

If the strings follow a specific format <Rect \((\d+),(\d+)\),(\d+) by (\d+)>, you can use this regular expression with str.extract method:

df[['x','y','w','h']] = df.rect.str.extract(r'<Rect \((\d+),(\d+)\),(\d+) by (\d+)>')

df
#                          rect    x    y    w    h
#0  <Rect (120,168),260 by 120>  120  168  260  120
#1  <Rect (120,168),260 by 120>  120  168  260  120
#2  <Rect (120,168),260 by 120>  120  168  260  120
#3  <Rect (120,168),260 by 120>  120  168  260  120
#4  <Rect (120,168),260 by 120>  120  168  260  120

Upvotes: 3

Related Questions