johnnyb

Reputation: 1825

Merging data in a single pandas column based on criteria

I have a pandas DataFrame with a single column that contains data like this:

temp_col
matt
joes\crabshack\one23
fail
joe:123,\
12345678,\
92313456,\
12341239123432,\
1321143
john
jacob
joe(x):543,\
9876544123,\
1234

How can I take all of the rows that end with ",\", together with the following row that doesn't, and merge them into a single row?

Expected output:

temp_col
matt
joes\crabshack\one23
fail
joe:1231234567892313456123412391234321321143
john
jacob
joe(x):54398765441231234

Upvotes: 0

Views: 62

Answers (3)

xmduhan

Reputation: 1025

I think it's better style to process this before (or while) loading the data into a pandas DataFrame. But if you insist on doing it in pandas, try this:

from pandas import DataFrame
df = DataFrame({'x': [
'matt',
r'joes\crabshack\one23',
'fail',
'joe:123,\\',
'12345678,\\',
'92313456,\\',
'12341239123432,\\',
'1321143',
'john',
'jacob',
'joe(x):543,\\',
'9876544123,\\',
'1234']})
# A row starts a new group unless the previous row ended with ',\'
df['g'] = (1 - df['x'].str.endswith(',\\').astype(int).shift().fillna(0)).cumsum()
# Strip only the trailing ',\' so backslashes inside a value are kept
df['x'] = df['x'].str.replace(r',\\$', '', regex=True)
# Concatenate the rows of each group
df = df.groupby('g')['x'].agg(''.join)
df
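
As a rough sketch of the "process it before loading" route: assuming the raw lines are available as a plain Python list, you could merge the continuation lines first and only then build the DataFrame. The raw_lines list and the merge_continued_lines helper below are made up for illustration and are not part of pandas.

```python
import pandas as pd

def merge_continued_lines(lines, marker=",\\"):
    """Join each run of lines ending in `marker` with the line that follows it."""
    merged, buffer = [], ""
    for line in lines:
        if line.endswith(marker):
            # Continuation line: drop the marker and keep accumulating
            buffer += line[:-len(marker)]
        else:
            # Last line of a run (or a standalone line): emit it
            merged.append(buffer + line)
            buffer = ""
    if buffer:
        # Input ended on a continuation line; keep what was collected
        merged.append(buffer)
    return merged

# Hypothetical raw input, copied from the question
raw_lines = [
    "matt", r"joes\crabshack\one23", "fail",
    "joe:123,\\", "12345678,\\", "92313456,\\", "12341239123432,\\", "1321143",
    "john", "jacob",
    "joe(x):543,\\", "9876544123,\\", "1234",
]

df = pd.DataFrame({"temp_col": merge_continued_lines(raw_lines)})
```

Only a trailing ",\" is removed here, so the backslashes inside joes\crabshack\one23 are left alone.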

Upvotes: 0

Back2Basics

Reputation: 7806

If the data is only wrapped for display (I'm assuming the '\' you see is a line-continuation mark, so everything is really part of the same cell), then each value is just a comma-delimited number:

df.columnnamehere.str.split(',').str.join(sep='')

Or, if the '\' is an actual character in the data and not just formatting:

df.columnnamehere.str.split(r',\\').str.join(sep='')
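
A small sketch of the second case, assuming each wrapped value really does sit in one cell with a literal ",\" between the pieces (the sample frame is invented for illustration). Note that str.split treats a multi-character pattern as a regular expression by default, so the separator is written as r",\\" here.

```python
import pandas as pd

# Hypothetical frame: the wrapped value lives in a single cell
df = pd.DataFrame({"temp_col": ["matt", "joe:123,\\12345678,\\1321143", "john"]})

# Split on a comma followed by a literal backslash, then glue the pieces together
joined = df["temp_col"].str.split(r",\\").str.join("")
```

This does not help if the pieces are spread over several rows, which is what the other answers address.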

Upvotes: 0

akuiper

Reputation: 215067

You can try this:

(df.temp_col.groupby((~df.temp_col.str.contains(r",\\$")).shift().fillna(True).cumsum())
 .apply(lambda x: "".join(x.str.rstrip(r",\\"))))

#temp_col
#1                                            matt
#2                            joes\crabshack\one23
#3                                            fail
#4    joe:1231234567892313456123412391234321321143
#5                                            john
#6                                           jacob
#7                        joe(x):54398765441231234
#Name: temp_col, dtype: object

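Breakdown:

1) Create a group variable where a new group starts whenever the previous element doesn't end with ,\ (the shift() moves each flag down one row, so continuation rows stay in the group of the row before them):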

g = (~df.temp_col.str.contains(r",\\$")).shift().fillna(True).cumsum()
g
#0     1
#1     2
#2     3
#3     4
#4     4
#5     4
#6     4
#7     4
#8     5
#9     6
#10    7
#11    7
#12    7
#Name: temp_col, dtype: int64

2) Define a join function that strips the trailing comma and backslash from each element and concatenates the results:

join_clean = lambda x: "".join(x.str.rstrip(r",\\"))

3) Apply the join function to each group to concatenate each run of rows ending with ,\ together with the row that closes it:

df.temp_col.groupby(g).apply(join_clean)

#temp_col
#1                                            matt
#2                            joes\crabshack\one23
#3                                            fail
#4    joe:1231234567892313456123412391234321321143
#5                                            john
#6                                           jacob
#7                        joe(x):54398765441231234
#Name: temp_col, dtype: object
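
For anyone who wants to run the steps above end to end, here is a self-contained sketch. The sample frame is rebuilt by hand from the question, and two things are swapped in as assumptions rather than taken from the answer: shift(fill_value=True) instead of shift().fillna(True), so the flags stay boolean, and str.replace with an anchored pattern instead of rstrip, so only one trailing ",\" is removed.

```python
import pandas as pd

df = pd.DataFrame({"temp_col": [
    "matt", r"joes\crabshack\one23", "fail",
    "joe:123,\\", "12345678,\\", "92313456,\\", "12341239123432,\\", "1321143",
    "john", "jacob",
    "joe(x):543,\\", "9876544123,\\", "1234",
]})

s = df["temp_col"]

# 1) True where a row closes a run, i.e. does not end with ',\'
closes_run = ~s.str.endswith(",\\")

# A new group starts after each closing row; the first row always starts one
g = closes_run.shift(fill_value=True).astype(int).cumsum()

# 2) Remove the trailing ',\' marker from continuation rows
cleaned = s.str.replace(r",\\$", "", regex=True)

# 3) Concatenate the rows of each group
result = cleaned.groupby(g).agg("".join)
```

Here result is a Series indexed by group number, with one merged string per group.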

Upvotes: 1
