Arvinth
Arvinth

Reputation: 70

Issue while splitting pandas dataframe column into n columns

I have a dataframe with below column Title. Each sentence is repeating three times. I want to split into three columns evenly.

    Title
0  [1.3] Avg ticket size - merchant vs industry and benchamark (processed data).[1.3] Avg ticket size - merchant vs industry and benchamark (processed data).[1.3] Avg ticket size - merchant vs industry and benchamark (processed data)
1  [10.1] Overall portfolio and benchmarks for the bank over the last 5 quarters.[10.1] Overall portfolio and benchmarks for the bank over the last 5 quarters.[10.1] Overall portfolio and benchmarks for the bank over the last 5 quarters
2  [10.10] Decline Reasons (Quarter-wise)- E-Com vs. POS and comparison with benchmark.[10.10] Decline Reasons (Quarter-wise)- E-Com vs. POS and comparison with benchmark.[10.10] Decline Reasons (Quarter-wise)- E-Com vs. POS and comparison with benchmark
3  [10.2] QoQ Pct of Transaction Method Spend, Transaction Method Active Cards.[10.2] QoQ Pct of Transaction Method Spend, Transaction Method Active Cards.[10.2] QoQ Pct of Transaction Method Spend, Transaction Method Active Cards
4  [10.3] Pct of Transaction Method Spend, Transaction Method Active Cards by Product Type.[10.3] Pct of Transaction Method Spend, Transaction Method Active Cards by Product Type.[10.3] Pct of Transaction Method Spend, Transaction Method Active Cards by Product Type
5  [10.4] QoQ of average ticket size, transactions per card, average spend per card.[10.4] QoQ of average ticket size, transactions per card, average spend per card.[10.4] QoQ of average ticket size, transactions per card, average spend per card

I tried below code but did not work as expected.

import textwrap
pd.DataFrame([textwrap.wrap(el, len(el)//3) for el in df['Title']]).add_prefix('Title') 

I want to split by finding the length of string and then split based on len(string)/3. Because sometimes there will be . in middle of sentence

Please help

Upvotes: 0

Views: 73

Answers (2)

yatu
yatu

Reputation: 88226

Looks like you can use str.split here:

df.Title.str.split(r'\[\d+\.\d+\]\s', expand=True)

0                                                  1  \
0    Avg ticket size - merchant vs industry and ben...   
1    Overall portfolio and benchmarks for the bank ...   
2    Decline Reasons (Quarter-wise)- E-Com vs. POS ...   
3    QoQ Pct of Transaction Method Spend, Transacti...   
4    Pct of Transaction Method Spend, Transaction M...   
5    QoQ of average ticket size, transactions per c...   

                                                   2  \
0  Avg ticket size - merchant vs industry and ben...   
1  Overall portfolio and benchmarks for the bank ...   
2  Decline Reasons (Quarter-wise)- E-Com vs. POS ...   
3  QoQ Pct of Transaction Method Spend, Transacti...   
4  Pct of Transaction Method Spend, Transaction M...   
5  QoQ of average ticket size, transactions per c...   

                                                   3  
0  Avg ticket size - merchant vs industry and ben...  
1  Overall portfolio and benchmarks for the bank ...  
2  Decline Reasons (Quarter-wise)- E-Com vs. POS ...  
3  QoQ Pct of Transaction Method Spend, Transacti...  
4  Pct of Transaction Method Spend, Transaction M...  
5  QoQ of average ticket size, transactions per c...  

Update

If you want to find the length of the strings in each row and split into 3, one approach could be:

n = 3
lens = df.Title.str.len()//n
l = [[i[(c-1)*sl:c*sl] for i, sl in zip(df.Title, lens)] for c in range(1, n+1)]
pd.DataFrame.from_records(l).T

                                0  \
0  [1.3] Avg ticket size - merchant vs industry a...   
1  [10.1] Overall portfolio and benchmarks for th...   
2  [10.10] Decline Reasons (Quarter-wise)- E-Com ...   
3  [10.2] QoQ Pct of Transaction Method Spend, Tr...   
4  [10.3] Pct of Transaction Method Spend, Transa...   
5  [10.4] QoQ of average ticket size, transaction...   

                                                   1  \
0  .[1.3] Avg ticket size - merchant vs industry ...   
1  .[10.1] Overall portfolio and benchmarks for t...   
2  .[10.10] Decline Reasons (Quarter-wise)- E-Com...   
3  .[10.2] QoQ Pct of Transaction Method Spend, T...   
4  .[10.3] Pct of Transaction Method Spend, Trans...   
5  .[10.4] QoQ of average ticket size, transactio...   

                                                   2  
0  ).[1.3] Avg ticket size - merchant vs industry...  
1  s.[10.1] Overall portfolio and benchmarks for ...  
2  k.[10.10] Decline Reasons (Quarter-wise)- E-Co...  
3  s.[10.2] QoQ Pct of Transaction Method Spend, ...  
4  e.[10.3] Pct of Transaction Method Spend, Tran...  
5  d.[10.4] QoQ of average ticket size, transacti...  

Upvotes: 3

Valdi_Bo
Valdi_Bo

Reputation: 30971

Instead of divide the length by 3, you can use regex to split the string.

Note that such a split pattern can be defined as:

r'\.(?=\[)'

i.e.:

  • \. - a (literal) dot,
  • (?=\[\d+\.\d+]) - followed by [, a sequence of digits, another dot, another sequence of digits and ].

Note that the second part is a positive lookahead, so it will not be a part of match and this [nn,mm] fragment will be the starting part of the next string (result of split).

You didn't write about deleting this [nn,mm] fragment, so it should remain in the result of the split and this is the downside of the other answer.

So the code generating df concatenated with 3 new columns can be:

df2 = pd.concat([df, df.Title.str.split(
    r'\.(?=\[\d+\.\d+])', expand=True)], axis=1)

Upvotes: 1

Related Questions