Reputation: 70
I have a dataframe with below column Title
. Each sentence is repeating three times. I want to split into three columns evenly.
Title
0 [1.3] Avg ticket size - merchant vs industry and benchamark (processed data).[1.3] Avg ticket size - merchant vs industry and benchamark (processed data).[1.3] Avg ticket size - merchant vs industry and benchamark (processed data)
1 [10.1] Overall portfolio and benchmarks for the bank over the last 5 quarters.[10.1] Overall portfolio and benchmarks for the bank over the last 5 quarters.[10.1] Overall portfolio and benchmarks for the bank over the last 5 quarters
2 [10.10] Decline Reasons (Quarter-wise)- E-Com vs. POS and comparison with benchmark.[10.10] Decline Reasons (Quarter-wise)- E-Com vs. POS and comparison with benchmark.[10.10] Decline Reasons (Quarter-wise)- E-Com vs. POS and comparison with benchmark
3 [10.2] QoQ Pct of Transaction Method Spend, Transaction Method Active Cards.[10.2] QoQ Pct of Transaction Method Spend, Transaction Method Active Cards.[10.2] QoQ Pct of Transaction Method Spend, Transaction Method Active Cards
4 [10.3] Pct of Transaction Method Spend, Transaction Method Active Cards by Product Type.[10.3] Pct of Transaction Method Spend, Transaction Method Active Cards by Product Type.[10.3] Pct of Transaction Method Spend, Transaction Method Active Cards by Product Type
5 [10.4] QoQ of average ticket size, transactions per card, average spend per card.[10.4] QoQ of average ticket size, transactions per card, average spend per card.[10.4] QoQ of average ticket size, transactions per card, average spend per card
I tried below code but did not work as expected.
import textwrap
pd.DataFrame([textwrap.wrap(el, len(el)//3) for el in df['Title']]).add_prefix('Title')
I want to split by finding the length of string and then split based on len(string)/3. Because sometimes there will be . in middle of sentence
Please help
Upvotes: 0
Views: 73
Reputation: 88226
Looks like you can use str.split
here:
df.Title.str.split(r'\[\d+\.\d+\]\s', expand=True)
0 1 \
0 Avg ticket size - merchant vs industry and ben...
1 Overall portfolio and benchmarks for the bank ...
2 Decline Reasons (Quarter-wise)- E-Com vs. POS ...
3 QoQ Pct of Transaction Method Spend, Transacti...
4 Pct of Transaction Method Spend, Transaction M...
5 QoQ of average ticket size, transactions per c...
2 \
0 Avg ticket size - merchant vs industry and ben...
1 Overall portfolio and benchmarks for the bank ...
2 Decline Reasons (Quarter-wise)- E-Com vs. POS ...
3 QoQ Pct of Transaction Method Spend, Transacti...
4 Pct of Transaction Method Spend, Transaction M...
5 QoQ of average ticket size, transactions per c...
3
0 Avg ticket size - merchant vs industry and ben...
1 Overall portfolio and benchmarks for the bank ...
2 Decline Reasons (Quarter-wise)- E-Com vs. POS ...
3 QoQ Pct of Transaction Method Spend, Transacti...
4 Pct of Transaction Method Spend, Transaction M...
5 QoQ of average ticket size, transactions per c...
Update
If you want to find the length of the strings in each row and split into 3, one approach could be:
n = 3
lens = df.Title.str.len()//n
l = [[i[(c-1)*sl:c*sl] for i, sl in zip(df.Title, lens)] for c in range(1, n+1)]
pd.DataFrame.from_records(l).T
0 \
0 [1.3] Avg ticket size - merchant vs industry a...
1 [10.1] Overall portfolio and benchmarks for th...
2 [10.10] Decline Reasons (Quarter-wise)- E-Com ...
3 [10.2] QoQ Pct of Transaction Method Spend, Tr...
4 [10.3] Pct of Transaction Method Spend, Transa...
5 [10.4] QoQ of average ticket size, transaction...
1 \
0 .[1.3] Avg ticket size - merchant vs industry ...
1 .[10.1] Overall portfolio and benchmarks for t...
2 .[10.10] Decline Reasons (Quarter-wise)- E-Com...
3 .[10.2] QoQ Pct of Transaction Method Spend, T...
4 .[10.3] Pct of Transaction Method Spend, Trans...
5 .[10.4] QoQ of average ticket size, transactio...
2
0 ).[1.3] Avg ticket size - merchant vs industry...
1 s.[10.1] Overall portfolio and benchmarks for ...
2 k.[10.10] Decline Reasons (Quarter-wise)- E-Co...
3 s.[10.2] QoQ Pct of Transaction Method Spend, ...
4 e.[10.3] Pct of Transaction Method Spend, Tran...
5 d.[10.4] QoQ of average ticket size, transacti...
Upvotes: 3
Reputation: 30971
Instead of divide the length by 3, you can use regex to split the string.
Note that such a split pattern can be defined as:
r'\.(?=\[)'
i.e.:
\.
- a (literal) dot,(?=\[\d+\.\d+])
- followed by [, a sequence of digits, another dot,
another sequence of digits and ].Note that the second part is a positive lookahead, so it will not be a part of match and this [nn,mm] fragment will be the starting part of the next string (result of split).
You didn't write about deleting this [nn,mm] fragment, so it should remain in the result of the split and this is the downside of the other answer.
So the code generating df concatenated with 3 new columns can be:
df2 = pd.concat([df, df.Title.str.split(
r'\.(?=\[\d+\.\d+])', expand=True)], axis=1)
Upvotes: 1