denz franc

Reputation: 105

How to split data and concatenate based on list values in Python?

I need to split the data and map values based on a list.

df

Id      String
1       JHA PQR 20 STO KJAN
2       LKS JHA PLA; NIYM
3       LMA\KHA 20 HYS,KNSN
4       JHA, PQR STO 20 KJAM
5       JHA PQR|STO/KJAOP

List_to_map = ['JHA', 'LMA', 'STO', 'PQR', 'LKS']

df_output

Id      String                 Values
1       JHA PQR 20 STO KJAN    JHA+PQR+STO
2       LKS JHA PLA; NIYM      LKS+JHA
3       LMA\KHA 20 HYS,KNSN    LMA
4       JHA, PQR STO 20 KJAM   JHA+PQR+STO
5       JHA PQR|STO/KJAOP      JHA+PQR+STO

I need to match the values in the String column against the list. If any values from the list appear in the string, they should be concatenated with + and stored in a new column.

Upvotes: 0

Views: 65

Answers (2)

jezrael

Reputation: 863166

Use Series.str.findall with word boundaries around each value of the list, then join the matches together with Series.str.join:

pat = '|'.join(r"\b{}\b".format(x) for x in List_to_map)
df['Values'] = df['String'].astype(str).str.findall(pat).str.join('+')
print (df)
   Id             String       Values
0   1   JHA PQR STO KJAN  JHA+PQR+STO
1   2  LKS JHA PLA; NIYM      LKS+JHA
2   3   LMA\KHA HYS,KNSN          LMA
3   4  JHA, PQR STO KJAM  JHA+PQR+STO
4   5  JHA PQR|STO/KJAOP  JHA+PQR+STO
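
For reference, here is a self-contained sketch of the same approach that rebuilds the question's sample data. Two details are my own additions rather than part of the answer above: re.escape() guards against list values that contain regex metacharacters, and the alternatives are wrapped in a single non-capturing group so the word boundaries are written once.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "String": [
        "JHA PQR 20 STO KJAN",
        "LKS JHA PLA; NIYM",
        r"LMA\KHA 20 HYS,KNSN",
        "JHA, PQR STO 20 KJAM",
        "JHA PQR|STO/KJAOP",
    ],
})
List_to_map = ["JHA", "LMA", "STO", "PQR", "LKS"]

# Assumption: escaping is added in case a list value contains characters
# such as "." or "+". The non-capturing group (?:...) keeps findall
# returning whole matches instead of group contents.
pat = r"\b(?:" + "|".join(re.escape(x) for x in List_to_map) + r")\b"

# findall returns the matches in the order they occur in each string.
df["Values"] = df["String"].astype(str).str.findall(pat).str.join("+")
print(df)
```

Because findall scans each string from left to right, the joined values follow the order in which they appear in String, and a value that occurs twice is listed twice.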

Upvotes: 2

SeaBean

Reputation: 23217

You can use .str.split() to split on non-word character with regex \W, then get the common elements with List_to_map by np.intersect1d(). Finally, join the matching strings with + using .str.join(), as follows:

import numpy as np
List_to_map = ['JHA', 'LMA', 'STO', 'PQR', 'LKS']

df['Values'] = df['String'].str.split(r'\W').apply(lambda x: np.intersect1d(x, List_to_map)).str.join('+')

Result:

print(df)


   Id             String       Values
0   1   JHA PQR STO KJAN  JHA+PQR+STO
1   2  LKS JHA PLA; NIYM      JHA+LKS
2   3   LMA\KHA HYS,KNSN          LMA
3   4  JHA, PQR STO KJAM  JHA+PQR+STO
4   5  JHA PQR|STO/KJAOP  JHA+PQR+STO

Alternatively, if you want to maintain the sequence of original string, you can also use:

df['Values'] = df['String'].str.split(r'\W').apply(lambda x: [y for y in x if y in List_to_map]).str.join('+')

Result:

print(df)


   Id             String       Values
0   1   JHA PQR STO KJAN  JHA+PQR+STO
1   2  LKS JHA PLA; NIYM      LKS+JHA
2   3   LMA\KHA HYS,KNSN          LMA
3   4  JHA, PQR STO KJAM  JHA+PQR+STO
4   5  JHA PQR|STO/KJAOP  JHA+PQR+STO

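Note that np.intersect1d() returns the sorted, unique common elements, so the joined values come out in alphabetical order (for example JHA+LKS for Id 2) rather than in the order of the original string or of List_to_map, and repeated matches are collapsed. The list comprehension keeps the original order and any repeats. If the concatenation order is not important, np.intersect1d() may run faster than the Python list comprehension, although that is worth timing on your own data.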
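
The ordering difference between the two variants can be seen on a small example. This is only a sketch: the second sample string, the set conversion, and the variable names are my own additions for illustration, and the split uses \W+ (rather than \W) so that runs of separators do not produce empty tokens.

```python
import numpy as np
import pandas as pd

List_to_map = ["JHA", "LMA", "STO", "PQR", "LKS"]
# Assumption: a set is used here only to make the membership test cheap.
lookup = set(List_to_map)

# First string is from the question; the second is a made-up example
# containing a repeated value.
s = pd.Series(["LKS JHA PLA; NIYM", "STO JHA STO"])

# Split on runs of non-word characters.
tokens = s.str.split(r"\W+")

# Variant 1: np.intersect1d returns the sorted, unique common values.
sorted_vals = tokens.apply(
    lambda x: "+".join(np.intersect1d(x, List_to_map))
)

# Variant 2: the list comprehension keeps original order and repeats.
ordered_vals = tokens.apply(
    lambda x: "+".join(y for y in x if y in lookup)
)

print(sorted_vals.tolist())
print(ordered_vals.tolist())
```

Which variant is appropriate depends on whether the order of appearance and repeated values matter for the Values column.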

Upvotes: 2
