Reputation: 704
I have a file with 9 columns. One of the columns holds such string
Unique
3:107912234-107912321(-)
4:107913333-107913322(+)
Y:222002110-221002100(+)
MT:34330044-343123232(-)
X:838377373-834121212(+)
~ 400,000 rows with different strings. How can I split it into 4 different columns in the the same df, I can use df.str(",")
if there was just one delimiter, but since this has different delimiters, I am lost.
expected output:
chr start end strand
3 107912234 107912321 -
4 107913333 107913322 +
Y 222002110 221002100 +
MT 34330044 343123232 -
X 838377373 834121212 +
Upvotes: 4
Views: 1785
Reputation: 1540
You can split with the delimiters using regex to loop over your df
import re
x ='X:838377373-834121212(-)'
[s for s in re.split('\-(?=[0-9])|:|\(|\)', x) if s]
Upvotes: 1
Reputation: 862511
You can use extract
:
df1 = df['Unique'].str.extract("(?P<ch>.*?):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[-+])",
expand=True)
print (df1)
ch start end strand
0 3 107912234 107912321 -
1 4 107913333 107913322 +
2 Y 222002110 221002100 +
3 MT 34330044 343123232 -
4 X 838377373 834121212 +
Thanks A-Za-z for suggestion - if data are not always +-
in strand
column:
df1 = df['Unique'].str.extract("(?P<ch>.*?):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>.*)\)",
expand=True)
print (df1)
ch start end strand
0 3 107912234 107912321 -
1 4 107913333 107913322 +
2 Y 222002110 221002100 +
3 MT 34330044 343123232 -
4 X 838377373 834121212 +
If need add to original df
this columns use join
:
print (df.join(df1))
Unique ch start end strand
0 3:107912234-107912321(-) 3 107912234 107912321 -
1 4:107913333-107913322(+) 4 107913333 107913322 +
2 Y:222002110-221002100(+) Y 222002110 221002100 +
3 MT:34330044-343123232(-) MT 34330044 343123232 -
4 X:838377373-834121212(+) X 838377373 834121212 +
Upvotes: 5