Reputation:
I have a column data as follows:
abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|
abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|
I want to search and extract specific keywords within the frame and extract only that data with in the separators
Specific keywords are
enst.35
enst.18
enst.98
enst.63
The expected output is
abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA
I tried this herebut was not working effectively
Upvotes: 1
Views: 50
Reputation: 16683
With str.extract
and str.split(',)
to take the last comma separated value:
df['Data2'] = df['Data'].str.extract('(^.*enst.35\|.+?\|.+?\||^.*enst.18\|.+?\|.+?\||^.*enst.98\|.+?\|.+?\||^.*enst.63\|.+?\|.+?\||)', expand=False).str.split(',').str[-1]
You could create a list of keywords and do list comprehension as well per another answer.
Upvotes: 0
Reputation: 71689
You can construct a regex pattern using the given keywords
then use Series.str.findall
to find all occurrences of regex in series:
import re
keywords = ['enst.35','enst.18','enst.98','enst.63']
pattern = '|'.join([rf'[^,]+{re.escape(k)}[^,]+'for k in keywords])
result = df['col'].str.findall(pattern).str.get(0)
#print(result)
0 abc|framex|gtk4|enst.35|pxc|h5g|
1 abc|frbx|hgk4|enst.18|pif|homg|
2 abc|frame|gtk|enst.98|pc|hg|
3 NaN
Name: col, dtype: object
You can test the regex pattern here
Upvotes: 1
Reputation:
You can try in bashscript as follows:
for STRING in enst.35 enst.18 enst.98 enst.63; do
tr \, \\n < file.txt | grep "$STRING" || echo NA
done
Upvotes: 0