Reputation: 195
I'm reading an excel file into a dataframe where one of its columns has 1 to many values, delimited by a space. I need to search the column for a string of text, and if found return the complete value between its delimiters(not the entire cell value).
Input would look something like this –
import pandas as pd
df = pd.DataFrame({'command': ['abc123', 'abcdef', 'hold', 'why'],
'name': ['/l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah',
'/l1/l2/fred/a/b/c /l1/l2/bad/b /l1/l2/fred/x',
'/l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha',
'/la/lb2/sure/a/b/c'],
'date': ['2020-05', '2020-05', '2020-05', '2020-06']})
With a print it returns -
command date name
0 abc123 2020-05 /l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah
1 abcdef 2020-05 /l1/l2/fred/a/b/c /l1/l2/bad/b /l1/l2/fred/x
2 hold 2020-05 /l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha
3 why 2020-06 /la/lb2/sure/a/b/c
I'm using the following to parse for the strings I'm after –
terms = ['/l1/l2/good', '/la/lb/sure']
df = df[df['name'].str.contains('|'.join(terms))]
which returns –
command date name
0 abc123 2020-05 /l1/l2/good/a/b/c /l1/l2/bad/b /l1/l2/fred/x /la/lb/sure/blah
2 hold 2020-05 /l2/l3/betty/a/b/c /l1/l2/bad/b /l1/l2/fred/x /l1/l2/good/ha
3 why 2020-06 /la/lb/sure/a/b/c
What I would like returned is –
command date name
0 abc123 2020-05 /l1/l2/good/a/b/c /la/lb/sure/blah
2 hold 2020-05 /l1/l2/good/ha
3 why 2020-06 /la/lb/sure/a/b/c
I had tried performing a split on the space delimiter, but then I'm unable to loop through those values to parse them as needed.
Thanks.
Upvotes: 1
Views: 125
Reputation: 15962
The problem with using replace()
is that it will only apply to what matches, and not to the items you want to skip/remove from the results. Instead, use findall()
:
pat = f"((?:{'|'.join(terms)})[^ ]*)"
# `pat` is ((?:/l1/l2/good|/la/lb/sure)[^ ]*)
That regex attempts to match with each pattern and everything that follows it except a space.
df['name'].str.findall(pat) # will give all the matches, as a series:
0 [/l1/l2/good/a/b/c, /la/lb/sure/blah]
1 []
2 [/l1/l2/good/ha]
3 [/la/lb/sure/a/b/c]
Name: name, dtype: object
join
the results with a space and assign it back to 'name':
df['name'] = df['name'].str.findall(pat).str.join(' ')
Result df
:
command name date
0 abc123 /l1/l2/good/a/b/c /la/lb/sure/blah 2020-05
1 abcdef 2020-05
2 hold /l1/l2/good/ha 2020-05
3 why /la/lb/sure/a/b/c 2020-06
(Btw, in your DataFrame definition, your last entry is '/la/lb2/sure/a/b/c'
instead of '/la/lb/sure/a/b/c'
. I've changed that. And date
is the last column in your def.)
Upvotes: 2
Reputation: 9481
This splits the string in the columns at each space, and for the resulting entries looks if the entries in your terms list are a part of them. If there is a perfect match, the respective strings are rejoined:
df['name'].apply(lambda x: ' '.join([a for a in x.split(' ') for t in terms if a.find(t)!=-1]))
Limitations: this does not do partial matches. For instance:
'/l1/l2/good/a/b/c' contains '/l1/l2/good' and therefore would be retained.
'/l1/l3/good/a/b/c' does not contain '/l1/l2/good' and therefore would not be retained.
All of that said, this way of storing data does not lead to very easily accessible dataframes.
Upvotes: 0