Reputation: 135
I am trying to parse text data in Pandas DataFrame based on certain tags and values in another column's fields and store them in their own columns. For example, if I created this dataframe, df:
df = pd.DataFrame([[1,2],['A: this is a value B: this is the b val C: and here is c.','A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1','col2']
df['tags'] = df['col2'].apply(lambda x: re.findall('(?:\s|)(\w*)(?::)',x))
all_tags = []
for val in df['tags']:
all_tags = all_tags + val
all_tags = list(set(all_tags))
for val in all_tags:
df[val] = ''
df:
col1 col2 tags A C B
0 1 A: this is a value B: this is the b val C: and... [A, B, C]
1 2 A: and heres another a. C: and another c [A, C]
How would I populate each of the new "tag" columns with their values from col2 so I get this df:
col1 col2 tags \
0 1 A: this is a value B: this is the b val C: and... [A, B, C]
1 2 A: and heres another a. C: and another c [A, C]
A C B
0 this is a value and here is c. this is the b val
1 and heres another a. and another c
Upvotes: 2
Views: 2151
Reputation: 214957
Another option using str.extractall
with regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)
:
The regex captures the key (?P<key>\w+)
before the semi colon and value after the semi colon (?P<val>[^:]*)
as two separate columns key
and val
, the val
will match non :
characters until it reaches the next key value pair restricted by a look ahead syntax (?=\w+:|$)
; This assumes the key is always a single word which would be ambiguous otherwise:
import re
pat = re.compile("(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)")
pd.concat([
df,
(
df.col2.str.extractall(pat)
.reset_index('match', drop=True)
.set_index('key', append=True)
.val.unstack('key')
)
], axis=1).fillna('')
Where str.extractall
gives:
df.col2.str.extractall(pat)
And then you pivot the result and concatenate with the original data frame.
Upvotes: 3
Reputation: 76927
Here's one way
In [683]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
.apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x])))
)
Out[683]:
A B C
0 this is a value this is the b val and here is c.
1 and heres another a. NaN and another c
You could append back the results using join
In [690]: df.join(df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
.apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x]))))
Out[690]:
col1 col2 tags \
0 1 A: this is a value B: this is the b val C: and... [A, B, C]
1 2 A: and heres another a. C: and another c [A, C]
A B C
0 this is a value this is the b val and here is c.
1 and heres another a. NaN and another c
Infact, you could get df['tags']
using string method
In [688]: df.col2.str.findall('(?:\s|)(\w*)(?::)')
Out[688]:
0 [A, B, C]
1 [A, C]
Name: col2, dtype: object
Details:
Split groups into lists
In [684]: df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
Out[684]:
0 [A: this is a value, B: this is the b val, C: ...
1 [A: and heres another a., C: and another c]
Name: col2, dtype: object
Now, to key and value pairs of lists.
In [685]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
.apply(lambda x: [v.split(':', 1) for v in x]))
Out[685]:
0 [[A, this is a value], [B, this is the b val...
1 [[A, and heres another a.], [C, and another c]]
Name: col2, dtype: object
Upvotes: 2