Reputation: 806
I would like to merge two columns into one as a list of words/tokens. Currently my dataset looks like:
A_Col    B_Col             C_Col
home     my house          I have a new house
paper    research paper    my mobile phone is broken
NaN      NaN               zoe zaczek who
NaN      NaN               two per cent
NaN marks an empty field.
What I would like to do is keep column A_Col but merge B_Col and C_Col, in order to get something like this:
A_Col    BC_Col
home     ['my', 'house', 'I', 'have', 'a', 'new', 'house']
paper    ['research', 'paper', 'my', 'mobile', 'phone', 'is', 'broken']
NaN      ['zoe', 'zaczek', 'who']
NaN      ['two', 'per', 'cent']
Looking at the problem, the steps required should be:
- tokenise B_Col;
- tokenise C_Col;
- merge the results.
For the first two points I am using the following:
df['B_Col'] = df.apply(lambda row: nltk.word_tokenize(row['B_Col']))
df['C_Col'] = df.apply(lambda row: nltk.word_tokenize(row['C_Col']))
For merging the results:
df['BC_Col'] = df['B_Col'] + df['C_Col']
Then I should remove the NaN values.
However, something is wrong in my code: I am not getting the tokenisation for B_Col and C_Col.
I hope you can help me to understand my error. Thanks.
Upvotes: 1
Views: 715
Reputation: 2417
You could do:
df['BC_Col'] = df['B_Col'].fillna('').str.split() + df['C_Col'].fillna('').str.split()
df
   A_Col  B_Col           C_Col                      BC_Col
0  home   my house        I have a new house         [my, house, I, have, a, new, house]
1  paper  research paper  my mobile phone is broken  [research, paper, my, mobile, phone, is, broken]
2  NaN    NaN             zoe zaczek who             [zoe, zaczek, who]
3  NaN    NaN             two per cent               [two, per, cent]
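As for why the original attempt fails: DataFrame.apply defaults to axis=0, so the lambda receives whole columns rather than rows, and row['B_Col'] is not a per-row lookup. Even with axis=1, a tokenizer that expects a string would most likely choke on the NaN cells. If you want to keep a per-cell tokenizer instead of str.split, a minimal sketch could look like the following. The tokenize helper is hypothetical and only splits on whitespace; nltk.word_tokenize could presumably be swapped in for the string branch.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A_Col": ["home", "paper", np.nan, np.nan],
    "B_Col": ["my house", "research paper", np.nan, np.nan],
    "C_Col": ["I have a new house", "my mobile phone is broken",
              "zoe zaczek who", "two per cent"],
})

def tokenize(cell):
    """Hypothetical helper: whitespace-split a string cell; NaN yields no tokens."""
    return cell.split() if isinstance(cell, str) else []

# Series.apply calls tokenize once per cell, so there is no row/column confusion.
# Adding two Series of lists concatenates the lists element-wise.
df["BC_Col"] = df["B_Col"].apply(tokenize) + df["C_Col"].apply(tokenize)
print(df[["A_Col", "BC_Col"]])
```

Because NaN cells become empty lists before the merge, there is no separate NaN-removal step afterwards.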
Upvotes: 1