Reputation: 747
I have four tables: predicted_tags, actual_tags, tags_names and news_text.
In the predicted_tags and actual_tags tables, the column names are tag ids and the row index is the news id. In these tables 1 means True and 0 means False.
The shape of predicted_tags and actual_tags is (23413, 1369).
predicted_tags:
print(predicted_tags)
+-------+-----+---+-----+------+------+
| | 1 | 3 | ... | 8345 | 8347 |
+-------+-----+---+-----+------+------+
| 35615 | 0 | 0 | ... | 1 | 0 |
| 58666 | 1 | 0 | ... | 0 | 0 |
| 16197 | 0 | 0 | ... | 0 | 1 |
| 68824 | 0 | 0 | ... | 1 | 1 |
| 22277 | 0 | 0 | ... | 1 | 0 |
+-------+-----+---+-----+------+------+
actual_tags:
print(actual_tags)
+-------+-----+---+-----+------+------+
| | 1 | 3 | ... | 8345 | 8347 |
+-------+-----+---+-----+------+------+
| 35615 | 0 | 0 | ... | 1 | 0 |
| 58666 | 1 | 1 | ... | 0 | 0 |
| 16197 | 0 | 0 | ... | 0 | 1 |
| 68824 | 0 | 0 | ... | 1 | 1 |
| 22277 | 0 | 1 | ... | 1 | 0 |
+-------+-----+---+-----+------+------+
tags_names:
print(tags_names)
+--------+----------+-------------+
| | tag_id | tag_name |
+--------+----------+-------------+
| 127579 | 1 | politics |
| 108814 | 3 | economics |
| ... | ... | ... |
| 18 | 8345 | hot |
| 257141 | 8347 | environment |
+--------+----------+-------------+
news_text:
print(news_text)
+----------+------------------------+-----------------------------+
| | news_name | news_content |
+----------+------------------------+-----------------------------+
| 35615 | Secret of… | Hi! Today I will talk... |
| 58666 | Conversations with a … | I have a big experience... |
| 16197 | Harm of alcohol | Today, we… |
| ... | ... | ... |
| 68824 | Hot news | Celebrity with... |
| 22277 | Finance market | Last week… |
+----------+------------------------+-----------------------------+
I want to get the following table:
+-------+------------------------+----------------------------+------------------------+---------------------------+
| | news_name | news_content | predicted_tags | actual_tags |
+-------+------------------------+----------------------------+------------------------+---------------------------+
| 35615 | Secret of… | Hi! Today I will talk... | ['hot'] | ['hot'] |
| 58666 | Conversations with a … | I have a big experience... | ['politics'] | ['politics', 'economics'] |
| 16197 | Harm of alcohol | Today, we… | ['environment'] | ['environment'] |
| 68824 | Hot news | Celebrity with... | ['hot', 'environment'] | ['hot', 'environment'] |
| 22277 | Finance market | Last week… | ['hot'] | ['hot', 'economics'] |
+-------+------------------------+----------------------------+------------------------+---------------------------+
How can I do this using Pandas?
Upvotes: 1
Views: 460
Reputation: 13401
Convert the tags_names DataFrame into a dictionary and use it to rename the columns:
tag_names = dict(zip(tags_names['tag_id'], tags_names['tag_name']))
predicted_tags.rename(columns = tag_names, inplace = True)
actual_tags.rename(columns = tag_names, inplace = True)
Then, for each row, get the column names where the value is 1:
news_text['actual_tags'] = (actual_tags == 1 ).apply(lambda y: actual_tags.columns[y.tolist()].tolist(), axis=1)
news_text['predicted_tags'] = (predicted_tags == 1 ).apply(lambda y: predicted_tags.columns[y.tolist()].tolist(), axis=1)
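As a quick check of the approach above, here is a minimal sketch on a four-tag toy version of the question's data. The toy frames and the names `id_to_name` and `renamed` are assumptions made for illustration; `rename` is called without `inplace` so the original frame keeps its tag-id columns.

```python
import pandas as pd

news_ids = [35615, 58666, 16197, 68824, 22277]

# Toy stand-ins for the question's tables (only four tags, five news items).
predicted_tags = pd.DataFrame(
    {1: [0, 1, 0, 0, 0], 3: [0, 0, 0, 0, 0],
     8345: [1, 0, 0, 1, 1], 8347: [0, 0, 1, 1, 0]},
    index=news_ids,
)
tags_names = pd.DataFrame(
    {"tag_id": [1, 3, 8345, 8347],
     "tag_name": ["politics", "economics", "hot", "environment"]},
    index=[127579, 108814, 18, 257141],
)
news_text = pd.DataFrame(
    {"news_name": ["Secret of…", "Conversations with a …", "Harm of alcohol",
                   "Hot news", "Finance market"]},
    index=news_ids,
)

# Map tag_id -> tag_name and rename the columns on a copy.
id_to_name = dict(zip(tags_names["tag_id"], tags_names["tag_name"]))
renamed = predicted_tags.rename(columns=id_to_name)

# For each row, keep the column names whose value is 1.
news_text["predicted_tags"] = renamed.eq(1).apply(
    lambda row: renamed.columns[row.to_numpy()].tolist(), axis=1
)
print(news_text)
```

The assignment aligns on the index, so it only gives the expected result when news_text and the tag tables share the same news ids.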
Upvotes: 2
Reputation: 1441
You can convert the one-hot encoding of tags to a list of tags using pandas apply. I would first convert tag_names from a DataFrame to a Series (whose index is tag_id and whose values are the tag names). I'm demonstrating this with only two tags for now.
>>> import pandas as pd
>>> df = pd.DataFrame({
...     1: [0, 1, 0, 0, 0],
...     3: [0, 1, 0, 0, 1]},
...     index=[35615, 58666, 16197, 68824, 22277])  # predicted_tags
>>> df
1 3
35615 0 0
58666 1 1
16197 0 0
68824 0 0
22277 0 1
>>> tag_names = pd.DataFrame({"tag_id": [1, 3],
...                           "tag_name": ["politics", "economics"]},
...                          index=[127579, 108814])
>>> tag_names
tag_id tag_name
127579 1 politics
108814 3 economics
>>> tags = tag_names.set_index("tag_id").tag_name
>>> tags
tag_id
1 politics
3 economics
Name: tag_name, dtype: object
>>> df.apply( lambda row: [tags.loc[k] for k,v in row.items() if v > 0] , axis=1)
35615 []
58666 [politics, economics]
16197 []
68824 []
22277 [economics]
dtype: object
>>>
You should now be able to join this with news_text on the index.
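The join step could look roughly like the sketch below. The toy `news_text` frame and the names `tag_lists` and `result` are assumptions for illustration; the Series is given a name first because `DataFrame.join` needs one to use as the new column label.

```python
import pandas as pd

news_ids = [35615, 58666, 16197, 68824, 22277]

# Same two-tag toy data as in the session above.
df = pd.DataFrame({1: [0, 1, 0, 0, 0], 3: [0, 1, 0, 0, 1]}, index=news_ids)
tags = pd.Series({1: "politics", 3: "economics"}, name="tag_name")

# Hypothetical stand-in for the question's news_text table.
news_text = pd.DataFrame(
    {"news_name": ["Secret of…", "Conversations with a …", "Harm of alcohol",
                   "Hot news", "Finance market"]},
    index=news_ids,
)

# One list of tag names per news item, named so it can become a column.
tag_lists = df.apply(
    lambda row: [tags.loc[k] for k, v in row.items() if v > 0], axis=1
).rename("predicted_tags")

# Index-on-index join: rows are matched by news id.
result = news_text.join(tag_lists)
print(result)
```

Repeating the same two steps for actual_tags with a different Series name would add the second column.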
Upvotes: 2
Reputation: 917
First, create a column that holds each row's predicted/actual values as a list:
predicted_tags['pred_loc'] = predicted_tags.values.tolist()
actual_tags['actual_loc'] = actual_tags.values.tolist()
Also, if the tag_id values (in the tags_names DataFrame) are in the same order as the columns of your actual and predicted tags DataFrames, just create a list of tag names:
tags = tags_names.tag_name.values.tolist()
Now, before converting, merge these columns into the news_text DataFrame:
news_text = news_text.merge(predicted_tags['pred_loc'], how='outer', left_index=True, right_index=True)
news_text = news_text.merge(actual_tags['actual_loc'], how='outer', left_index=True, right_index=True)
Now, we convert:
news_text.pred_loc = news_text.pred_loc.apply(lambda x: [tags[i] for i, j in enumerate(x) if j == 1])
news_text.actual_loc = news_text.actual_loc.apply(lambda x: [tags[i] for i, j in enumerate(x) if j == 1])
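The positional lookup above relies on tags_names being in the same order as the tag columns. If that is not guaranteed, one option is to align the names to the column order first, as in this minimal sketch. The toy frames and the names `name_by_id` and `pred_names` are assumptions for illustration, and the alignment has to run before any extra column such as pred_loc is added to predicted_tags.

```python
import pandas as pd

# Toy tag table whose columns are tag ids.
predicted_tags = pd.DataFrame(
    {1: [0, 1], 3: [0, 0], 8345: [1, 0]}, index=[35615, 58666]
)
# tags_names deliberately listed in a different order than the columns.
tags_names = pd.DataFrame(
    {"tag_id": [8345, 1, 3], "tag_name": ["hot", "politics", "economics"]}
)

# Look names up by tag_id, then reorder them to match the column order.
name_by_id = tags_names.set_index("tag_id")["tag_name"]
tags = name_by_id.reindex(predicted_tags.columns).tolist()

# Same positional conversion as in the answer, now safe against reordering.
pred_loc = pd.Series(predicted_tags.values.tolist(), index=predicted_tags.index)
pred_names = pred_loc.apply(lambda x: [tags[i] for i, j in enumerate(x) if j == 1])
print(pred_names)
```

A tag id that is missing from tags_names would show up as NaN in `tags`, which makes the mismatch visible instead of silently shifting names.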
Upvotes: 2