Reputation: 573
I am struggling with a problem in Pandas. Basically I want to add a new column (from a document-dataframe) to a bigger dataframe (corpus-dataframe), and the two dataframes have different shapes and indexes. So I understand that I have to map the index of both and then add it. I have been trying different things (merge, loc...) and I don't find the solution. Here the example:
On the one side, I have a general dataframe with all the tokens in the corpus (corpus-dataframe):
|-------|------------------|
| token | mean_freq_corpus |
|-------|------------------|
| de | 35 |
|-------|------------------|
| el | 29 |
|-------|------------------|
| la | 15 |
|-------|------------------|
| en | 10 |
|-------|------------------|
| se | 5 |
|-------|------------------|
On the other side I have a bunch of dataframes also with the information about the frequency of this token in each specific document (document-dataframes). Like:
|-------|------------------|
| token | Doc_1 |
|-------|------------------|
| de | 20 |
|-------|------------------|
| se | 10 |
|-------|------------------|
|-------|------------------|
| token | Doc_2 |
|-------|------------------|
| el | 15 |
|-------|------------------|
| la | 12 |
|-------|------------------|
|-------|------------------|
| token | Doc_3 |
|-------|------------------|
| de | 30 |
|-------|------------------|
| el | 20 |
|-------|------------------|
| la | 12 |
|-------|------------------|
| en | 10 |
|-------|------------------|
As you see, the shapes are different, and the tokens from which I have the information are different, so I cannot just add this information to the general_frequency table without taking in consideration the indexes. What I would like at the end is something like:
|-------|------------------|-----|-----|-----|
| token | mean_freq_corpus |Doc_1|Doc_2|Doc_3|
|-------|------------------|-----|-----|-----|
| de | 35 | 20 | 0 | 30 |
|-------|------------------|-----|-----|-----|
| el | 29 | 0 | 10 | 20 |
|-------|------------------|-----|-----|-----|
| la | 15 | 0 | 15 | 12 |
|-------|------------------|-----|-----|-----|
| en | 10 | 0 | 0 | 10 |
|-------|------------------|-----|-----|-----|
| se | 5 | 10 | 0 | 0 |
|-------|------------------|-----|-----|-----|
I could iterate over the different document-dataframes and append the value in the corpus-dataframe using df.at[index,column], but I wanted to know if there is another possibility using map, lambda or anything like that.
Please, take in consideration that the examples are simplified, each table have a couple of thousands and I have several thousands of these document-tables. Thanks!
Upvotes: 1
Views: 87
Reputation: 863166
Use concat
for join all small DataFrame
s and then join
to first DataFrame
with left join:
dfs = [df1, df2, df3]
df_all = pd.concat([x.set_index('token') for x in dfs], axis=1).fillna(0).astype(int)
df = df.join(df_all, on='token')
print (df)
token mean_freq_corpus Doc_1 Doc_2 Doc_3
0 de 35 20 0 30
1 el 29 0 15 20
2 la 15 0 12 12
3 en 10 0 0 10
4 se 5 10 0 0
Upvotes: 2