José
José

Reputation: 573

Add new column in Pandas Dataframe comparing indexes

I am struggling with a problem in Pandas. Basically I want to add a new column (from a document-dataframe) to a bigger dataframe (corpus-dataframe), and the two dataframes have different shapes and indexes. So I understand that I have to map the index of both and then add it. I have been trying different things (merge, loc...) and I don't find the solution. Here the example:

On the one side, I have a general dataframe with all the tokens in the corpus (corpus-dataframe):

|-------|------------------|
| token | mean_freq_corpus |
|-------|------------------|
| de    | 35               |
|-------|------------------|
| el    | 29               |
|-------|------------------|
| la    | 15               |
|-------|------------------|
| en    | 10               |
|-------|------------------|
| se    | 5                |
|-------|------------------|

On the other side I have a bunch of dataframes also with the information about the frequency of this token in each specific document (document-dataframes). Like:

|-------|------------------|
| token | Doc_1            |
|-------|------------------|
| de    | 20               |
|-------|------------------|
| se    | 10               |
|-------|------------------|


|-------|------------------|
| token | Doc_2            |
|-------|------------------|
| el    | 15               |
|-------|------------------|
| la    | 12               |
|-------|------------------|


|-------|------------------|
| token | Doc_3            |
|-------|------------------|
| de    | 30               |
|-------|------------------|
| el    | 20               |
|-------|------------------|
| la    | 12               |
|-------|------------------|
| en    | 10               |
|-------|------------------|

As you see, the shapes are different, and the tokens from which I have the information are different, so I cannot just add this information to the general_frequency table without taking in consideration the indexes. What I would like at the end is something like:

|-------|------------------|-----|-----|-----|
| token | mean_freq_corpus |Doc_1|Doc_2|Doc_3|
|-------|------------------|-----|-----|-----|
| de    | 35               | 20  |  0  | 30  |
|-------|------------------|-----|-----|-----|
| el    | 29               |  0  | 10  | 20  |
|-------|------------------|-----|-----|-----|
| la    | 15               |  0  | 15  | 12  |
|-------|------------------|-----|-----|-----|
| en    | 10               |  0  |  0  | 10  |
|-------|------------------|-----|-----|-----|
| se    | 5                | 10  |  0  | 0   |
|-------|------------------|-----|-----|-----|

I could iterate over the different document-dataframes and append the value in the corpus-dataframe using df.at[index,column], but I wanted to know if there is another possibility using map, lambda or anything like that.

Please, take in consideration that the examples are simplified, each table have a couple of thousands and I have several thousands of these document-tables. Thanks!

Upvotes: 1

Views: 87

Answers (1)

jezrael
jezrael

Reputation: 863166

Use concat for join all small DataFrames and then join to first DataFrame with left join:

dfs = [df1, df2, df3]
df_all = pd.concat([x.set_index('token') for x in dfs], axis=1).fillna(0).astype(int)
df = df.join(df_all, on='token')
print (df)
  token  mean_freq_corpus  Doc_1  Doc_2  Doc_3
0    de                35     20      0     30
1    el                29      0     15     20
2    la                15      0     12     12
3    en                10      0      0     10
4    se                 5     10      0      0

Upvotes: 2

Related Questions