Wiseface

Reputation: 192

Concatenate and sum column values while iterating

I am trying to create a function that will take in CSV files and create dataframes and concatenate/sum like so:

                     id  number_of_visits
0   3902932804358904910                 2
1   5972629290368575970                 1
2   5345473950081783242                 1
3   4289865755939302179                 1
4  36619425050724793929                19

+ 

                     id  number_of_visits
0   3902932804358904910                 5
1   5972629290368575970                10
2   5345473950081783242                 3
3   4289865755939302179                20
4  36619425050724793929                13

=

                     id  number_of_visits
0   3902932804358904910                 7
1   5972629290368575970                11
2   5345473950081783242                 4
3   4289865755939302179                21
4  36619425050724793929                32

My main issue is that, in the for loop after creating the dataframes, I tried to concatenate with df += new_df, but new_df was never actually added. So I tried the following implementation instead.

def add_dfs(files):
    master = []
    big = pd.DataFrame({'id': 0, 'number_of_visits': 0}, index=[0])  # dummy df to initialize
    for k in range(len(files)):
        new_df = create_df(str(files[k]))  # helper method to read, create and clean dfs
        master.append(new_df)  # collect the dataframes in master
    for k in range(len(master)):
        # iterate through the list of dfs and combine them
        big = pd.concat([big, master[k]]).groupby(['id', 'number_of_visits']).sum().reset_index()
    return big

Which gives me the following

                    id  number_of_visits
1  1000036822946495682                 2
2  1000036822946495682                 4
3  1000044447054156512                 1
4  1000044447054156512                 9
5  1000131582129684623                 1

So the number_of_visits values for each id aren't actually being added together; the rows are just being sorted by number_of_visits.
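For reference, a minimal two-row repro (hypothetical data, not the real CSVs) shows why grouping on both columns can never sum anything:

```python
import pandas as pd

# Hypothetical two-row frames standing in for the real CSVs.
df1 = pd.DataFrame({'id': [1, 2], 'number_of_visits': [2, 1]})
df2 = pd.DataFrame({'id': [1, 2], 'number_of_visits': [5, 10]})

# Grouping on BOTH columns makes each (id, number_of_visits) pair its own
# group, so every group holds a single row and sum() has nothing to add.
broken = pd.concat([df1, df2]).groupby(['id', 'number_of_visits']).sum().reset_index()
print(broken)  # four rows -- nothing was summed

# Grouping on id alone combines the rows as intended.
fixed = pd.concat([df1, df2]).groupby('id', as_index=False)['number_of_visits'].sum()
print(fixed)  # id 1 -> 7, id 2 -> 11
```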

Upvotes: 0

Views: 236

Answers (2)

user15398259

Reputation:

Pass your list of dataframes directly to concat() then group on the id and sum.

>>> pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
                     id  number_of_visits
0  36619425050724793929                32
1   3902932804358904910                 7
2   4289865755939302179                21
3   5345473950081783242                 4
4   5972629290368575970                11
def add_dfs(files):
    master = []
    for f in files:
        new_df = create_df(f)
        master.append(new_df)
    big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
    return big
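A quick way to sanity-check this pattern without touching disk is to feed `read_csv` in-memory buffers (`create_df` is the asker's own helper, so plain `read_csv` stands in for it here):

```python
import io
import pandas as pd

# Two in-memory "files" standing in for the CSVs on disk.
files = [
    io.StringIO("id,number_of_visits\n3902932804358904910,2\n5972629290368575970,1"),
    io.StringIO("id,number_of_visits\n3902932804358904910,5\n5972629290368575970,10"),
]

# Same pattern as add_dfs: concat the whole list, then group on id and sum.
master = [pd.read_csv(f) for f in files]
big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
print(big)
# id 3902932804358904910 -> 7, id 5972629290368575970 -> 11
```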

Upvotes: 1

Albo

Reputation: 1644

You can use

df1['number_of_visits'] += df2['number_of_visits']

this gives you:

|    |                   id |   number_of_visits |
|---:|---------------------:|-------------------:|
|  0 |  3902932804358904910 |                  7 |
|  1 |  5972629290368575970 |                 11 |
|  2 |  5345473950081783242 |                  4 |
|  3 |  4289865755939302179 |                 21 |
|  4 | 36619425050724793929 |                 32 |
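One caveat worth noting: `+=` on two Series aligns on the index (0, 1, 2, ...), not on `id`, so this only gives correct totals when both dataframes list the ids in the same row order. A small sketch with shuffled rows (hypothetical data) shows the failure mode:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [10, 20], 'number_of_visits': [2, 1]})
# Same ids, but in a different row order.
df2 = pd.DataFrame({'id': [20, 10], 'number_of_visits': [3, 5]})

# += pairs rows by index position, so id 10 silently gets id 20's count:
# row 0 becomes 2 + 3 = 5 instead of the correct 2 + 5 = 7.
df1['number_of_visits'] += df2['number_of_visits']
print(df1)

# A safer variant aligns on id first:
# df1.set_index('id')['number_of_visits'] + df2.set_index('id')['number_of_visits']
```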

Upvotes: 0
