Mistapopo

Reputation: 433

An unexpected increase of the number of rows after using dataframe.merge

I have some dataframes in a dictionary, and I want to merge all of them on a common column, "date". To do so, I used the following code:

# Keys of the dictionary, in insertion order
dictionnary_keys = list(dictionary.keys())
n = len(dictionary)
something = dictionary[dictionnary_keys[0]]
for i in range(1, n):
    something = something.merge(dictionary[dictionnary_keys[i]], on="date")
    print(something.shape)

Note that every value in the dictionary is a pandas dataframe of shape (500, 10). When I run this code, I get a memory error because both the number of rows and the number of columns increase. However, only the number of columns should increase. I don't understand why I get this result. Can someone explain to me how to deal with such a situation?

Thank you for your help. If you want more information, just let me know :)

Upvotes: 2

Views: 1408

Answers (1)

Alex

Reputation: 7065

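You most likely have duplicated date values. merge pairs every matching row on the left with every matching row on the right, so repeated keys multiply the number of rows.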

Here is a quick example:

import numpy as np
import pandas as pd

# Generate a dict of DataFrames with a duplicated value ('B') in column 'a'
d = dict()
for i in range(4):
    d[i] = pd.DataFrame({'a': list('ABBCD'), 'b': np.random.randint(0, 10, 5), 'c': i})

n = len(d)
s = d[0]
for i in range(1,n):
    s = s.merge(d[i], on="a")
    print(s.shape)
(7, 5)
(11, 7)
(19, 9)
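
The shapes above can be predicted by counting keys: 'B' appears twice on each side, so it contributes 2 * 2 = 4 rows after the first merge, while 'A', 'C' and 'D' contribute one row each. Here is a small sketch of that arithmetic. The frames `left` and `right` are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical frames: key "B" appears twice on each side.
left = pd.DataFrame({"a": list("ABBCD"), "x": range(5)})
right = pd.DataFrame({"a": list("ABBCD"), "y": range(5)})

# Number of rows per key on each side.
left_counts = left["a"].value_counts()
right_counts = right["a"].value_counts()

# An inner merge yields (left count * right count) rows for each shared key.
# Keys present on only one side become NaN in the product and are dropped.
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = left.merge(right, on="a")
print(expected_rows, len(merged))
```

Both numbers should agree, and they grow multiplicatively with every further merge on the same duplicated key.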

Re-run with no duplicates:

d = dict()
for i in range(4):
    d[i] = pd.DataFrame({'a': list('ABCDE'), 'b':np.random.randint(0, 10, 5), 'c': i})
n = len(d)
s = d[0]
for i in range(1,n):
    s = s.merge(d[i], on="a")
    print(s.shape)
(5, 5)
(5, 7)
(5, 9)
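
To see whether this applies to your data, you could count repeated "date" values in each frame before merging, and pass validate="one_to_one" so that merge fails early instead of silently multiplying rows. This is only a sketch: the frames and dates below are invented stand-ins for your dictionary, and keeping the first row per date is just one option. Whether to drop, aggregate or keep the duplicates depends on what they mean in your data.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary of DataFrames in the question.
dictionary = {
    "first": pd.DataFrame({"date": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
                           "u": [1, 2, 3, 4]}),
    "second": pd.DataFrame({"date": ["2020-01-01", "2020-01-02", "2020-01-03"],
                            "v": [5, 6, 7]}),
}

# Count how many repeated "date" values each frame contains.
dup_counts = {key: int(df["date"].duplicated().sum()) for key, df in dictionary.items()}
print(dup_counts)

# validate="one_to_one" makes merge raise if the key is not unique on both sides.
try:
    dictionary["first"].merge(dictionary["second"], on="date", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)

# One possible fix (an assumption about what you want): keep the first row per date.
deduped = {key: df.drop_duplicates(subset="date") for key, df in dictionary.items()}
result = deduped["first"].merge(deduped["second"], on="date", validate="one_to_one")
print(result.shape)
```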

Merging in this way can also produce duplicate column names, because the default suffixes _x and _y are reused across successive merges:

   a  b_x  c_x  b_y  c_y  b_x  c_x  b_y  c_y
0  A    4    0    5    1    0    2    9    3
1  B    5    0    8    1    3    2    0    3
2  C    6    0    0    1    5    2    8    3
3  D    2    0    0    1    8    2    8    3
4  E    8    0    2    1    7    2    9    3

s['b_x']
   b_x  b_x
0    4    0
1    5    3
2    6    5
3    2    8
4    8    7
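
One way to sidestep the repeated _x/_y suffixes is to give every column a name unique to its source frame before combining. Newer pandas releases (2.0 onward, as far as I know) raise a MergeError rather than producing duplicate column names, so the merge loop may stop at the third merge there. The sketch below assumes the key column has no duplicates. Using the dictionary key as the suffix and pd.concat in place of repeated merges are my choices, not part of the question:

```python
import numpy as np
import pandas as pd

# Same kind of data as the example above, with unique keys in 'a'.
d = {i: pd.DataFrame({"a": list("ABCDE"),
                      "b": np.random.randint(0, 10, 5),
                      "c": i})
     for i in range(4)}

# Index each frame by the key and tag its columns with the dictionary key,
# so 'b' from d[2] becomes 'b_2'.
frames = [df.set_index("a").add_suffix(f"_{key}") for key, df in d.items()]

# Align all frames on the shared index in one step, then restore 'a' as a column.
s = pd.concat(frames, axis=1).rename_axis("a").reset_index()
print(s.columns.tolist())
print(s.shape)
```

Unlike merge, which defaults to an inner join, pd.concat with axis=1 keeps every key that appears in any frame unless you pass join="inner".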

Upvotes: 2
