Reputation: 11

Delete rows not present in all DataFrame in pandas

Background

There is 5 DataFrames, I will call them b1, b2, b3, b4, b5.

They have data structure which consists of the columns ['Date', 'Value']

I got the data from 2014 to 2015.

Problem

Every DataFrame has different date counting system. So, I want to get rid of the rows in the DataFrames which have non-matching dates.

How can I do that?

What I have tried

So, I will delete all the data if Date of the data is not included in all DataFrames: b1, b2, b3, b4, b5

This works for the first time

for i in range(len(b2.index)):
    k = 0
    for j in range(len(b1.index)):
        if b2['Date'][i] == b1['Date'][j]:
            k = k+1
        else:
            k = k
    if k == 1:
        pass
    if k == 0:
        b2 = b2.drop([i])

But, after that I excute this code one more time after I did that, there would be some error like this:

KeyError Traceback (most recent call last) in () 2k = 0 3for j in range(len(b2.index)): ----> 4 if b1['Date'][i] == b2['Date'][j]: 5 k = k+1 6 else:

C:\Users\cms\Anaconda\lib\site-packages\pandas\core\series.pyc in getitem(self, key) 519def getitem(self, key): 520 try: --> 521 result = self.index.get_value(self, key) 522 523 if not np.isscalar(result):

C:\Users\cms\Anaconda\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key) 1593 1594 try: -> 1595 return self._engine.get_value(s, k) 1596 except KeyError as e1: 1597 if len(self) > 0 and self.inferred_type in ['integer','boolean']:

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3113)()

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2844)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)()

pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7224)()

pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7162)()

KeyError: 28L

What I want to do is

mlist = (b1,b2,b3,b4,b5)
for q in mlist:
    for r in mlist:
        for i in range(len(q.index)):
            k = 0
            for j in range(len(r.index)):
                if q['Date'][i] == r['Date'][j]:
                    k = k+1
                else:
                    k = k
            if k == 1:
                pass
            if k == 0:
                q = q.drop([i])`enter code here`

Upvotes: 1

Answers (3)

dermen

Reputation: 5382

From this posting you can see an answer

merged_mlist = reduce(lambda left,right: pandas.merge(left,right,on='Date', how='inner'), mlist)

I believe the reason your code is failing is because you are not resetting the index after dropping. I think if you want to do this manually, you can do something like store the indices first and then drop all at once

inds_to_drop = []
for i in range(len(b2)):
    for j in range(len(b1)):
        if b2['Date'][i] != b1['Date'][j]:
            inds_to_drop.append(i)
b2.drop( b2.index[inds_to_drop], inplace=True)    
b2.reset_index( drop=True,inplace=True) # this may or may not be necessary, havent thought all the way through

Better yet, if you still want to implement your for loop

inds_to_drop = [ i for i,dt in enumerate(b2.Date) if not pandas.np.where( b1.Date==dt)[0].size ]

You can do that for each iteration.. Though it is just easier to run the reduce function paired with pandas.merge

Upvotes: 0

firelynx

Reputation: 32244

You want to keep the dates that are present in all datasets.

This can easily be done by performing an inner join on all of them using the pandas.merge() function.

b = b1.merge(on='Date', right=b2, how='inner', suffixes=['', '_b2'])
b = b.merge(on='Date', right=b3, how='inner', suffixes=['', '_b3'])
b = b.merge(on='Date', right=b4, how='inner', suffixes=['', '_b4'])
b = b.merge(on='Date', right=b5, how='inner', suffixes=['_b1', '_b5'])

I am making some assumptions about your data here since you did not post any example data yet.

If I got anything wrong about your data, please let me know and I will correct my example.

Example:

b1 = pd.DataFrame({'Date':pd.date_range('2015-05-05', '2015-05-10'), 'Value':range(1,7)})
        Date  Value
0 2015-05-05      1
1 2015-05-06      2
2 2015-05-07      3
3 2015-05-08      4
4 2015-05-09      5
5 2015-05-10      6

b2 = pd.DataFrame({'Date':pd.date_range('2015-05-07', '2015-05-12'), 'Value':range(4,10)})
        Date  Value
0 2015-05-05      1
1 2015-05-06      2
2 2015-05-07      3
3 2015-05-08      4
4 2015-05-09      5
5 2015-05-10      6

b = b1.merge(on='Date', right=b2, how='inner', suffixes=['_b1', '_b2'])
        Date  Value_b1  Value_b2
0 2015-05-07         3         4
1 2015-05-08         4         5
2 2015-05-09         5         6
3 2015-05-10         6         7

Upvotes: 0

Jihun

Reputation: 1485

How about this? Use set.intersecton() :

s = set(b1.index)
for b in [b2,b3,b4,b5]:
    s=s.intersection(set(b.index))
for b in [b1,b2,b3,b4,b5]:
    b=b.drop(b.index[-b.index.isin(s)],inplace=True)

Upvotes: 0

Delete rows not present in all DataFrame in pandas

Background

Problem

What I have tried

Answers (3)

Related Questions