Reputation: 383
I have a largish DataFrame with a date index ['Date'] and several columns. One column is a string identifier ['Type'], with related data in the remaining columns. I need to add newData to the DataFrame, but only if the date-type pair (i.e. index-ColumnValue pair) is not already present in the DataFrame. Checking for the existing pairing takes up ~95% of my code computing time, so I really need to find a quicker way to do it.
Options already considered, with timings in increasing order of speed:
existing_pair = len(compiledData[(compiledData['Type'] == newData['Type'])
& (compiledData.index == newData['Date'])]) > 0
# average = 114 ms
existing_pair = newData['Date'] in compiledData[compiledData['Type'] ==
newData['Type']].index
# average = 68 ms
existing_pair = (compiledData[compiledData.index == newData['Date']]['Type']
                 .isin([newData['Type']]).any())
# average = 44 ms
I am relatively new to Python so I am sure there are better (= faster) ways of checking for an index-colVal pair. Or, it may be that my entire data structure is wrong. Would appreciate any pointers anyone can offer.
Edit: Sample of the compiledData dataframe:
Type ColA ColB ColC ColD ColE ColF
2021-01-19 B 83.0 -122.15 0.0 11.0 11.000 11.0
2021-01-19 D 83.0 -1495.48 0.0 11.0 11.000 11.0
2021-03-25 D 83.0 432.00 0.0 11.0 11.000 11.0
2021-04-14 D 83.0 646.00 0.0 11.0 11.000 11.0
2021-04-16 A 20.0 11.00 0.0 30.0 11.000 11.0
2021-04-25 D 83.0 -26.82 0.0 11.0 11.000 11.0
2021-04-28 B 83.0 -651.00 0.0 11.0 11.000 11.0
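For reference, a small made-up snippet that mirrors this structure (values are illustrative only, not my real data), in case it helps to reproduce the checks above:
import pandas as pd

# Illustrative stand-in for compiledData: a date index plus a 'Type' column.
compiledData = pd.DataFrame(
    {
        "Type": ["B", "D", "D"],
        "ColA": [83.0, 83.0, 83.0],
        "ColB": [-122.15, -1495.48, 432.00],
    },
    index=pd.to_datetime(["2021-01-19", "2021-01-19", "2021-03-25"]).rename("Date"),
)
# newData behaves like a single record with 'Date' and 'Type' keys.
newData = {"Date": pd.Timestamp("2021-03-25"), "Type": "D", "ColA": 83.0, "ColB": 432.00}
# The check being timed above: is this (Date, Type) pair already present?
existing_pair = newData["Date"] in compiledData[compiledData["Type"] == newData["Type"]].index
print(existing_pair)  # True for this example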
Upvotes: 5
Views: 1157
Reputation: 11171
Index value lookup is faster than column value lookup. I don't know the implementation details, but lookup time appears to depend on the number of rows. Here is a performance comparison:
import numpy as np
import pandas as pd

def test_value_matches(df, v1, v2):
    # return True if a row with c1 == v1 and c2 == v2 exists, else return False
    if ((df.c1 == v1) & (df.c2 == v2)).any():
        return True
    return False

def test_index_matches(df, v1, v2):
    # returns True if (v1, v2) found in the (multi) index, else returns False
    if (v1, v2) in df.index:
        return True
    return False

# test dependence of funcs above on num rows in df:
for n in [int(j) for j in [1e4, 1e5, 1e6, 1e7]]:
    df = pd.DataFrame(np.random.random(size=(n, 2)), columns=["c1", "c2"])
    v1, v2 = df.sample(n=1).iloc[0]
    %timeit test_value_matches(df, v1, v2)

    # create an index based on column values:
    df2 = df.set_index(["c1", "c2"])
    %timeit test_index_matches(df2, v1, v2)
Output:
421 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.5 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
557 µs ± 5.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.3 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.77 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.5 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
22.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
28.1 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that this ignores indexing time itself, which may be significant; this approach probably works best with repeated lookups on the same df. For n=1e7, the performance is something like your problem on my machine; the indexed version is ~1000x faster (although it apparently grows with n).
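Applied to the question's setup, a minimal sketch (the toy frame and the pair_exists helper are mine, for illustration): build the (Date, Type) MultiIndex once, then each check is a cheap index lookup.
import pandas as pd

# Toy stand-in for compiledData: date index plus a 'Type' column.
compiledData = pd.DataFrame(
    {"Type": ["B", "D"], "ColA": [83.0, 83.0]},
    index=pd.to_datetime(["2021-01-19", "2021-01-19"]).rename("Date"),
)
# Build the (Date, Type) MultiIndex once up front.
indexed = compiledData.set_index("Type", append=True)

def pair_exists(indexed_df, date, type_):
    # True if the (date, type_) pair is already in the MultiIndex.
    return (date, type_) in indexed_df.index

print(pair_exists(indexed, pd.Timestamp("2021-01-19"), "D"))  # True
print(pair_exists(indexed, pd.Timestamp("2021-03-25"), "D"))  # False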
Upvotes: 2
Reputation: 41327
Try using Index.difference with a MultiIndex:
It looks like compiledData already has a date index, so append Type to the index:
compiledData = compiledData.set_index('Type', append=True)
And it looks like newData has Date as an independent column, so set its index to ['Date', 'Type']:
newData = newData.set_index(['Date', 'Type'])
Now that both have a date/type MultiIndex, take their Index.difference to get the unique newData indexes:
unique = newData.index.difference(compiledData.index)
So the newData.loc[unique] rows can be added using append:
compiledData.append(newData.loc[unique]).reset_index(level=1)
Or concat (preferred on newer pandas versions, where DataFrame.append is deprecated):
pd.concat([compiledData, newData.loc[unique]]).reset_index(level=1)
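Putting the steps together on made-up data (a sketch only; the frames below are illustrative, not the question's actual data):
import pandas as pd

compiledData = pd.DataFrame(
    {"Type": ["B", "D"], "ColA": [83.0, 83.0]},
    index=pd.to_datetime(["2021-01-19", "2021-01-19"]).rename("Date"),
)
newData = pd.DataFrame(
    {"Date": pd.to_datetime(["2021-01-19", "2021-03-25"]),
     "Type": ["D", "D"],
     "ColA": [83.0, 83.0]},
)

# Share a (Date, Type) MultiIndex, keep only unseen pairs, then combine.
compiledData = compiledData.set_index("Type", append=True)
newData = newData.set_index(["Date", "Type"])
unique = newData.index.difference(compiledData.index)
result = pd.concat([compiledData, newData.loc[unique]]).reset_index(level=1)
print(result)  # only the 2021-03-25 / D row is added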
Upvotes: 1