Reputation: 59
How do I use pandas to come up with a joined result of aoiFeatures and allFeaturesReadings that results in this:
183 0.03
845 0.03
853 0.01
Given the following starting code and data:
import numpy
import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]
allFeaturesReadings = zip(allFeatures, allReadings)
#
# Use pandas to create Series and Join here?
#
sAllFeaturesReadings = pd.Series(dict(allFeaturesReadings))
sAOIFeatures = pd.Series(numpy.ma.filled(aoiFeatures))
sIndexedAOIFeatures = sAOIFeatures.reindex(numpy.ma.filled(aoiFeatures))
result = pd.concat([sIndexedAOIFeatures,sAllFeaturesReadings], axis=1, join='inner')
Upvotes: 1
Views: 44
Reputation: 879621
You could use isin
:
import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]
df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
result = df.loc[df['features'].isin(aoiFeatures)]
print(result)
yields
features readings
3 183 0.03
6 845 0.03
10 853 0.01
If you plan on selecting rows based on feature
values often, and if the features
can be made into a unique Index, and if the DataFrame is at least moderately large (say ~10K rows) then it may be better (for performance) to make features
the index:
import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]
df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
result = df.loc[aoiFeatures]
print(result)
yields
readings
183 0.03
845 0.03
853 0.01
Here is the setup I used to make the IPython %timeit tests:
import pandas as pd
N = 10000
allFeatures = np.repeat(np.arange(N), 1)
allReadings = np.random.random(N)
aoiFeatures = np.random.choice(allFeatures, N//10, replace=False)
def using_isin():
df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
for i in range(1000):
result = df.loc[df['features'].isin(aoiFeatures)]
return result
def using_index():
df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
for i in range(1000):
result = df.loc[aoiFeatures]
return result
This shows using_index
can be a bit faster:
In [108]: %timeit using_isin()
1 loop, best of 3: 697 ms per loop
In [109]: %timeit using_index()
1 loop, best of 3: 432 ms per loop
Note however, if allFeatures
contains duplicates, then making it the Index is NOT advantageous. For example, if you change the setup above to use:
allFeatures = np.repeat(np.arange(N//2), 2) # repeat every value twice
then
In [114]: %timeit using_isin()
1 loop, best of 3: 667 ms per loop
In [115]: %timeit using_index()
1 loop, best of 3: 3.47 s per loop
Upvotes: 0
Reputation: 4660
Without needing to zip you can do:
df = pd.DataFrame(data={"allFeatures":allFeatures, "allReadings":allReadings})
df[df["allFeatures"].isin(aoiFeatures)]
Upvotes: 1