Reputation: 107
I have a large .csv file that has 11,000,000 rows and 3 columns: id, magh, mixid2. What I have to do is select the rows with the same id and then check whether these rows all have the same mixid2; if they do, I remove the rows, and if they don't, I initialize a class with the information of the selected rows. This is my code:
obs = obs.set_index('id')
obs = obs.sort_index()

# dropping elements with only one mixid2 and filling S
ID = obs.index.unique()
S = []
good_bye_list = []
for i in tqdm(ID):
    app = obs.loc[i]
    if len(np.unique(app['mixid2'])) != 1:
        # fill the class list
        S.append(star(app['magh'].values, app['mixid2'].values, z_in))
    else:
        # drop
        good_bye_list.append(i)
obs = obs.drop(good_bye_list)
The .csv file is very large, so it takes 40 minutes to compute everything. How can I improve the speed?
Thank you for the help.
This is the .csv file:
id,mixid2,magh
3447001203296326,557,14.25
3447001203296326,573,14.25
3447001203296326,525,14.25
3447001203296326,541,14.25
3447001203296330,540,15.33199977874756
3447001203296330,573,15.33199977874756
3447001203296333,172,17.476999282836914
3447001203296333,140,17.476999282836914
3447001203296333,188,17.476999282836914
3447001203296333,156,17.476999282836914
3447001203296334,566,15.626999855041506
3447001203296334,534,15.626999855041506
3447001203296334,550,15.626999855041506
3447001203296338,623,14.800999641418455
3447001203296338,639,14.800999641418455
3447001203296338,607,14.800999641418455
3447001203296344,521,12.8149995803833
3447001203296344,537,12.8149995803833
3447001203296344,553,12.8149995803833
3447001203296345,620,12.809000015258787
3447001203296345,543,12.809000015258787
3447001203296345,636,12.809000015258787
3447001203296347,558,12.315999984741213
3447001203296347,542,12.315999984741213
3447001203296347,526,12.315999984741213
3447001203296352,615,12.11299991607666
3447001203296352,631,12.11299991607666
3447001203296352,599,12.11299991607666
3447001203296360,540,16.926000595092773
3447001203296360,556,16.926000595092773
3447001203296360,572,16.926000595092773
3447001203296360,524,16.926000595092773
3447001203296367,490,15.80799961090088
3447001203296367,474,15.80799961090088
3447001203296367,458,15.80799961090088
3447001203296369,639,15.175000190734865
3447001203296369,591,15.175000190734865
3447001203296369,623,15.175000190734865
3447001203296369,607,15.175000190734865
3447001203296371,460,14.975000381469727
3447001203296373,582,14.532999992370605
3447001203296373,614,14.532999992370605
3447001203296373,598,14.532999992370605
3447001203296374,184,14.659000396728516
3447001203296374,203,14.659000396728516
3447001203296374,152,14.659000396728516
3447001203296374,136,14.659000396728516
3447001203296374,168,14.659000396728516
3447001203296375,592,14.723999977111815
3447001203296375,608,14.723999977111815
3447001203296375,624,14.723999977111815
3447001203296375,92,14.723999977111815
3447001203296375,76,14.723999977111815
3447001203296375,108,14.723999977111815
3447001203296375,576,14.723999977111815
3447001203296376,132,14.0649995803833
3447001203296376,164,14.0649995803833
3447001203296376,180,14.0649995803833
3447001203296376,148,14.0649995803833
3447001203296377,168,13.810999870300293
3447001203296377,152,13.810999870300293
3447001203296377,136,13.810999870300293
3447001203296377,184,13.810999870300293
3447001203296378,171,13.161999702453613
3447001203296378,187,13.161999702453613
3447001203296378,155,13.161999702453613
3447001203296378,139,13.161999702453613
3447001203296380,565,13.017999649047852
3447001203296380,517,13.017999649047852
3447001203296380,549,13.017999649047852
3447001203296380,533,13.017999649047852
3447001203296383,621,13.079999923706055
3447001203296383,589,13.079999923706055
3447001203296383,605,13.079999923706055
3447001203296384,541,12.732000350952148
3447001203296384,557,12.732000350952148
3447001203296384,525,12.732000350952148
3447001203296385,462,12.784000396728516
3447001203296386,626,12.663999557495115
3447001203296386,610,12.663999557495115
3447001203296386,577,12.663999557495115
3447001203296389,207,12.416000366210938
3447001203296389,255,12.416000366210938
3447001203296389,223,12.416000366210938
3447001203296389,239,12.416000366210938
3447001203296390,607,12.20199966430664
3447001203296390,591,12.20199966430664
3447001203296397,582,16.635000228881836
3447001203296397,598,16.635000228881836
3447001203296397,614,16.635000228881836
3447001203296399,630,17.229999542236328
3447001203296404,598,15.970000267028807
3447001203296404,631,15.970000267028807
3447001203296404,582,15.970000267028807
3447001203296408,540,16.08799934387207
3447001203296408,556,16.08799934387207
3447001203296408,524,16.08799934387207
3447001203296408,572,16.08799934387207
3447001203296409,632,15.84000015258789
3447001203296409,616,15.84000015258789
Upvotes: 2
Views: 144
Reputation: 720
You can count the rows per (id, mixid2) pair and keep only the pairs that occur more than once (after agg('count'), the magh column holds the row count; obs is the full DataFrame from the question):
new_df = obs.groupby(['id', 'mixid2'], as_index=False).agg('count')
new_df = new_df[new_df['magh'] > 1]
then pass new_df to your function.
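For instance, a minimal sketch of that last step, assuming star and z_in are the class and argument from the question:
# keep only the original rows whose (id, mixid2) pair survived the filter
keep = obs.merge(new_df[['id', 'mixid2']], on=['id', 'mixid2'])
# build one star object per remaining id
S = [star(grp['magh'].values, grp['mixid2'].values, z_in)
     for _, grp in keep.groupby('id')]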
Upvotes: 0
Reputation: 106
I'm not entirely sure if I understood you correctly. But what you can do is first remove duplicates in your DataFrame and then use the groupby function to get all the remaining data points with the same id:
# dropping all duplicates based on id and mixid2
df.drop_duplicates(["id", "mixid2"], inplace=True)

# then iterate over all groups:
for index, grp in df.groupby(["id"]):
    pass  # do stuff here with the grp
Normally it is a good idea to rely on pandas' internal functions, since they are usually well optimised.
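For illustration, a sketch of how the question's loop body could slot into this pattern; star and z_in are taken from the question, and the nunique check mirrors its np.unique test:
S = []
for index, grp in df.groupby(["id"]):
    # only build the class when the id has more than one distinct mixid2
    if grp["mixid2"].nunique() > 1:
        S.append(star(grp["magh"].values, grp["mixid2"].values, z_in))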
Upvotes: 0
Reputation: 2887
Hello and welcome to StackOverflow.
In pandas the rule of thumb is that raw loops are always slower than the dedicated functions. To apply a function to a sub-DataFrame of rows that fulfill certain criteria, you can use groupby.
In your case the function is a bit ... unpythonic, as the instantiation of S is a side effect, and deleting the rows you are currently iterating over is dangerous. For example, in a dictionary you should never do this. That said, you can create a function like this:
def my_func(df):
    if df['mixid2'].nunique() == 1:
        return None
    else:
        S.append(df['mixid2'])
        return df
and apply it to your DataFrame via
S = []
obs.groupby('id').apply(my_func)
This iterates over all sub-DataFrames with the same id and drops them if there is exactly one unique value in mixid2. Otherwise it appends the values to the list S.
The resulting DataFrame is 3 rows shorter:
id mixid2 magh
id
3447001203296326 0 3447001203296326 557 14.250000
1 3447001203296326 573 14.250000
... ... ... ...
3447001203296409 98 3447001203296409 632 15.840000
99 3447001203296409 616 15.840000
[97 rows x 3 columns]
and S contains 28 elements, which you could pass into the star constructor just as you did.
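If apply is still too slow on 11,000,000 rows, the same filter can also be written fully vectorised with transform('nunique'); this is only a sketch (it assumes id is a regular column rather than the index, and star and z_in again come from the question):
# boolean mask: True for rows whose id has more than one distinct mixid2
mask = obs.groupby('id')['mixid2'].transform('nunique') > 1
obs = obs[mask]
# one star object per surviving id
S = [star(grp['magh'].values, grp['mixid2'].values, z_in)
     for _, grp in obs.groupby('id')]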
Upvotes: 1
Reputation: 3224
I guess you want to groupby and exclude all the elements where mixid2 appears only once, using set_index. To get the original shape back, we use reset_index after the filtering.
df = obs.set_index('mixid2').loc[~obs.groupby('mixid2').count().id.eq(1)].reset_index()
df.shape
(44, 3)
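Broken into steps, the one-liner reads like this (a sketch with intermediate names added only for clarity):
# number of rows per mixid2 value
counts = obs.groupby('mixid2').count().id
# True for mixid2 values that appear more than once
keep = ~counts.eq(1)
# the boolean mask aligns on the mixid2 index before filtering
df = obs.set_index('mixid2').loc[keep].reset_index()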
Upvotes: 0