Reputation: 429
So I am back with another question about Python and pandas. I have table1 with the following columns:
ID;COUNT;FOREIGN_ID;OTHER_DATA
1;3;xyz1
2;1;xyz2
3;1;xyz3
table2
ID;FOREIGN_ID;OTHER_DATA
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
6;xyz2;000000
7;xyz2;000000
8;xyz3;000000
9;xyz3;000000
Both tables are stored as CSV files. I load both of them into dataframes and then iterate through table1. For each row I must find all records in table2 with the same FOREIGN_ID and randomly select some of them.
import pandas as pd
import numpy as np

df_result = pd.DataFrame()
df_table1 = pd.read_csv(table1, delimiter=';')
df_table2 = pd.read_csv(table2, delimiter=';')
for index, row in df_table1.iterrows():
    # all rows in table2 that share this row's FOREIGN_ID
    df_candidates = df_table2[df_table2['FOREIGN_ID'] == row['FOREIGN_ID']]
    # pick COUNT of them at random, without repetition
    random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
    df_result = df_result.append(df_candidates.iloc[random_numbers])
In my earlier question I got an answer saying that using a for loop is a big time waster... But for this problem I can't find a solution that doesn't need a for loop.
EDIT: I am sorry for editing my question so late... I was busy with other stuff. As requested, below is the result table. Please note that my real tables are slightly different from those below: in my real use I join the tables on 3 foreign keys, but for demonstration I am using tables with fake data.
So the logic should be something like this: Read the first line of table1.
1;3;xyz1
Find all records in table2 with the same FOREIGN_ID (count = 3, foreign_id = xyz1). The rows with foreign_id = xyz1 are:
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
Because count = 3, I must randomly choose 3 of those records. I do this with the following line (df_candidates is the table of all suitable records shown above):
random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
Then I store the randomly chosen records in df_result. After parsing all rows of table1, I write df_result to a CSV.
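For completeness, the final write step might look like this (just a sketch; 'result.csv' is a placeholder file name and ';' matches the input delimiter):

# write the collected rows out once at the end; the file name is a placeholder
df_result.to_csv('result.csv', sep=';', index=False)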
The problem is that my tables are 0.5 to 1 million rows, so iterating through every row in table1 is really slow... I am sure there is a better way of doing this, but I've been stuck on it for the past 2 days.
Upvotes: 0
Views: 72
Reputation: 3733
To select only the rows whose FOREIGN_ID appears in table1, you can use, for example, pd.merge:
col = "FOREIGN_ID"
left = df_table2
right = df_table1[[col]]
filtered = pd.merge(left=left, right=right, on=col, how="inner")
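Since the question mentions that the real data is joined on 3 foreign keys, note that on also accepts a list of columns. A minimal sketch with hypothetical key names (not the ones in the sample data):

# hypothetical key names for illustration; the sample data above has only FOREIGN_ID
keys = ["FOREIGN_KEY_1", "FOREIGN_KEY_2", "FOREIGN_KEY_3"]
filtered = pd.merge(left=df_table2, right=df_table1[keys], on=keys, how="inner")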
Or df.isin():
ix = df_table2[col].isin(df_table1[col])
filtered = df_table2[ix]
Then, to select a random row from each group:
def select_random_row(grp):
    choice = np.random.randint(len(grp))
    return grp.iloc[choice]

filtered.groupby(col).apply(select_random_row)
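The function above draws a single row per group. To draw the per-group COUNT requested in table1, one option (a sketch, assuming FOREIGN_ID is unique in table1 and every group has at least COUNT matching rows) is to look the count up inside the applied function:

# map each FOREIGN_ID to its requested sample size from table1
counts = df_table1.set_index("FOREIGN_ID")["COUNT"]

def select_random_rows(grp):
    # grp.name is the FOREIGN_ID of the current group
    n = counts[grp.name]
    return grp.sample(n=n, replace=False)

df_result = filtered.groupby(col, group_keys=False).apply(select_random_rows)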
Upvotes: 1
Reputation: 462
Have you looked into using pd.merge()?
Your call would look something like:
results = pd.merge(df_table1, df_table2, how='inner', on='FOREIGN_ID')
Upvotes: 1