Reputation: 881
I have a tsv file with multiple columns. There are 10 and more columns but the columns important to me are the ones with the name user_name, shift_id, url_id. I want to create a data frame that first separates the entire csv file based on user_names i.e only rows with same user_name are grouped together. From that chunk I make another chunk where only rows with certain shift_id are grouped together and then from that chunk make a chunk with same url. I unfortunately cannot share the data because of the company rule and making an imaginary data table might be more confusing.
Two of the other columns have time-stamps. I want to calculate the time duration of the chunk but only after I group chunk according to those columns.
I have seen answers that split data-frame by a specific column value,but in my case I have three column values and the order in which they are separated matters too.
Thank you for your help!
Upvotes: 3
Views: 1485
Reputation: 18208
Assuming you read the columns to dataframe
df = pd.DataFrame({'col1':[1,2,3], 'col2':[4,5,6],'col3':[7,8,9],
'col4':[1,2,3],'col5':[1,2,3],'col6':[1,2,3],
'col7':[1,2,3],'col8':[1,2,3],'col9':[1,2,3],
'col91':[1,2,3]})
print(df)
Output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col91
0 1 4 7 1 1 1 1 1 1 1
1 2 5 8 2 2 2 2 2 2 2
2 3 6 9 3 3 3 3 3 3 3
Now, we can select only three columns of interest, let it be col1, col2, and col3
tmp_df = df[['col1', 'col2', 'col3']]
print(tmp_df)
Output:
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 9
Further we want to filter based on three column values:
final_df = tmp_df[(tmp_df.col1 == 1) & (tmp_df.col2 == 4) & (tmp_df.col3== 7)]
print(final_df)
Output:
col1 col2 col3
0 1 4 7
After reading to dataframe
, all these above steps can be acheived in single line:
final = df[['col1', 'col2', 'col3']][(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final
Hope it helps!
df = pd.DataFrame({'col1':[1,1,1,1,1], 'col2':[4,4,4,4,7],'col3':[7,7,9,7,7],
'col4':['X','X','X','X','X'],'col5':['X','X','X','X','X'],'col6':['X','X','X','X','X'],
'col7':['X','X','X','X','X'],'col8':['X','X','X','X','X'],'col9':['X','X','X','X','X'],
'col91':['X','X','X','X','X']})
print(df)
Output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col91
0 1 4 7 X X X X X X X
1 1 4 7 X X X X X X X
2 1 4 9 X X X X X X X
3 1 4 7 X X X X X X X
4 1 7 7 X X X X X X X
Now, usinig similar masking as above:
final = df[(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final
Output:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col91
0 1 4 7 X X X X X X X
1 1 4 7 X X X X X X X
3 1 4 7 X X X X X X X
Upvotes: 2