ARJ
ARJ

Reputation: 2080

Using the first and last values of two columns and generate new data frame based on conditions

Say I have the following data frame,

df.head()
    ID  start    end    symbol    symbol_id    type
    1   146       291      bus    bus-201  CDS
    1   146      314      bus    bus-201  trans
    1   603       243      bus    bus-201  CDS
    1   1058      2123      car    car-203  CDS
    1   910       81      car    car-203  ex
    1   2623      2686      car    car-203  CDS
    1   5948       6043      car    car-203  CDS
    1   6348       6474      car    car-203  CDS
    1   910       81      car    car-201  ex
    1   910       81      car    car-201  ex
    1   636      650      car    car-203  CDS
    1   202      790      train    train-204  CDS
    1   200      314      train    train-204  CDS
    1   202      837      train    train-204  CDS

Now from the above data frame, I need to group by items based on column symbol_id if column type is CDS. Then, I need to use the first value from the column start as the value in start column of the new data frame and last value from columnendas the value in columnend`.

Finally, the df2 should look like,

start    end    symbol    symbol_id    type
146     243    bus        bus-203     CDS
1058    650    car        car-203     CDS
202     837    train      train-204    CDS

I have tried using list of values from df['symbol'],

sym_list=df['symbol'].tolist().drop_duplicates()
    for symbol in df['symbol'].values:
        if symbol in tuple(sym_list):
           df_symbol =df['symbol'].isin(symbol)

which threw the following error,

TypeError: only list-like objects are allowed to be passed to isin(), you passed a [str]

I was trying to capture the first and last value for each symbol and symbol_id value using,

start = df.query('type =="CDS"')[['start']].iloc[0]    
end = df.query('type =="CDS"')[['end']].iloc[-1]  

However, my data frame is quite big and I have more than 50,000 unique values for symbol, hence I need a better solution here.

Any help or suggestions are appreciated!!

Upvotes: 1

Views: 43

Answers (2)

Mohit Motwani
Mohit Motwani

Reputation: 4792

Try:

df_group = df[df['type']=='CDS'].groupby(['symbol_id', 'symbol', 'type'])
df_new = pd.DataFrame(columns =['start', 'end'])
df_new[['start', 'end']] = df_group.agg({'start':'first', 'end': 'last'})
df_new.reset_index()

   symbol_id    symbol  start   end type
0   bus-201      bus    146     243 CDS
1   car-203      car    1058    650 CDS
2   train-204   train   202     837 CDS

Edited using agg as used by @Dev Khadka

Upvotes: 1

Dev Khadka
Dev Khadka

Reputation: 5451

you can do it using group by and first and last aggrigate function

df[df["type"]=="CDS"].groupby("symbol_id").agg({"start":"first", "end":"last", "symbol":"first","symbol_id":"first", "type":"first"})

Upvotes: 2

Related Questions