Reputation: 608
I have a working code, which iterates df
and returns return other_df
. I am trying to vectorise it, as it is slow. Am trying to create a func
to df.apply(func)
.
The resulting dataframe is of longer length, that is why I seem to need to return an other dataframe as a result of .apply
.
My initial df
is a list of apartments, that contains a column of lists of rooms and their properties.
Each row contains something of this kind:
rooms | apartment number
[['375', 'LET', ''], | 12345
['335', 'LET', ''], |
['360', 'LET', ''], |
['295', 'double', ''],|
['360', 'LET', '']] |
__________________________________________________
I need a resulting df like:
apartment number | room number | price | if let
12345 | 12345-1 | 375 | True
12345 | 12345-2 | 335 | True
12345 | 12345-3 | 360 | True
12345 | 12345-4 | 295 | False
12345 | 12345-5 | 360 | True
The resulting df should be a df
of rooms. While transforming, there is some data cleaning and extraction is done, including room number assignment, based on object index in list, stored in initial df cell, which is I am not sure if possible to vectorise at all (?)
I see my options as wether to get all done at once with .apply
, if that is possible. If it is not, than I would need to cover initial df
into multi-index and than to pivot table.
My draft code looks like this:
def rooms_df(row):
columns=['room_price',
'room_type',
'en_suite',
'if_let',
'room_number',
'listing_id']
df = pd.DataFrame(columns=columns)
for room in row['rooms']:
number=0
if room[0] == 'na':
room_price = None
room_type = None
en_suite = None
if_let = None
elif room[0] == 'occupied':
room_price = None
room_type = None
en_suite = None
if_let = True
else:
room_price = room[0]
if 'single' in room:
room_type = 'single'
elif 'double' in room:
room_type = 'double'
else:
room_type = None
if 'suite' in room:
en_suite = True
else:
en_suite = False
if 'LET' in room:
if_let = True
else:
if_let = False
listing_id = row['listing_id']
number = number+1
room_number = f'{listing_id}-{number}'
Thank you for your ideas!
Upvotes: 1
Views: 999
Reputation: 1614
Split your rooms
column which is a column of list of lists into separate rows, each with a list.
df_new = pd.DataFrame(df['rooms'].tolist()) \
.merge(df, left_index = True, right_index = True) \
.drop('rooms', axis=1) \
.melt(id_vars = ['apt'], value_name = 'rooms') \
.drop('variable', axis=1)
OUTPUT:
apt rooms
12345 ['375', 'LET', '']
12345 ['335', 'LET', '']
12345 ['360', 'LET', '']
12345 ['295', 'double', '']
12345 ['360', 'LET', '']
Now split each element in rooms
into separate columns:
df_new[['price','if_let', 'foo']] = pd.DataFrame(df_new['rooms'].values.tolist(), index=df_new.index)
df_new = df_new.drop(['rooms', 'foo'], axis=1)
OUTPUT:
apt price if_let
12345 375 LET
12345 335 LET
12345 360 LET
12345 295 double
12345 360 LET
If you have unequal number of elements in your list, you can use add_prefix
. This will create new columns, the number of which is equal to the maximum size of lists in the column.
pd.DataFrame(df_new['rooms'].values.tolist(), index=df_new.index).add_prefix('foo_')
You can rename the columns later.
Make new column for room number by grouping on apt
and using cumcount
:
df_new['count'] = df_new.groupby('apt').cumcount()+1
df_new['room_num'] = df_new['apt'].astype(str) + '-' + df_new['count'].astype(str)
OUTPUT:
apt price if_let count room_num
12345 375 LET 1 12345-1
12345 335 LET 2 12345-2
12345 360 LET 3 12345-3
12345 295 double 4 12345-4
12345 360 LET 5 12345-5
You can now modify the columns as you like. For instance:
df_new['if_let] = np.where(df_new['if_let'] == 'LET', True, False)
Try not to use
df.apply
if you have a big dataframe because it will make your operation really slow.
Upvotes: 2