Reputation: 11
I'm just learning Python and for my first project I am trying to re-format a excel table that I can use on GIS. The Table have many columns with x for each corresponding records. I need to assign (replace the x) with the column names and concatenate all rows separated by commas. I was told that Pandas is a very good library to accomplish this. I did started (see sample code) but I am not sure what to do next. Any help or suggestions will be greatly appreciated. Here is a visual representation of what I am trying to accomplish:
Sample Code:
import pandas as pd
input_excel = r"C:\Projects\... Habitat_table.xlsx" # excel sheet path
excel = pd.read_excel(input_excel, sheet_name = 'Species_habitat') # sheet name
final_dataframe = pd.DataFrame (excel, columns=[‘Habitat_A, ‘Habitat B,C,&D’, ‘Habitat_E']) # every single column name
habitats = [‘Habitat_A, ‘Habitat B,C,&D’, ‘Habitat_E']
for index, row in final_dataframe.iterrows():
final_string = " "
print (final_dataframe.columns.name)
for h in habitats:
print(h)
for c in index:
if h in index.name: #checks if habitat is in column name
print(h)
if row[c] is not null:
final_string == final_string + c.name + ", "
print(final_string)
Upvotes: 1
Views: 183
Reputation: 7627
data_dict = {
'Species_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Habitat_A': ['X', '', 'X', 'X', '', 'X', 'X', '', 'X', ''],
'Habitat B,C,&D': ['X', '', 'X', 'X', '', 'X', 'X', '', 'X', ''],
'Habitat_E': ['', 'X', '', 'X', 'X', '', 'X', 'X', '', 'X'],
}
df = pd.DataFrame.from_dict(data_dict)
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: pd.Series([y[0] if y[1] == 'X' else '' for y in x.iteritems()]), axis=1)
df['All_habitats(CONCAT)'] = df.apply(lambda x: ','.join(filter(None, x[1:])), axis=1)
print(df)
Prints:
Species_ID Habitat_A Habitat B,C,&D Habitat_E All_habitats(CONCAT)
0 1 Habitat_A Habitat B,C,&D Habitat_A,Habitat B,C,&D
1 2 Habitat_E Habitat_E
2 3 Habitat_A Habitat B,C,&D Habitat_A,Habitat B,C,&D
3 4 Habitat_A Habitat B,C,&D Habitat_E Habitat_A,Habitat B,C,&D,Habitat_E
4 5 Habitat_E Habitat_E
5 6 Habitat_A Habitat B,C,&D Habitat_A,Habitat B,C,&D
6 7 Habitat_A Habitat B,C,&D Habitat_E Habitat_A,Habitat B,C,&D,Habitat_E
7 8 Habitat_E Habitat_E
8 9 Habitat_A Habitat B,C,&D Habitat_A,Habitat B,C,&D
9 10 Habitat_E Habitat_E
import pandas as pd, time
tic = time.perf_counter()
df = pd.read_csv(r'c:\Users\Alex20\Documents\Habitats.csv')
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: pd.Series([y[0] if y[1] == 'X' else '' for y in x.iteritems()]), axis=1)
df['All_habitats(CONCAT)'] = df.apply(lambda x: ','.join(filter(None, x[1:])), axis=1)
print(df)
print(f"Processed in {time.perf_counter() - tic:0.4f} seconds")
Output:
Species_ID ... All_habitats(CONCAT)
0 1 ... HabitatA,HabitatB,HabitatC,HabitatD,HabitatF,H...
1 2 ... HabitatC,HabitatG,HabitatP
2 3 ... HabitatA,HabitatB,HabitatC,HabitatE,HabitatG,H...
3 4 ... HabitatA,HabitatB,HabitatE,HabitatJ,HabitatL,H...
4 5 ... HabitatD,HabitatI,HabitatK,HabitatL,HabitatM,H...
... ... ... ...
2090 2091 ... HabitatA,HabitatB,HabitatE,HabitatF,HabitatG,H...
2091 2092 ... HabitatA,HabitatB,HabitatC,HabitatE,HabitatF,H...
2092 2093 ... HabitatB,HabitatC,HabitatD,HabitatG,HabitatH,H...
2093 2094 ... HabitatC,HabitatF,HabitatG,HabitatI,HabitatK,H...
2094 2095 ... HabitatB,HabitatE,HabitatG,HabitatI,HabitatK,H...
[2095 rows x 19 columns]
Processed in 0.4257 seconds
Upvotes: 2
Reputation: 153
Here is a solution...
import pandas as pd
import numpy as np
data = {'species_id':[1,2,3],
'Habitat_A':['x',np.nan,'x'],
'Habitat_B':['x', np.nan,np.nan],
'Habitat_C':['x', 'x',np.nan],
}
df = pd.DataFrame(data)
display(df)
for col in df.columns:
df[col] = df[col].replace('x', col+ ' ')
df[col] = df[col].replace(np.nan, '')
df['All_Habitats']= df['Habitat_A'].astype(str) + df['Habitat_B'].astype(str) + df['Habitat_C'].astype(str)
display(df)
Output:
species_id Habitat_A Habitat_B Habitat_C All_Habitats
0 1 Habitat_A Habitat_B Habitat_C Habitat_A Habitat_B Habitat_C
1 2 Habitat_C Habitat_C
2 3 Habitat_A Habitat_A
Upvotes: 0