Reputation: 168
I have a pandas dataframe and I am cleaning the data using applymap on all elements using a custom function and storing the cleaned values in separate column.
tag0 tag1 tag2 tag3
1.Kharif 3.Pest and Disease Management 4.Grasshopper 2.Paddy
1.Kharif 2.Brinjal 3.Crop Growth Management
1.Kharif 3.Pest and Disease Management 4.Caterpillar 2.Black Gram
1.Kharif 3.Pest and Disease Management 4.Caterpillar 2.Cotton
The above is a part of the entire dataframe.
I wrote the function written below.
def tag_cleaner(tag):
'''
this function takes an argument called tag and checks if it starts with 1 then
it puts it in a new column called season and so on. It is performed row-wise
and at the end the dataframe will have columnar values
'''
if tag.startswith('1'):
df_tags['season'] = tag
elif tag.startswith('2'):
df_tags['crop'] = tag
elif tag.startswith('3'):
df_tags['maintopic'] = tag
elif tag.startswith('4'):
df_tags['subtopic'] = tag
elif tag.startswith('5'):
df_tags['issue'] = tag
else:
return tag
and then applied the applymap function
df_tags.applymap(tag_cleaner)
I expect the output to be like this
season crop maintopic subtopic
1. Kharif 2.Paddy 3. Pest and Disease Management 4. Grasshopper
1. Kharif 2. Brinjal 3. Crop Growth Management NA
1. Kharif 2. Black Gram 3. Pest and Disease Management 4. Catterpillar
1. Kharif 2. Cotton 3. Pest and Disease Management 4. Catterpillar
The command is able to make the new columns as I wanted, but in all the columns there is same value. It looks like this. This is the same value copied in the entire dataframe.
season crop maintopic subtopic
1.Kharif 2.Paddy 3.Crop Growth Management 4. Caterpillar
But I am getting this error
AttributeError: ("'float' object has no attribute 'startswith'", 'occurred at index tag2')
I am a beginner and I do not know where am I mistake. I think I am making a logical error in the function that I have defined which is why the last run of the function copies the value to the entire series of the dataframe. Please help.
Upvotes: 3
Views: 813
Reputation: 863176
Use:
#reshape DataFrame with remove original columns names
df = df.stack().to_frame('a').reset_index(level=1, drop=True).reset_index()
#get values before .
df['b'] = df['a'].str.split('.').str[0]
#dictionary for new columns names
d = {'1': 'season', '2': 'crop', '3': 'maintopic', '4':'subtopic','5':'issue'}
#pivoting and get new columns names
df = df.pivot('index','b','a').rename(columns=d).rename_axis(None, axis=1).rename_axis(None)
print (df)
season crop maintopic subtopic
0 1.Kharif 2.Paddy 3.Pest and Disease Management 4.Grasshopper
1 1.Kharif 2.Brinjal 3.Crop Growth Management NaN
2 1.Kharif 2.Black Gram 3.Pest and Disease Management 4.Caterpillar
3 1.Kharif 2.Cotton 3.Pest and Disease Management 4.Caterpillar
EDIT: Error means here there is multiple values with same number per rows, solution is use pivot_table
with aggregtae function join
:
print (df)
tag0 tag1 tag2 \
0 1.Kharif 1.Pest and Disease Management 4.Grasshopper
1 1.Kharif 2.Brinjal 3.Crop Growth Management
2 1.Kharif 3.Pest and Disease Management 4.Caterpillar
3 1.Kharif 3.Pest and Disease Management 4.Caterpillar
tag3
0 2.Paddy
1 NaN
2 2.Black Gram
3 2.Cotton
df = df.stack().to_frame('a').reset_index(level=1, drop=True).reset_index()
df['b'] = df['a'].str.split('.').str[0]
d = {'1': 'season', '2': 'crop', '3': 'maintopic', '4':'subtopic','5':'issue'}
df = df.pivot_table(index='index',columns='b',values='a', aggfunc=','.join).rename(columns=d).rename_axis(None, axis=1).rename_axis(None)
print (df)
season crop \
0 1.Kharif,1.Pest and Disease Management 2.Paddy
1 1.Kharif 2.Brinjal
2 1.Kharif 2.Black Gram
3 1.Kharif 2.Cotton
maintopic subtopic
0 NaN 4.Grasshopper
1 3.Crop Growth Management NaN
2 3.Pest and Disease Management 4.Caterpillar
3 3.Pest and Disease Management 4.Caterpillar
Upvotes: 2