Reputation: 31
I have an Excel file that I loaded into Python with pandas. I added a new column that checks whether a word appears in each row: True if it does, False if not. Now I'm trying to find the percentage of True and False values in that column. Later I will try to make a table separating the True rows from the False rows, but I need help with the percentage first. I am a beginner; I started this last week.
For the percentage, I decided to first count the occurrences of "true" and "false" in the column and then do the math to get the percentages, but I didn't get past the counting. The code below prints 0, which is not what it is supposed to display.
import pandas as pd
import xlrd
df = pd.read_excel (r'C:\New folder\CrohnsD.xlsx')
print (df)
df['has_word_icd'] = df.apply(lambda row: True if
    row.str.contains('ICD').any() else False, axis=1)
print(df['has_word_icd'])
#df.to_excel(r'C:\New folder\done.xlsx')
test_str = "df['has_word_icd']"
counter = test_str.count('true')
print (str(counter))
This is the updated version, and it still gives me 0. I cannot change df['has_word_icd'] because that's how the variable is introduced initially.
import pandas as pd
import xlrd
df = pd.read_excel (r'C:\New folder\CrohnsD.xlsx')
print (df)
df['has_word_icd'] = df.apply(lambda row: True if
    row.str.contains('ICD').any() else False, axis=1)
print(df['has_word_icd'])
#df.to_excel(r'C:\New folder\done.xlsx')
test_str = (df['has_word_icd'])
count = 0
for i in range(len(test_str)):
    if test_str[i] == 'true':
        count += 1
    i += 1
print(count)
Both gave me the same result.
Please help: the output from both versions is "0", and it shouldn't be. I'd like code that directly gives me the percentage of "true" and "false".
Upvotes: 2
Views: 454
Reputation: 3331
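Here is a way to do it using a list comprehension. For the percentage, you can use the np.mean() function: on a boolean column it returns the fraction of True values, so multiply by 100 if you want a percentage.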
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['hello icd', 'bob', 'bob icd', 'hello'],
                   'b': ['bye', 'you', 'bob is icd better', 'bob is young']})
# True if any column of the row contains 'icd'
df['contains_word_icd'] = df.apply(lambda row:
    any(['icd' in row[x] for x in df.columns]), axis=1)
# mean of a boolean column = fraction of True values
percentage = np.mean(df['contains_word_icd'])
# 0.5
Output:
           a                  b  contains_word_icd
0  hello icd                bye               True
1        bob                you              False
2    bob icd  bob is icd better               True
3      hello       bob is young              False
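Since the question asks for the share of both True and False, here is a minimal sketch of one way to get both at once with value_counts(normalize=True). It reuses the toy data from this answer; the name `shares` is only for illustration.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame({'a': ['hello icd', 'bob', 'bob icd', 'hello'],
                   'b': ['bye', 'you', 'bob is icd better', 'bob is young']})

# True if either text column of the row contains 'icd'
df['contains_word_icd'] = df.apply(
    lambda row: any('icd' in row[x] for x in ['a', 'b']), axis=1)

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 turns them into percentages.
shares = df['contains_word_icd'].value_counts(normalize=True) * 100
print(shares)
```

The result is a Series indexed by True and False, so both percentages can be read from one object.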
Upvotes: 1
Reputation: 687
The main problem lies here: "df['has_word_icd']". You put the variable in quotes, which to Python means it is a plain string, not the column. The correct version is
test_str = df['has_word_icd']
Then you loop through test_str like so:
count = 0
for i in range(len(test_str)):
    # the column holds booleans, so test the value itself
    # instead of comparing it with the string 'true'
    if test_str[i]:
        count += 1
print(count)
Then get the percentage:
percent = count / len(df['has_word_icd']) * 100
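As a rough sketch of the same idea without an explicit loop: the column holds booleans rather than the strings 'true' and 'false', so comparing against 'true' in the question can never match, and summing the column counts the True values directly. The data below is a made-up stand-in for the Excel file, and the names `code`, `flags`, `true_rows` and `false_rows` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data; the real file is not available here.
df = pd.DataFrame({'code': ['ICD-10 K50', 'ICD-9 555', 'none', 'ICD-10 K51']})
df['has_word_icd'] = df['code'].str.contains('ICD')

# A boolean never equals the string 'true', so such a comparison counts nothing.
print(True == 'true')  # False

flags = df['has_word_icd']
count = flags.sum()                      # True counts as 1, False as 0
percent_true = count / len(flags) * 100
percent_false = 100 - percent_true
print(percent_true, percent_false)

# Boolean indexing splits the rows into two separate tables.
true_rows = df[flags]
false_rows = df[~flags]
```

The last two lines cover the follow-up from the question, separating the True rows from the False rows.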
Upvotes: 0