mixjr365 cool

Reputation: 31

How do I find the count and percentage of occurrences of a word in a string? How do I fix this error?

I loaded an Excel file into Python with pandas and added a new column that flags whether a given word appears anywhere in each row: True if it does, False if not. Now I'm trying to find the percentage of True and False values in that column. Later I'd like to build a table separating the True rows from the False rows, but I need help with the percentage first. I'm a beginner; I only started last week.

For the percentage, my plan was to first count the occurrences of "true" and "false" in the column and then do the math to get the percentages, but I never got past the counting step. The code below prints 0, which is not what it is supposed to display.

import pandas as pd
import xlrd
df = pd.read_excel (r'C:\New folder\CrohnsD.xlsx')
print (df)
df['has_word_icd'] = df.apply(lambda row: True if 
row.str.contains('ICD').any() else False, axis=1)
print(df['has_word_icd'])
#df.to_excel(r'C:\New folder\done.xlsx')
test_str = "df['has_word_icd']"
counter = test_str.count('true')
print (str(counter))

This is the updated version, and it still gives me 0. I can't change df['has_word_icd'], because that's how the variable is introduced initially.

import pandas as pd
import xlrd
df = pd.read_excel (r'C:\New folder\CrohnsD.xlsx')
print (df)
df['has_word_icd'] = df.apply(lambda row: True if 
row.str.contains('ICD').any() else False, axis=1)
print(df['has_word_icd'])
#df.to_excel(r'C:\New folder\done.xlsx')

test_str = (df['has_word_icd'])

count = 0
for i in range(len(test_str)):
    if test_str[i] == 'true':
        count += 1
    i += 1

print(count)

Both gave me the same result.

Please help: the output from both versions is 0, and it shouldn't be. Can somebody give me code that directly returns the percentages of True and False?

Upvotes: 2

Views: 454

Answers (2)

vlemaistre

Reputation: 3331

Here is a way to do it using a list comprehension. For the percentage, you can use the np.mean() function, which works on a boolean column because True counts as 1 and False as 0:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['hello icd', 'bob', 'bob icd', 'hello'],
                   'b': ['bye', 'you', 'bob is icd better', 'bob is young']})

# True if any column in the row contains the substring 'icd'
df['contains_word_icd'] = df.apply(
    lambda row: any(['icd' in row[x] for x in df.columns]), axis=1)

# The mean of a boolean column is the fraction of True values
percentage = np.mean(df['contains_word_icd'])
# 0.5

Output :

           a                  b  contains_word_icd
0  hello icd                bye               True
1        bob                you              False
2    bob icd  bob is icd better               True
3      hello       bob is young              False
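
If you want the share of both True and False in one step, here is a minimal sketch using value_counts(normalize=True). The Series named flags is a made-up stand-in for the boolean column, not the asker's real data:

```python
import pandas as pd

# Hypothetical boolean column standing in for df['contains_word_icd'].
flags = pd.Series([True, False, True, True], name='contains_word_icd')

# normalize=True returns fractions of the total instead of raw counts;
# multiplying by 100 turns them into percentages.
percentages = flags.value_counts(normalize=True) * 100

# Convert to a plain dict keyed by True / False for easy lookup.
pct = percentages.to_dict()
print(pct[True], pct[False])
```

With three True values out of four, this gives 75.0 for True and 25.0 for False.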

Upvotes: 1

Nick

Reputation: 687

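The main problem lies here: "df['has_word_icd']". Putting the expression in quotes makes it a plain string to Python, so count() searches the text of that string rather than the column. The correct form is test_str = df['has_word_icd']. Note also that the column holds the booleans True and False, not the strings 'true' and 'false', so a comparison against 'true' never matches.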
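
To see the difference for yourself, here is a small sketch. The four-row DataFrame is invented for illustration and only assumes the column is boolean, as in the question:

```python
import pandas as pd

# Hypothetical stand-in for the asker's column.
df = pd.DataFrame({'has_word_icd': [True, False, True, True]})

# In quotes, this is just the text df['has_word_icd'], not the column.
quoted = "df['has_word_icd']"
quoted_count = quoted.count('true')   # searches that text for 'true'

# Without quotes, this is the actual boolean Series.
column = df['has_word_icd']
first_value = bool(column.iloc[0])    # the boolean True
matches_string = first_value == 'true'

print(quoted_count, first_value, matches_string)
```

The quoted version counts 0 because the text df['has_word_icd'] does not contain the substring 'true', and the boolean True is never equal to the string 'true'.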

Then you loop through the test_str like so:

count = 0
for i in range(len(test_str)):
    # the values are booleans, so test them directly instead of comparing to 'true'
    if test_str[i]:
        count += 1

print(count)

Then get the percentage:

percent = count / len(df['has_word_icd']) * 100
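
The loop can also be skipped entirely. This is a sketch under the same assumption (a boolean has_word_icd column, here filled with made-up values); it also shows one way to split the rows into the two tables mentioned in the question:

```python
import pandas as pd

# Hypothetical data; only the column name comes from the question.
df = pd.DataFrame({'has_word_icd': [True, False, True, True]})

# Summing booleans counts the True values, since True is treated as 1.
count = int(df['has_word_icd'].sum())

# The mean is the fraction of True values; times 100 gives a percentage.
percent = df['has_word_icd'].mean() * 100

# A boolean mask selects matching rows; ~ inverts the mask.
rows_true = df[df['has_word_icd']]
rows_false = df[~df['has_word_icd']]

print(count, percent, len(rows_true), len(rows_false))
```

The False percentage is then simply 100 - percent.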

Upvotes: 0
