Reputation: 2113
I am working on CSV file.
id gender disease read write science
1. 11 male cancer, diabetes 34 46 39
2. 20 male diabetes 60 52 61
3. 12 male diabetes 37 44 39
4. 16 male cancer 47 31 36
5. 7 male diabetes 57 54 47
6. 21 male diabetes 44 44 50
7. 15 male diabetes 39 39 26
8. 22 male diabetes 42 39 56
9. 9 male cancer 48 49 44
10. 18 male diabetes 50 33 44
11. 5 male diabetes 47 40 .
12. 14 male diabetes 47 41 42
13. 3 male diabetes 63 65 63
14. 24 male fever 52 62 47
15. 8 female diabetes 39 44 44
16. 1 female cancer 34 44 39
17. 4 female diabetes 44 50 39
18. 2 female diabetes 39 41 42
19. 19 female cancer 28 46 44
20. 17 female diabetes 47 57 44
21. 6 female diabetes 47 41 40
22. 10 female diabetes 47 54 53
23. 13 female diabetes 47 46 47
24. 23 female diabetes 65 65 58
25. 25 female Breast cancer 47 44 42
I would like to get all the rows where people have cancer. Some people have diabetes and cancer, so I have to filter that as well. The result should be:
1. 11 male cancer, diabetes 34 46 39
4. 16 male cancer 47 31 36
9. 9 male cancer 48 49 44
19. 19 female cancer 28 46 44
25. 25 female Breast cancer 47 44 42
import pandas as pd
import numpy as np
ppl_ve_cancer = pd.read_csv(join(dirname(__file__), 'data.csv'))
delta= pd.DataFrame.from_records(ppl_ve_cancer )
disease= delta['disease']
Now, how can I filter the "disease list", and after filtering it, how can I get the data in their row ( id,gender,read,write,science)
Upvotes: 2
Views: 3928
Reputation: 1413
This answer will do exact what you need. You only need something like df[df['A'].str.contains("hello")]
import pandas as pd
import numpy as np
ppl_ve_cancer = pd.read_csv(join(dirname(__file__), 'data.csv'))
delta = pd.DataFrame.from_records(ppl_ve_cancer )
query = delta['disease'].str.contains('cancer')
delta_filtered = delta[query]
print delta_filtered
Upvotes: 2
Reputation: 68176
Here's a more pandas-centric way: First you read all the data as a dataframe, create a has cancer
column, and then filter on that=
import StringIO
import pandas
datastring = StringIO.StringIO("""\
id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
20,male,diabetes,60,52,61
12,male,diabetes,37,44,39
16,male,cancer,47,31,36
7,male,diabetes,57,54,47
21,male,diabetes,44,44,50
15,male,diabetes,39,39,26
22,male,diabetes,42,39,56
9,male,cancer,48,49,44
18,male,diabetes,50,33,44
5,male,diabetes,47,40,-999
14,male,diabetes,47,41,42
3,male,diabetes,63,65,63
24,male,fever,52,62,47
8,female,diabetes,39,44,44
1,female,cancer,34,44,39
4,female,diabetes,44,50,39
2,female,diabetes,39,41,42
19,female,cancer,28,46,44
17,female,diabetes,47,57,44
6,female,diabetes,47,41,40
10,female,diabetes,47,54,53
13,female,diabetes,47,46,47
23,female,diabetes,65,65,58
25,female,"Breast cancer",47,44,42
""")
df = pandas.read_csv(datastring, na_values=-999)
# create the `has cancer` column
df['has cancer'] = df.disease.apply(lambda row: 'cancer' in row)
# print the filtered data
print(df[df['has cancer']].to_string())
id gender disease read write science has cancer
0 11 male cancer,diabetes 34 46 39 True
3 16 male cancer 47 31 36 True
8 9 male cancer 48 49 44 True
15 1 female cancer 34 44 39 True
18 19 female cancer 28 46 44 True
24 25 female Breast cancer 47 44 42 True
Upvotes: 2
Reputation: 767
This will take your CSV file, filter it by lines with cancer in the, and produce variables you can use imidiatly or store for later.
with open("input.csv") as I:
for line in I:
if "cancer" in line: #get lines with Cancer
line = line.replace("\n","") #filter out new line symbols
pid,gender,disease,read,write,science = line.split('\t') #split lines by tabs then assign to separate variables for later use
print pid,gender,disease,read,write,science
Input:
id gender disease read write science
11 male cancer, diabetes 34 46 39
20 male diabetes 60 52 61
12 male diabetes 37 44 39
16 male cancer 47 31 36
Output:
11 male cancer, diabetes 34 46 39
16 male cancer 47 31 36
Upvotes: 0