user3001937
user3001937

Reputation: 2113

Filtering CSV data using Python/numpy

I am working on CSV file.

            id     gender       disease       read      write    science 
  1.        11       male      cancer, diabetes 34         46         39  
  2.        20       male      diabetes         60         52         61  
  3.        12       male      diabetes         37         44         39  
  4.        16       male      cancer           47         31         36  
  5.         7       male      diabetes         57         54         47  
  6.        21       male      diabetes         44         44         50  
  7.        15       male      diabetes         39         39         26  
  8.        22       male      diabetes         42         39         56  
  9.         9       male      cancer           48         49         44  
 10.        18       male      diabetes         50         33         44  
 11.         5       male      diabetes         47         40          .  
 12.        14       male      diabetes         47         41         42  
 13.         3       male      diabetes         63         65         63  
 14.        24       male         fever         52         62         47  
 15.         8     female      diabetes         39         44         44  
 16.         1     female      cancer           34         44         39  
 17.         4     female      diabetes         44         50         39  
 18.         2     female      diabetes         39         41         42  
 19.        19     female      cancer           28         46         44  
 20.        17     female      diabetes         47         57         44  
 21.         6     female      diabetes         47         41         40  
 22.        10     female      diabetes         47         54         53  
 23.        13     female      diabetes         47         46         47  
 24.        23     female      diabetes         65         65         58  
 25.        25     female    Breast cancer         47         44         42  

I would like to get all the rows where people have cancer. Some people have diabetes and cancer, so I have to filter that as well. The result should be:

1.         11       male      cancer, diabetes 34         46         39  
4.         16       male      cancer           47         31         36
9.         9       male      cancer           48         49         44  
19.        19     female      cancer           28         46         44 
25.        25     female    Breast cancer         47         44         42


import pandas as pd                     
import numpy as np

ppl_ve_cancer = pd.read_csv(join(dirname(__file__), 'data.csv'))
delta= pd.DataFrame.from_records(ppl_ve_cancer )
disease= delta['disease']

Now, how can I filter the "disease list", and after filtering it, how can I get the data in their row ( id,gender,read,write,science)

Upvotes: 2

Views: 3928

Answers (3)

Yeqing Zhang
Yeqing Zhang

Reputation: 1413

This answer will do exact what you need. You only need something like df[df['A'].str.contains("hello")]

import pandas as pd                     
import numpy as np

ppl_ve_cancer = pd.read_csv(join(dirname(__file__), 'data.csv'))
delta = pd.DataFrame.from_records(ppl_ve_cancer )
query = delta['disease'].str.contains('cancer')
delta_filtered = delta[query]
print delta_filtered

Upvotes: 2

Paul H
Paul H

Reputation: 68176

Here's a more pandas-centric way: First you read all the data as a dataframe, create a has cancer column, and then filter on that=

import StringIO
import pandas

datastring = StringIO.StringIO("""\
id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
20,male,diabetes,60,52,61
12,male,diabetes,37,44,39
16,male,cancer,47,31,36
7,male,diabetes,57,54,47
21,male,diabetes,44,44,50
15,male,diabetes,39,39,26
22,male,diabetes,42,39,56
9,male,cancer,48,49,44
18,male,diabetes,50,33,44
5,male,diabetes,47,40,-999
14,male,diabetes,47,41,42
3,male,diabetes,63,65,63
24,male,fever,52,62,47
8,female,diabetes,39,44,44
1,female,cancer,34,44,39
4,female,diabetes,44,50,39
2,female,diabetes,39,41,42
19,female,cancer,28,46,44
17,female,diabetes,47,57,44
6,female,diabetes,47,41,40
10,female,diabetes,47,54,53
13,female,diabetes,47,46,47
23,female,diabetes,65,65,58
25,female,"Breast cancer",47,44,42
""")

df = pandas.read_csv(datastring, na_values=-999)

# create the `has cancer` column
df['has cancer'] = df.disease.apply(lambda row: 'cancer' in row)

# print the filtered data
print(df[df['has cancer']].to_string())


    id  gender          disease  read  write  science has cancer
0   11    male  cancer,diabetes    34     46       39       True
3   16    male           cancer    47     31       36       True
8    9    male           cancer    48     49       44       True
15   1  female           cancer    34     44       39       True
18  19  female           cancer    28     46       44       True
24  25  female    Breast cancer    47     44       42       True

Upvotes: 2

Aidan
Aidan

Reputation: 767

This will take your CSV file, filter it by lines with cancer in the, and produce variables you can use imidiatly or store for later.

with open("input.csv") as I:
    for line in I:
        if "cancer" in line: #get lines with Cancer
            line = line.replace("\n","") #filter out new line symbols
            pid,gender,disease,read,write,science = line.split('\t') #split lines by tabs then assign to separate variables for later use
            print pid,gender,disease,read,write,science

Input:

id  gender  disease          read    write   science
11  male    cancer, diabetes 34  46  39
20  male    diabetes     60  52  61
12  male    diabetes     37  44  39
16  male    cancer           47  31  36

Output:

11 male cancer, diabetes 34 46 39
16 male cancer           47 31 36

Upvotes: 0

Related Questions