Reputation: 85
I am using pandas to read and analyse large NHS prescription datasets. They are individually saved as csv files in a folder. I want to apply the same couple of lines of code to each of these files individually - so I want to individually read each dataframe with pandas and then extract relevant data and perform a basic calculation.
This is my code so far:
import pandas as pd
import glob
path = "/Volumes/TOSHIBA EXT/Datasets/2015"
all_files = glob.glob(path + "/*.CSV")
for f in all_files:
pd.read_csv(f,index_col=None, header=0, usecols=[2,4,5], names=['PRACTICE','BNF NAME', 'ITEMS'])
f=f[f['BNF NAME'].str.contains('Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav')]
print pd.to_numeric(f['ITEMS']).sum()
However the following error is coming up..
TypeError: string indices must be integers, not str
I would ideally like to define a function that selects the relevant rows containing the strings: "Ampicillin", "Amoxicillin" and "Co-Amoxiclav"; and then totals the number of items for each of these (i.e totalling the column named 'ITEMS') which I could then use in a for loop to iterate through each file.
Would appreciate any pointers as to how to avoid this error and achieve the above.
Many thanks! :)
Upvotes: 1
Views: 2262
Reputation: 32728
There are two issues in your code:
1) The return value of pd.read_csv isn't stored in a variable. That's why you're getting a TypeError
- you're trying to operate on f
(the filename) as if it were a DataFrame
.
2) The filter doesn't work because the expression evaluates from the inside out. To solve this you can create a list of options and then use isin
to test if the target is in the list like so:
df["BNF Name"].isin(['Ampicillin', 'Amoxicillin', 'Co-Amoxiclav'])
Upvotes: 1
Reputation: 142651
f
is filename but you use it as DataFrame - f['BNF NAME']
.
You need
df = pd.read_csv(...)
and then you can use
df['BNF NAME']
df = df[ df['BNF NAME'] ... ]
df['ITEMS']
Upvotes: 2