Reputation: 31
I am reading a CSV file in pandas without headers. My problem is that when I hard-code values in usecols and names it works fine, but when I take the cols and names lists as input from JSON files, the column names and rows are mismatched. I've been stuck on this issue for a long time.
# This doesn't work.
temp = pd.read_csv(file_name, header=None,
                   skiprows=[0], engine='python',
                   sep=' |,',
                   usecols=cols,
                   names=names)
# But this works
temp = pd.read_csv(file_name, header=None,
                   skiprows=[0], engine='python',
                   sep=' |,',
                   usecols=[10, 12, 14],
                   names=['a', 'b', 'c'])
Consider this CSV:
X-Y-Z,12.67,India,
A-X-o,11.43,China,
D-F-L,6.09,Kenya,
Now running this code
cols = [2,3,1]
names = ['price','country','code']
temp = pd.read_csv('test.csv', header=None,
                   skiprows=[0], engine='python',
                   sep=' |,',
                   usecols=cols,
                   names=names)
produces...
price country code
0 A-X-o 11.43 China
1 D-F-L 6.09 Kenya
which is wrong, as countries are not decimals.
Upvotes: 2
Views: 13771
Reputation: 895
import pandas as pd
fields = ['a', 'b','c']
# always use `skipinitialspace`, which removes the spaces in the header when reading specific columns
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
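Note that this snippet assumes data.csv has a header row containing a, b and c. A minimal sketch with made-up inline data (the csv_text string is hypothetical) shows that the order listed in usecols does not matter when selecting by name either:
import pandas as pd
from io import StringIO

csv_text = "a, b, c, d\n1, 2, 3, 4\n5, 6, 7, 8\n"

# skipinitialspace strips the space after each comma so the header names match
df1 = pd.read_csv(StringIO(csv_text), skipinitialspace=True, usecols=['a', 'b', 'c'])
df2 = pd.read_csv(StringIO(csv_text), skipinitialspace=True, usecols=['c', 'a', 'b'])
print(df1.equals(df2))  # True: both keep the file order a, b, c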
EDIT:
You are doing it the wrong way: usecols as [2,3,1] and as [1,2,3] are the same, it is not going to change the order; what matters is how you specify names.
import pandas as pd

data = {'letters': ['X-Y-Z', 'A-X-o', 'D-F-L'],
        'num': [12.67, 11.43, 6.09],
        'Country': ['India', 'China', 'Kenya']}
df = pd.DataFrame(data)
df.to_csv('df.csv', header=False)
Let's consider some scenarios.
# this is what you have done
cols = [2, 3, 1]
names = ['price', 'country', 'code']
temp = pd.read_csv('df.csv', header=None, skiprows=[0], engine='python',
                   sep=',', usecols=cols, names=names)
OUTPUT
price country code
1 A-X-o 11.43 China
2 D-F-L 6.09 Kenya
Let's change cols to [1,2,3]; there will be no change in the output.
cols = [1, 2, 3]
names = ['price', 'country', 'code']
temp = pd.read_csv('df.csv', header=None, skiprows=[0], engine='python',
                   sep=',', usecols=cols, names=names)
OUTPUT
price country code
1 A-X-o 11.43 China
2 D-F-L 6.09 Kenya
Now let's change cols to [2,3,1] or [1,2,3] and names to ['code','price','country']:
cols = [2, 3, 1]
names = ['code', 'price', 'country']
temp = pd.read_csv('df.csv', header=None, skiprows=[0], engine='python',
                   sep=',', usecols=cols, names=names)
OUTPUT
code price country
0 A-X-o 11.43 China
1 D-F-L 6.09 Kenya
So simply switching cols to [2,3,1] while keeping names as ['price','country','code'] will not give you different results, because usecols=[2,3,1] and usecols=[1,2,3] are equivalent; the names are applied to the selected columns in the order they appear in the file.
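If cols and names arrive from a JSON file in an arbitrary order, one way to keep them aligned is sketched below, assuming (as in the question) that cols and names are parallel lists: sort the (position, name) pairs before reading, then restore the requested order afterwards.
import pandas as pd

# cols/names as they might arrive from the JSON input (parallel lists)
cols = [2, 3, 1]
names = ['price', 'country', 'code']

# names are assigned to the selected columns in ascending positional order,
# so sort the (position, name) pairs the same way before reading
pairs = sorted(zip(cols, names))      # [(1, 'code'), (2, 'price'), (3, 'country')]
sorted_cols = [c for c, _ in pairs]
sorted_names = [n for _, n in pairs]

temp = pd.read_csv('df.csv', header=None, skiprows=[0], engine='python',
                   sep=',', usecols=sorted_cols, names=sorted_names)

# optional: put the columns back in the order the JSON asked for
temp = temp[names]
With this, 'price' lines up with the numeric column and 'country' with the country column, which is what the question intended.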
Upvotes: 4
Reputation: 32021
Simply use the file name with its location and it will work (docs):
temp = pd.read_csv(file_name, usecols=['a', 'b', 'c'])
On the other hand, you could read all the columns of the CSV file and drop the unused columns, like below:
temp = pd.read_csv(file_name)
temp.drop(['col1', 'col2', ..., 'coln'], axis=1, inplace=True)  # list the columns you don't need
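Applied to a headerless file like the one in the question, a small sketch of this second approach (file_name here is a hypothetical path) is to read every column and then select, reorder and rename only the ones you need:
import pandas as pd

file_name = 'data.csv'  # hypothetical headerless CSV, as in the question

# read every column, then pick positions in the order you want
temp = pd.read_csv(file_name, header=None)
temp = temp[[2, 3, 1]]                       # integer column labels come from header=None
temp.columns = ['price', 'country', 'code']  # assign names in that same order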
Upvotes: 1