Reputation: 16365
I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame
obs=pd.read_csv(StringIO(x.strip()), sep=",\s*")
print(obs)
which produces
longitude latitude
0 degrees_east degrees_north
1 -142.842 -1.82
2 -25.389 39.87
3 -37.704 27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?
Upvotes: 4
Views: 5314
Reputation: 166
read_csv
will create a multi-index from two header rows:
import pandas as pd
from io import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame
obs=pd.read_csv(StringIO(x.strip()), sep=",", header=[0,1])
obs.rename_axis(columns=['Values', 'Units'], inplace=True)
print(obs)
which produces:
Values longitude latitude
Units degrees_east degrees_north
0 -142.842 -1.820
1 -25.389 39.870
2 -37.704 27.114
Selecting a top-level column gets:
In [1]: obs.longitude
Out[1]:
Units degrees_east
0 -142.842
1 -25.389
2 -37.704
If you want to remove the units entirely, it's probably easier to create a new dataframe:
nobs = pd.DataFrame(data=obs.values,
columns=obs.columns.levels[0],
dtype=float)
print(nobs)
giving:
Values latitude longitude
0 -142.842 -1.820
1 -25.389 39.870
2 -37.704 27.114
Upvotes: 0
Reputation: 879481
Allowing pandas to read the second line as data is screwing up the dtype for the columns. Instead of a float
dtype, the presence of strings make the dtype of the columns object
, and the underlying objects, even the numbers, are strings. This screws up all numerical operations:
In [8]: obs['latitude']+obs['longitude']
Out[8]:
0 degrees_northdegrees_east
1 -1.82-142.842
2 39.87-25.389
3 27.114-37.704
In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv
skip the second line.
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
content = StringIO(x.strip())
def read_csv(content):
columns = next(content).strip().split(',')
units = next(content).strip().split(',')
obs = pd.read_table(content, sep=",\s*", header=None)
obs.columns = ['{c} ({u})'.format(c=col, u=unit)
for col, unit in zip(columns, units)]
return obs
obs = read_csv(content)
print(obs)
# longitude (degrees_east) latitude (degrees_north)
# 0 -142.842 -1.820
# 1 -25.389 39.870
# 2 -37.704 27.114
print(obs.dtypes)
# longitude (degrees_east) float64
# latitude (degrees_north) float64
Upvotes: 3