Rich Signell
Rich Signell

Reputation: 16365

Associating units with Pandas DataFrame

I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:

longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114

I can read this into a Pandas DataFrame:

import pandas as pd
from StringIO import StringIO

x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''

# Create a Pandas DataFrame
obs=pd.read_csv(StringIO(x.strip()), sep=",\s*")

print(obs)

which produces

      longitude       latitude
0  degrees_east  degrees_north
1      -142.842          -1.82
2       -25.389          39.87
3       -37.704         27.114

But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?

Upvotes: 4

Views: 5314

Answers (2)

Dan Neuman
Dan Neuman

Reputation: 166

read_csv will create a multi-index from two header rows:

import pandas as pd
from io import StringIO

x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''

# Create a Pandas DataFrame
obs=pd.read_csv(StringIO(x.strip()), sep=",", header=[0,1])
obs.rename_axis(columns=['Values', 'Units'], inplace=True)

print(obs)

which produces:

Values    longitude      latitude
Units  degrees_east degrees_north
0          -142.842        -1.820
1           -25.389        39.870
2           -37.704        27.114

Selecting a top-level column gets:

In [1]: obs.longitude
Out[1]: 
Units  degrees_east
0          -142.842
1           -25.389
2           -37.704

If you want to remove the units entirely, it's probably easier to create a new dataframe:

nobs = pd.DataFrame(data=obs.values, 
                    columns=obs.columns.levels[0], 
                    dtype=float)

print(nobs)

giving:

Values  latitude  longitude
0       -142.842     -1.820
1        -25.389     39.870
2        -37.704     27.114

Upvotes: 0

unutbu
unutbu

Reputation: 879481

Allowing pandas to read the second line as data is screwing up the dtype for the columns. Instead of a float dtype, the presence of strings make the dtype of the columns object, and the underlying objects, even the numbers, are strings. This screws up all numerical operations:

In [8]: obs['latitude']+obs['longitude']
Out[8]: 
0    degrees_northdegrees_east
1                -1.82-142.842
2                 39.87-25.389
3                27.114-37.704

In [9]: obs['latitude'][1]
Out[9]: '-1.82'

So it is imperative that pd.read_csv skip the second line.

The following is pretty ugly, but given the format of the input, I don't see a better way.

import pandas as pd
from StringIO import StringIO

x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''

content = StringIO(x.strip())

def read_csv(content):
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    obs = pd.read_table(content, sep=",\s*", header=None)
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs

obs = read_csv(content)
print(obs)
#    longitude (degrees_east)  latitude (degrees_north)
# 0                  -142.842                    -1.820
# 1                   -25.389                    39.870
# 2                   -37.704                    27.114
print(obs.dtypes)
# longitude (degrees_east)    float64
# latitude (degrees_north)    float64

Upvotes: 3

Related Questions