Str4y_c4tz
Str4y_c4tz

Reputation: 11

parsing through and editing csv with pandas

I'm trying to parse through all of the cells in a csv file that represent heights and round what's after the decimal to match a number in a list (to round down to the nearest inch). After a few days of banging my head against the wall, this is the coding I've been able to get working:

import math
import pandas as pd

inch = [.0, .08, .16, .25, .33, .41, .50, .58, .66, .75, .83, .91, 1]

df = pd.read_csv("sample_csv.csv")


def to_number(s):
    for index, row in df.iterrows():
        try:
            num = float(s)
            num = math.modf(num)
            num = list(num)
            for i,j in enumerate(inch):
                if num[0] < j:
                    num[0] = inch[i-1]
                    break

                elif num[0] == j:
                    num[0] = inch[i]
                    break
            newnum = num[0] + num[1]
            return newnum
        except ValueError:
            return s


df = df.apply(lambda f : to_number(f[0]), axis=1).fillna('')
with open('new.csv', 'a') as f:
    df.to_csv(f, index=False)

Ideally I'd like to have it parse over an entire CSV with n headers, ignoring all strings and round the floats to match the list. Is there a simple(r) way to achieve this with Pandas? And would it be possible (or a good idea?) to have it edit the existing excel workbook instead of creating a new csv i'd have to copy/paste over?

Any help or suggestions would be greatly appreciated as I'm very new to Pandas and it's pretty god damn intimidating!

Upvotes: 1

Views: 98

Answers (2)

CJR
CJR

Reputation: 3985

Adding onto the other answer to address your problem with strings:

# Break the dataframe with a string
df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
df.ix[0,0] = 'str'

# Find out which things can be cast to numerics and put NaNs everywhere else
df_safe = df.apply(pd.to_numeric, axis=0, errors="coerce")
df_safe = (np.fix(df_safe) + np.round(12*(df_safe - np.fix(df_safe)))/12).round(2)

# Replace all the NaNs with the original data
df_safe[df_safe.isnull()] = df[df_safe.isnull()]

df_safe should be what you want. Despite the name, this isn't particularly safe and there are probably edge conditions that will be a problem.

Upvotes: 0

arra
arra

Reputation: 136

Helping would be a lot easier if you include a sample mock of the data you're trying to parse. To clarify the points you don't specify, as I understand it

  • By "an entire CSV with n headers, ignoring all strings and round the floats to match the list" you mean some n-column dataframe with k numeric columns each of which describe someone's height in inches.
  • The entries in the numeric columns are measured in units of feet.
  • You want to ignore the non-numeric columns and transform the data as 6.14 -> 6 feet, 1 inches (I'm implicitly assuming that by "round down" you want an integer floor; i.e. 6.14 feet is 6 feet, 0.14*12 = 1.68 inches; it's up to you whether this is floored or rounded to the nearest integer).

Now for a subset of random heights measured in feet sampled uniformly over 5.1 feet and 6.9 feet, we could do the following:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))

In [4]: df
Out[4]:
      0         1         2
0  6.020613  6.315707  5.413499
1  5.942232  6.834540  6.761765
2  5.715405  6.162719  6.363224
3  6.416955  6.511843  5.512515
4  6.472462  5.789654  5.270047
5  6.370964  5.509568  6.113121
6  6.353790  6.466489  5.460961
7  6.526039  5.999284  6.617608
8  6.897215  6.016648  5.681619
9  6.886359  5.988068  5.575993

In [5]: np.fix(df) + np.floor(12*(df - np.fix(df)))/12
Out[5]:
      0         1         2
0  6.000000  6.250000  5.333333
1  5.916667  6.833333  6.750000
2  5.666667  6.083333  6.333333
3  6.416667  6.500000  5.500000
4  6.416667  5.750000  5.250000
5  6.333333  5.500000  6.083333
6  6.333333  6.416667  5.416667
7  6.500000  5.916667  6.583333
8  6.833333  6.000000  5.666667
9  6.833333  5.916667  5.500000

We're using np.fix to extract the integral part of the height value. Likewise, df - np.fix(df) represents the fractional remainder in feet or in inches when multiplied by 12. np.floor just truncates this to the nearest inch below, and the final division by 12 returns the unit of measurement from inches to feet.

You can change np.floor to np.round to get an answer rounded to the nearest inch rather than truncated to the previous whole inch. Finally, you can specify the precision of the output to insist that the decimal portion is selected from your list.

In [6]: (np.fix(df) + np.round(12*(df - np.fix(df)))/12).round(2)
Out[6]:
  0     1     2
0  6.58  5.25  6.33
1  5.17  6.42  5.67
2  6.42  5.83  6.33
3  5.92  5.67  6.33
4  6.83  5.25  6.58
5  5.83  5.50  6.92
6  6.83  6.58  6.25
7  5.83  5.33  6.50
8  5.25  6.00  6.83
9  6.42  5.33  5.08

Upvotes: 2

Related Questions