auslander

Reputation: 548

Create two new fields at once in pandas dataframe based off of calculations of other fields

I am iterating over a series of CSV files as dataframes, eventually writing them all out to a common Excel workbook.

In one of the many files, there are decimal GPS values (latitude, longitude) split into two columns (df[4] and df[5]) that I'm converting to degrees-minutes-seconds. That method returns a tuple that I'm attempting to park in two new fields called dmslat and dmslon in the same row of the original dataframe:

def convert_dd_to_dms(lat, lon):
    # does the math here
    return dmslat, dmslon

csv_dir = askdirectory()  # tkinter directory picker
os.chdir(csv_dir)
for f in glob.iglob("*.csv"):
    (csv_path, csv_name) = os.path.split(f)
    (csv_prefix, csv_ext) = os.path.splitext(csv_name)
    if csv_prefix[-3:] == "loc":
        df = pd.read_csv(f)
        df['dmslat'] = None
        df['dmslon'] = None
        for i, row in df.iterrows():
            fixed_coords = convert_dd_to_dms(row[4], row[5])
            row['dmslat'] = fixed_coords[0]
            row['dmslon'] = fixed_coords[1]
        print(df)
# process the other files 

So when I use a print() statement I can see the coords are properly calculated but they are not being committed to the dmslat/dmslon fields.

I have also tried assigning the new fields within the row iterator, but since I am at the row scale, it ends up overwriting the entire column with the new calculated value every time.

How can I get the results to (succinctly) populate the columns?

Upvotes: 0

Views: 47

Answers (2)

Roy2012

Reputation: 12503

I think you'd be better off using apply rather than iterrows.

Here's a solution that is based on apply. I replaced your location calculation with a function named 'foo' which does some arbitrary calculation from two fields 'a' and 'b' to new values for 'a' and 'b'.

df = pd.DataFrame({"a": range(10), "b":range(10, 20)})
def foo(row):
    return (row["a"] + row["b"], row["a"] * row["b"])

new_df = df.apply(foo, axis=1).apply(pd.Series)

In the above code block, applying 'foo' returns a tuple for every row. Using apply again with pd.Series turns it into a data frame.

df[["a", "b"]] = new_df
df.head(3) 

    a   b
0   10  0
1   12  11
2   14  24
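
Applied to the question's columns, the same idea might look like the sketch below. It assumes the coordinates live in columns named lat and lon (hypothetical names; the question only gives positions 4 and 5), and to_dms and convert_dd_to_dms are simplified stand-ins for the conversion the question leaves out. Passing result_type="expand" to apply spreads each returned tuple across columns, so the second apply(pd.Series) is not needed.

```python
import pandas as pd


def to_dms(value):
    # Hypothetical helper: format one decimal-degree value as "deg min sec".
    # Simplified sketch: does not handle seconds rounding up to 60.
    sign = "-" if value < 0 else ""
    value = abs(value)
    degrees = int(value)
    minutes_full = (value - degrees) * 60
    minutes = int(minutes_full)
    seconds = (minutes_full - minutes) * 60
    return f"{sign}{degrees} {minutes} {seconds:.2f}"


def convert_dd_to_dms(lat, lon):
    # Stand-in for the question's function: returns a (dmslat, dmslon) tuple.
    return to_dms(lat), to_dms(lon)


# Assumed sample data; the real frame would come from pd.read_csv(f).
df = pd.DataFrame({"lat": [40.5, -33.75], "lon": [-73.25, 151.5]})

# Each returned tuple becomes one row of a two-column frame (columns 0 and 1).
expanded = df.apply(
    lambda row: convert_dd_to_dms(row["lat"], row["lon"]),
    axis=1,
    result_type="expand",
)
df["dmslat"] = expanded[0]
df["dmslon"] = expanded[1]
print(df)
```

Assigning the two expanded columns one at a time keeps the original lat and lon columns intact and creates both new fields from a single pass over the rows.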

Upvotes: 0

Rexovas

Reputation: 478

df.iterrows() yields each row as a separate Series, and when the columns have mixed types (as they do here) that Series is a copy rather than a view. When you assign to "dmslat" and "dmslon" on row, you are modifying the copy, not the original dataframe. You can confirm this by printing "row" after your assignments: the row object is updated, but the changes are not reflected in the original dataframe.
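
A small demonstration of that behaviour, as a sketch with made-up data (the column names and values are assumptions, not taken from the question's files):

```python
import pandas as pd

# Mixed column types, similar to a frame with numeric coordinates
# plus object columns initialised to None.
df = pd.DataFrame({"name": ["a", "b"], "lat": [40.5, 41.25]})
df["dmslat"] = None

for i, row in df.iterrows():
    row["dmslat"] = "changed"  # only touches the temporary Series

# The dataframe itself still holds None in every row.
unchanged = bool(df["dmslat"].isna().all())
print(unchanged)
```

The write succeeds on row, which is why printing row looks correct, but df never sees it.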

To modify the original dataframe, write through df.loc using the row's index label instead of assigning to row:

        for i, row in df.iterrows():
            fixed_coords = convert_dd_to_dms(row[4], row[5])
            df.loc[i, 'dmslat'] = fixed_coords[0]
            df.loc[i, 'dmslon'] = fixed_coords[1]
        print(df)

Using df.loc writes the values into the original dataframe rather than into the row copy.
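
If the loop is only there to fill the two columns, one alternative is to compute all the pairs first and assign each column once. This is a sketch under assumptions: the lat and lon column names are hypothetical, and convert_dd_to_dms is a placeholder that just labels its inputs instead of doing the real conversion.

```python
import pandas as pd


def convert_dd_to_dms(lat, lon):
    # Placeholder for the question's function: returns a two-item tuple.
    return f"lat {lat}", f"lon {lon}"


# Assumed sample data; the real frame would come from pd.read_csv(f).
df = pd.DataFrame({"lat": [40.5, -33.75], "lon": [-73.25, 151.5]})

# One (dmslat, dmslon) tuple per row, in row order.
pairs = [convert_dd_to_dms(lat, lon) for lat, lon in zip(df["lat"], df["lon"])]

# Transpose the list of pairs into two sequences.
# Note: this unpacking fails if the dataframe has no rows.
dmslat, dmslon = zip(*pairs)
df["dmslat"] = list(dmslat)
df["dmslon"] = list(dmslon)
print(df)
```

This avoids per-cell df.loc writes, which tend to be slow on larger frames.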

Upvotes: 1
