Eular
Eular

Reputation: 1817

Python fill missing data

So I have a table that is kind of data like this

theta    phi    x   y

0       0      1   2
0       1      2   3
--------------------
90      360    4   5

theta values runs from 0 to 90 and for each theta phi runs from 0 to 360, but some of the phi s are missing here and there in the table and I have to fill those values. I'm trying to use pandas for this job like this,

import pandas as pd
cols=['theta','phi','x','y']
data = pd.read_csv('data.dat', sep=" |\t", header=None,names=cols,engine='python')


def fill_up(i):
    df=data[data['theta']==i]
    df.set_index('phi',inplace=True)
    df= df.reindex(range(0,361)).reset_index()
    df=df[cols]
    df.interpolate(inplace=True)
    return df

df=pd.concat([fill_up(i) for i in xrange(0,91)])
df.to_csv("new.txt",sep=' ', index=False, header=False)

as you can see I'm creating a seperate data frame for each theta values and concatnating them finally. Is there any way to achieve this without creating a different dataframe or achieving this more efficiently? Also should I take care for any memory overflow as the datafile can be several mb?

Upvotes: 2

Views: 514

Answers (1)

Ben.T
Ben.T

Reputation: 29635

I think you can do it by setting 'theta' and 'phi' as index with set_index, then reindex with pd.MultiIndex.from_product with all the values of 'theta' and 'phi' you expect, fill nan values with interpolate and finally reset_index such as:

new_data = (data.set_index(['theta','phi'])
                 .reindex(pd.MultiIndex.from_product( [range(91),range(361)],
                                                      names=['theta','phi']))
                 .interpolate().reset_index())

Note the range(91) and range(361) to create all the values for theta and phi.

EDIT for blank line, you can do:

new_data = (data.set_index(['theta','phi'])
                 .reindex(pd.MultiIndex.from_product( [range(91),range(362)],
                                                      names=['theta','phi']))
                 .interpolate().reset_index())
new_data.loc[new_data['phi'] ==361] = ''

Note that the range for phi is range(362) to add a row that you make "blank" with the next command line, replacing all the rows where phi = 361 by blank

Or, with the new_data as in the original solution, you can create a blank dataframe with specific indexes that you can concat to new_data and sort_index after

blanck_frame = pd.DataFrame(data='', columns=new_data.columns
                            index=new_data.loc[new_data['phi'] ==360].index+0.5)
new_data = pd.concat([new_data,blanck_frame]).sort_index()
# you can add  .reset_index(drop=True) at the end if you want integer indexes

Upvotes: 1

Related Questions