Python: Pandas: Is there a quicker way to build this dataframe?

Question

I am trying to build a dataframe that cleans up data from a database. The data has not been normalised properly (out of my control) and has errors.

A typical row of data looks like this: ['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0']

but I need to break up the last field so that I have: ['BTENFU0', 4.3, 0, 'ARGUS', 'DB583E-SN', '750MHz', '0']

I use an apply function to build up a dataframe, but the table has over 54,000 rows, so it takes about 20 minutes to run.

Is there a faster way to do this? I tried some chaining ideas but I couldn't get split to work properly. It's also complicated because I have to check for specific errors in the data layout.

Here is the code:

def makeExpandedAntTable(df): # - df is a series apparently
    if df.loc['antName'] == 'COMMSCOPE NT-360M-F_2600MHZ EDT_0':
        df.loc['antName'] = 'COMMSCOPE NT-360M-F 2600MHZ EDT_0'
    newlist = df.values.tolist()
    print(newlist[0])

    ant = newlist[3].split()
    if ant[3] == 'EDT_02_5':
        ant[3] = 'EDT_02.5'
    ant.extend(ant[3].split("_"))
    newRow = newlist[:3]
    newRow.extend(ant)
    del newRow[6:8]
    if len(newRow) == 7:
        dfExpandedAnt.loc[len(dfExpandedAnt)] = newRow
    else:
        print('error: missing field in ' + str(newRow))

--- Main code

ExpandedAntCols = ['Atoll_cell', 'height', 'bearing', 'make', 'model', 'freq', 'tilt']

dfExpandedAnt = pd.DataFrame(columns = ExpandedAntCols)
dfAtollTxers = dfAtollTxers.apply(makeExpandedAntTable, axis = 1)

Would using a for loop to build up a list and then converting it to a df at the end be faster? Or should I just build the list in the helper function and do the df build in the main code?

cosmic_inquiry · Accepted Answer

Use str.split and add them as new columns:

import pandas as pd

df = pd.DataFrame(data=[['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0'],
                        ['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0']],
                  columns=['Atoll_cell', 'height', 'bearing', 'messed_up_column'])
df[['make', 'model', 'freq', 'tilt']] = pd.DataFrame(df.messed_up_column.str.split().tolist())
df.drop(columns='messed_up_column', inplace=True)
print(df.to_string())

Output df:

  Atoll_cell  height  bearing   make      model    freq   tilt
0    BTENFU0     4.3        0  ARGUS  DB583E-SN  750MHz  EDT_0
1    BTENFU0     4.3        0  ARGUS  DB583E-SN  750MHz  EDT_0
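As a possible variation on the same idea, str.split also accepts expand=True, which returns the tokens as a DataFrame directly and keeps the original index, so the new columns can be joined back without going through a list. This is only a sketch on the same sample rows; the intermediate name parts is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0'],
                        ['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0']],
                  columns=['Atoll_cell', 'height', 'bearing', 'messed_up_column'])

# expand=True gives one column per whitespace-separated token,
# indexed the same way as df
parts = df['messed_up_column'].str.split(expand=True)
parts.columns = ['make', 'model', 'freq', 'tilt']

# Drop the combined column and attach the split columns by index
df = df.drop(columns='messed_up_column').join(parts)
print(df.to_string())
```

Because the join is by index, this version should also behave if df does not have a default 0..n-1 index.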

Note that for tilt you can then do:

df.tilt = df.tilt.str.replace('EDT_','').str.replace('_','.').astype(float)
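To see how that chain treats the 'EDT_02_5' variant mentioned in the question, here is a small standalone check. The three tilt strings are taken from the question's code; passing regex=False is my addition, to make it explicit that the patterns are meant literally.

```python
import pandas as pd

# Tilt tokens as they appear in the question, including the malformed 'EDT_02_5'
tilt = pd.Series(['EDT_0', 'EDT_02_5', 'EDT_02.5'])

# Strip the 'EDT_' prefix, turn any remaining underscore into a decimal point,
# then convert the result to float
tilt_deg = (tilt.str.replace('EDT_', '', regex=False)
                .str.replace('_', '.', regex=False)
                .astype(float))

print(tilt_deg.tolist())  # [0.0, 2.5, 2.5]
```

The second replace is what removes the need for a separate 'EDT_02_5' special case.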

Would using a for loop to build up a list and then converting it to a df at the end be faster? Or should I just build the list in the helper function and do the df build in the main code?

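The answer to this is almost always to work with DataFrames and avoid for loops. In the original code, each dfExpandedAnt.loc[len(dfExpandedAnt)] = newRow call enlarges the frame one row at a time, which is typically much slower than a single vectorised string operation over the whole column.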
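For the error checks in the question, a loop-free version might look roughly like the following. This is a sketch under assumptions, not a drop-in replacement: the input column names other than antName are guessed from the question's output columns, the cell names BTENFU1 and BTENFU2 are hypothetical, and the third row is a made-up malformed entry added to exercise the missing-field check.

```python
import pandas as pd

# Assumed input layout; only 'antName' is named in the question.
# Row 3 is a hypothetical bad row that lacks the tilt token.
dfAtollTxers = pd.DataFrame(
    data=[['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0'],
          ['BTENFU1', 4.3, 0, 'COMMSCOPE NT-360M-F_2600MHZ EDT_0'],
          ['BTENFU2', 4.3, 0, 'ARGUS DB583E-SN 750MHz']],
    columns=['Atoll_cell', 'height', 'bearing', 'antName'])

# Fix the one known bad name before splitting (same special case as the question)
ant = dfAtollTxers['antName'].replace(
    'COMMSCOPE NT-360M-F_2600MHZ EDT_0', 'COMMSCOPE NT-360M-F 2600MHZ EDT_0')

# Flag rows that do not split into exactly four tokens instead of printing per row
ok = ant.str.split().str.len() == 4
print('rows with a missing or extra field:',
      dfAtollTxers.loc[~ok, 'Atoll_cell'].tolist())

# Split only the well-formed rows into the four target columns
parts = ant[ok].str.split(expand=True)
parts.columns = ['make', 'model', 'freq', 'tilt']

# Keep tilt as a string such as '0' or '02.5', mirroring the question's output
parts['tilt'] = (parts['tilt'].str.replace('EDT_', '', regex=False)
                              .str.replace('_', '.', regex=False))

dfExpandedAnt = dfAtollTxers.loc[ok, ['Atoll_cell', 'height', 'bearing']].join(parts)
print(dfExpandedAnt.to_string())
```

The whole table is processed in a handful of column-wise operations, and the malformed rows are collected in one place rather than reported one at a time.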
