Mick Hawkes
Mick Hawkes

Reputation: 47

Python: Pandas: Is there a quicker way to build this dataframe?

I am trying to build a dataframe that cleans up data from a database. The data has not been normalised properly (out of my control) and has errors.

A typical row of data looks like this: ['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0']

but i need it to break up the last field so I have: ['BTENFU0', 4.3, 0, 'ARGUS', 'DB583E-SN', '750MHz', 0']

I use an apply function to build up a dataframe, but the table has over 54,000 rows so it takes about 20mins to run.

Is there a faster way to do this? I tried some chaining ideas but I couldnt get split to work properly. Its also complicated because I have to check for specific errors in the data layout.

Here is the code:

def makeExpandedAntTable(df): # - df is a series apparently
    if df.loc['antName'] == 'COMMSCOPE NT-360M-F_2600MHZ EDT_0':
        df.loc['antName'] = 'COMMSCOPE NT-360M-F 2600MHZ EDT_0'
    newlist = df.values.tolist()
    print(newlist[0])

    ant = newlist[3].split()
    if ant[3] == 'EDT_02_5':
        ant[3] = 'EDT_02.5'
    ant.extend(ant[3].split("_"))
    newRow = newlist[:3]
    newRow.extend(ant)
    del newRow[6:8]
    if len(newRow) == 7:
        dfExpandedAnt.loc[len(dfExpandedAnt)] = newRow
    else:
        print('error: missing field in ' + newRow)

--- Main code

ExpandedAntCols = ['Atoll_cell', 'height', 'bearing','  make', 'model', 'freq', 'tilt']

dfExpandedAnt = pd.DataFrame(columns = ExpandedAntCols)
dfAtollTxers = dfAtollTxers.apply(makeExpandedAntTable, axis = 1)

Would using a for loop to build up a list then converting it to a df at the end be faster? or just build the list in the helper function and do the df build in the main code?

Upvotes: 0

Views: 137

Answers (2)

Mick Hawkes
Mick Hawkes

Reputation: 47

I came up with an answer as well, the guy who posted the answer before then deleted it gave me the clues. I wrote this before the new answer above. I'll take a look at it and see if i can now simplify what I have below.

splitted = dfAtollTxers['antName'].str.split(expand=True)
dfWorking = pd.concat([dfAtollTxers, splitted], axis=1)
splitted2 = dfWorking[3].str.split("_",expand=True)
splitted2.drop([0],inplace = True, axis = 1)
splitted2.rename(columns = {1:'tilt'},inplace = True)
dfExpandedAnt = pd.concat([dfWorking, splitted2], axis=1)
dfExpandedAnt.drop([3],inplace = True, axis = 1)
dfExpandedAnt.drop([4],inplace = True, axis = 1)
dfExpandedAnt.drop(['antName'],inplace = True, axis = 1)
dfExpandedAnt.rename(columns = {0:'make',1:'model',2:'freq'},inplace = True)

I cannot belove the insane amount of time this saves. My old method took over 20 minutes to run. This takes a blink!

Upvotes: 0

cosmic_inquiry
cosmic_inquiry

Reputation: 2684

Use str.split and add them as new columns:

df = pd.DataFrame(data=[['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0'],
                        ['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0']], 
                        columns=['Atoll_cell', 'height', 'bearing','messed_up_column'])
df[['make', 'model', 'freq', 'tilt']] = pd.DataFrame(df.messed_up_column.str.split().tolist())
df.drop(columns='messed_up_column', inplace=True)
print(df.to_string())

Output df:

  Atoll_cell  height  bearing   make      model    freq   tilt
0    BTENFU0     4.3        0  ARGUS  DB583E-SN  750MHz  EDT_0
1    BTENFU0     4.3        0  ARGUS  DB583E-SN  750MHz  EDT_0

note for tilt you can then do:

df.tilt = df.tilt.str.replace('EDT_','').str.replace('_','.').astype(float)

Would using a for loop to build up a list then converting it to a df at the end be faster? or just build the list in the helper function and do the df build in the main code?

The answer to this is almost always to work with DataFrames and avoid for loops

Upvotes: 1

Related Questions