Reputation: 47
I am trying to build a dataframe that cleans up data from a database. The data has not been normalised properly (out of my control) and has errors.
A typical row of data looks like this: ['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0']
but i need it to break up the last field so I have: ['BTENFU0', 4.3, 0, 'ARGUS', 'DB583E-SN', '750MHz', 0']
I use an apply function to build up a dataframe, but the table has over 54,000 rows so it takes about 20mins to run.
Is there a faster way to do this? I tried some chaining ideas but I couldnt get split to work properly. Its also complicated because I have to check for specific errors in the data layout.
Here is the code:
def makeExpandedAntTable(df): # - df is a series apparently
if df.loc['antName'] == 'COMMSCOPE NT-360M-F_2600MHZ EDT_0':
df.loc['antName'] = 'COMMSCOPE NT-360M-F 2600MHZ EDT_0'
newlist = df.values.tolist()
print(newlist[0])
ant = newlist[3].split()
if ant[3] == 'EDT_02_5':
ant[3] = 'EDT_02.5'
ant.extend(ant[3].split("_"))
newRow = newlist[:3]
newRow.extend(ant)
del newRow[6:8]
if len(newRow) == 7:
dfExpandedAnt.loc[len(dfExpandedAnt)] = newRow
else:
print('error: missing field in ' + newRow)
ExpandedAntCols = ['Atoll_cell', 'height', 'bearing',' make', 'model', 'freq', 'tilt']
dfExpandedAnt = pd.DataFrame(columns = ExpandedAntCols)
dfAtollTxers = dfAtollTxers.apply(makeExpandedAntTable, axis = 1)
Would using a for loop to build up a list then converting it to a df at the end be faster? or just build the list in the helper function and do the df build in the main code?
Upvotes: 0
Views: 137
Reputation: 47
I came up with an answer as well, the guy who posted the answer before then deleted it gave me the clues. I wrote this before the new answer above. I'll take a look at it and see if i can now simplify what I have below.
splitted = dfAtollTxers['antName'].str.split(expand=True)
dfWorking = pd.concat([dfAtollTxers, splitted], axis=1)
splitted2 = dfWorking[3].str.split("_",expand=True)
splitted2.drop([0],inplace = True, axis = 1)
splitted2.rename(columns = {1:'tilt'},inplace = True)
dfExpandedAnt = pd.concat([dfWorking, splitted2], axis=1)
dfExpandedAnt.drop([3],inplace = True, axis = 1)
dfExpandedAnt.drop([4],inplace = True, axis = 1)
dfExpandedAnt.drop(['antName'],inplace = True, axis = 1)
dfExpandedAnt.rename(columns = {0:'make',1:'model',2:'freq'},inplace = True)
I cannot belove the insane amount of time this saves. My old method took over 20 minutes to run. This takes a blink!
Upvotes: 0
Reputation: 2684
Use str.split
and add them as new columns:
df = pd.DataFrame(data=[['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0'],
['BTENFU0', 4.3, 0, 'ARGUS DB583E-SN 750MHz EDT_0']],
columns=['Atoll_cell', 'height', 'bearing','messed_up_column'])
df[['make', 'model', 'freq', 'tilt']] = pd.DataFrame(df.messed_up_column.str.split().tolist())
df.drop(columns='messed_up_column', inplace=True)
print(df.to_string())
Output df:
Atoll_cell height bearing make model freq tilt
0 BTENFU0 4.3 0 ARGUS DB583E-SN 750MHz EDT_0
1 BTENFU0 4.3 0 ARGUS DB583E-SN 750MHz EDT_0
note for tilt you can then do:
df.tilt = df.tilt.str.replace('EDT_','').str.replace('_','.').astype(float)
Would using a for loop to build up a list then converting it to a df at the end be faster? or just build the list in the helper function and do the df build in the main code?
The answer to this is almost always to work with DataFrames and avoid for loops
Upvotes: 1