Thomas

Reputation: 181

More efficient Iteration in Pandas and Python

I have hundreds of text files with thousands of records that look like this (6 GB of weather data from NOAA): '0175690150931212019010100567+34300-116166FM-15+06...' Each position in the string holds a specific piece of information, and I separate it into columns with the function below:

import pandas as pd

def convert_to_df(path):
    df = pd.read_csv(path, low_memory=False, header=None, names=['Data'])
    df_clean = pd.DataFrame(columns=['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', 'LAT', 'LONG', 'TYPE', 'ELEV', 'FWSID',
                                     'MPOQC', 'WIND_ANGLE', ...])
    for record, data in enumerate(df['Data']):
        # slice each fixed-width field out of the record string
        TVC = df['Data'][record][0:4]
        USAF = df['Data'][record][4:10]
        WBAN = df['Data'][record][10:15]
        DATE = df['Data'][record][15:23]
        TIME = df['Data'][record][23:27]
        ...
        ATM_PRESSURE = df['Data'][record][99:104]
        ATM_PRESSURE_QC = df['Data'][record][104:105]

        clean_dict = {'TVC': TVC, 'USAF': USAF, 'WBAN': WBAN, 'DATE': DATE,
                      'TIME': TIME, 'SOURCE': SOURCE, 'LAT': LAT, 'LONG': LONG, 'TYPE': TYPE, 'ELEV': ELEV, ...}
        # append one row per record (rebuilds the DataFrame every iteration)
        df_clean = df_clean.append(clean_dict, ignore_index=True)
    return df_clean

It ends up looking something like this, though with many more rows and 31 columns:

TVC USAF WBAN DATE TIME
0175 690150 93121 20190101 0056
0175 690150 93121 20190101 0156

I am using multiprocessing, but it still takes quite some time to get through everything (20 hours or so). Just curious if there is a more efficient/faster way to separate out the positional data in each string into columns.

Upvotes: 1

Views: 93

Answers (1)

ddejohn

Reputation: 8952

You can perform the string slicing on the entire Data column at once using pandas' vectorized .str accessor:

TVC = df["Data"].str[0:4]
USAF = df["Data"].str[4:10]
WBAN = df["Data"].str[10:15]
...

Writing every slice out by hand is tedious, so you could instead pair your column names with a list of slice boundaries:

columns = ['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', ...]
slices = [0, 4, 10, 15, 23, 27, ...]
slices = zip(slices, slices[1:])  # (0, 4), (4, 10), (10, 15), ...

clean_dict = {}
for key, (start, end) in zip(columns, slices):
    # each value is an entire column (a Series), produced by one vectorized slice
    clean_dict[key] = df["Data"].str[start:end]
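
Since every value in clean_dict is already a full column, a minimal follow-up (assuming the boundary list covers all 31 fields) is to build the cleaned frame in one constructor call instead of appending row by row:

df_clean = pd.DataFrame(clean_dict)  # one column per key, no per-row append

This avoids the repeated DataFrame.append calls in your loop, which copy the whole frame on every iteration and are a large part of the slowdown.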

Upvotes: 3
