Reputation: 181
I have hundreds of text files with thousands of records that look like this (6 GB of weather data from NOAA): '0175690150931212019010100567+34300-116166FM-15+06...' Each position in the string encodes a specific field, and I separate it into columns with the function below:
import pandas as pd

def convert_to_df(path):
    # Read each raw record into a single 'Data' column
    df = pd.read_csv(path, low_memory=False, header=None, names=['Data'])
    df_clean = pd.DataFrame(columns=['TVC','USAF','WBAN','DATE','TIME','SOURCE','LAT','LONG','TYPE','ELEV','FWSID',
                                     'MPOQC','WIND_ANGLE',...])
    for record, data in enumerate(df['Data']):
        # Slice each fixed-width field out of the record by position
        TVC = df['Data'][record][0:4]
        USAF = df['Data'][record][4:10]
        WBAN = df['Data'][record][10:15]
        DATE = df['Data'][record][15:23]
        TIME = df['Data'][record][23:27]
        ...
        ATM_PRESSURE = df['Data'][record][99:104]
        ATM_PRESSURE_QC = df['Data'][record][104:105]
        clean_dict = {'TVC':TVC,'USAF':USAF,'WBAN':WBAN,'DATE':DATE,
                      'TIME':TIME,'SOURCE':SOURCE,'LAT':LAT,'LONG':LONG,'TYPE':TYPE,'ELEV':ELEV,...}
        # Append one row per record
        df_clean = df_clean.append(clean_dict, ignore_index=True)
    return df_clean
It ends up looking something like this, though with many more rows and 31 columns:
| TVC  | USAF   | WBAN  | DATE     | TIME |
|------|--------|-------|----------|------|
| 0175 | 690150 | 93121 | 20190101 | 0056 |
| 0175 | 690150 | 93121 | 20190101 | 0156 |
I am using multiprocessing, but it still takes quite some time to get through everything (20 hours or so). Just curious if there is a more efficient/faster way to separate the positional data in each string into columns.
Upvotes: 1
Views: 93
Reputation: 8952
You can perform the string slicing on the entire Data column at once, using pandas' vectorized .str accessor:
TVC = df["Data"].str[0:4]
USAF = df["Data"].str[4:10]
WBAN = df["Data"].str[10:15]
# ... and so on for the remaining columns
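Each of these slices is a full Series covering every record at once, so the per-row Python loop disappears, and so does the repeated df_clean.append call, which builds a brand-new DataFrame on every iteration and is likely where most of the 20 hours is going.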
This is tedious to write out, so you could instead pair your column names with a list of slice boundaries:
columns = ['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', ...]
slices = [0, 4, 10, 15, 23, 27, ...]
slices = zip(slices, slices[1:]) # (0, 4), (4, 10), (10, 15), ...
clean_dict = {}
for key, (start, end) in zip(columns, slices):
    clean_dict[key] = df["Data"].str[start:end]
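Putting it together, a minimal sketch of a rewritten convert_to_df could look like the following. Only the first few fields from your question are filled in; COLUMNS and BOUNDARIES are placeholders you would extend to the full 31-column layout:

import pandas as pd

# Placeholder layout: extend these two lists to cover all 31 columns
COLUMNS = ['TVC', 'USAF', 'WBAN', 'DATE', 'TIME']
BOUNDARIES = [0, 4, 10, 15, 23, 27]

def convert_to_df(path):
    df = pd.read_csv(path, low_memory=False, header=None, names=['Data'])
    slices = zip(BOUNDARIES, BOUNDARIES[1:])
    # One vectorized slice per column instead of one dict per record
    clean_dict = {key: df['Data'].str[start:end]
                  for key, (start, end) in zip(COLUMNS, slices)}
    return pd.DataFrame(clean_dict)

pd.DataFrame accepts the dict of Series directly, so no per-row append is needed at all.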
Upvotes: 3