Reputation: 1939
I have some CSV data pasted below. I would like to parse it and load it into a DataFrame so that it is easier to analyze.
I want to grab the values based on each grouping of the logStreamName like so:
import pandas as pd

df = pd.read_csv('mydata.csv')
logs = df['logStreamName'].unique()
for i in logs:
    grouped_df = df[df['logStreamName'] == i]
But then how do I parse each subsetted DataFrame to get the associated values?
CSV data:
message,logStreamName
20/10/07 17:40:42 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 247632,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 247632,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_run_model - len(placed_ijkl): 40944,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_run_model - len(placed_region_ijl): 1706,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:42 - INFO - dse_run_model - len(not_placed_region_ijl): 1706,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12258.98828125,data-science-dse-cplex/default/27f44fce-90f8-40b2-83f7-3e1aef216fa6
20/10/07 17:40:24 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 323680,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 323680,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - len(placed_ijkl): 59280,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - len(placed_region_ijl): 2964,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - len(not_placed_region_ijl): 2964,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12313.5390625,data-science-dse-cplex/default/11c5884b-f7c5-4600-99d2-70584036ba3d
20/10/07 17:40:24 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 301312,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:24 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 301312,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:25 - INFO - dse_run_model - len(placed_ijkl): 44128,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:25 - INFO - dse_run_model - len(placed_region_ijl): 2758,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:40:25 - INFO - dse_run_model - len(not_placed_region_ijl): 2758,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
20/10/07 17:41:07 - INFO - __main__ - Maximum memory usage: 12286.75,data-science-dse-cplex/default/cb304e99-2c5f-4a13-b454-32de8e1370e2
Final output:
d = {'n_i*n_j*n_k*n_l': [247632, 323680, 301312], 'len(placed_ijkl)': [40944, 59280, 44128],
     'len(placed_region_ijl)': [1706, 2964, 2758], 'len(not_placed_region_ijl)': [1706, 2964, 2758],
     'Maximum memory usage': [12258.98828125, 12313.5390625, 12286.75]}
df = pd.DataFrame(data=d)
Upvotes: 0
Views: 50
Reputation: 13407
You can use a regular expression to capture the relevant parts of the message column, then use pivot to create the final output:
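# Raw string so that \s, \d and \. reach the regex engine unchanged
df[["id", "value"]] = df["message"].str.extract(r".*-\s.*-\s(?P<id>.*)(?:\:\s|\s=\s)(?P<value>(?:\d+|\d+\.\d+)$)")
# The same metric is logged twice per stream (once with ":" and once with "="), so keep one row per pair
out = df.drop_duplicates(["logStreamName", "id"]).pivot(index="logStreamName", columns="id", values="value")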
print(out)
id                    Maximum memory usage  len(not_placed_region_ijl)  len(placed_ijkl)  len(placed_region_ijl)  n_i*n_j*n_k*n_l
logStreamName
data-science-dse-...         12313.5390625                        2964             59280                    2964           323680
data-science-dse-...        12258.98828125                        1706             40944                    1706           247632
data-science-dse-...              12286.75                        2758             44128                    2758           301312
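Note that str.extract returns strings, so the pivoted values above are text rather than numbers. Here is a minimal, self-contained sketch of the same approach with a numeric conversion added. The stream names "stream-a" and "stream-b" are placeholders I made up to keep the sample short, and the dropna step is my own assumption about how you might want to handle log lines that don't match the pattern:

```python
import io

import pandas as pd

# Small sample in the same shape as the question's CSV.
# "stream-a" / "stream-b" are hypothetical stand-ins for the real logStreamName values.
csv_text = """message,logStreamName
20/10/07 17:40:42 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 247632,stream-a
20/10/07 17:40:42 - INFO - dse_model_assets - n_i*n_j*n_k*n_l = 247632,stream-a
20/10/07 17:40:42 - INFO - dse_run_model - len(placed_ijkl): 40944,stream-a
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12258.98828125,stream-a
20/10/07 17:40:24 - INFO - dse_run_model - n_i*n_j*n_k*n_l: 323680,stream-b
20/10/07 17:40:24 - INFO - dse_run_model - len(placed_ijkl): 59280,stream-b
20/10/07 17:41:01 - INFO - __main__ - Maximum memory usage: 12313.5390625,stream-b
"""
df = pd.read_csv(io.StringIO(csv_text))

# Same pattern as above: skip "<timestamp> - <level> - <logger> - ", then capture
# the metric name ("id") and the trailing number ("value").
pattern = r".*-\s.*-\s(?P<id>.*)(?:\:\s|\s=\s)(?P<value>(?:\d+|\d+\.\d+)$)"
df = df.join(df["message"].str.extract(pattern))

# Assumption: rows that do not match the pattern are simply discarded.
df = df.dropna(subset=["id"])

# Convert the captured text to numbers (the column becomes float64 here
# because it mixes integers and decimals).
df["value"] = pd.to_numeric(df["value"])

# One row per stream, one column per metric.
out = (
    df.drop_duplicates(["logStreamName", "id"])
      .pivot(index="logStreamName", columns="id", values="value")
)
print(out)
```

If you want the layout from the question's "Final output" (no stream name in the index), out.reset_index(drop=True) should get you close.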
Upvotes: 1