Reputation: 1349
I have a list of strings like the following:
input = ["number__128_alg__hello_min_n__7_max_n__9_full_seq__True_random_color__False_shuffle_shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__hi_min_n__7_max_n__9_full_seq_embedding__False_random_color__False_shuffle_shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__what_random_color__False_shuffle_shapes__False.pkl"]
Each string is a sequence of parameters: a parameter name, then "__", then the parameter value, with a single _ separating each value from the next parameter name. Note that some parameter names themselves contain _ (such as "random_color"). The strings have different but overlapping sets of parameters. I would like to build a data frame with one column per parameter name and one row per element of the input list. If a string does not have a given parameter, the data frame should contain NA, NaN, or any other placeholder in that cell.
How can this be done?
Thanks!
EDIT: If it cannot be done for the original list, what about:
input = ["number__128_alg__hello_min.n__7_max.n__9_full.seq__True_random.color__False_shuffle.shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__hi_min.n__7_max.n__9_full.seq__False_random.color__False_shuffle.shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__what_random.color__False_shuffle.shapes__False.pkl"]
Upvotes: 1
Views: 550
Reputation: 77251
It is possible if you assume that values can't contain the _ character (and that you want to discard the .pkl at the end).
input = [
"number__128_alg__hello_min_n__7_max_n__9_full_seq_embedding__True_random_color__False_shuffle_shapes__False.pkl",
"k__9_window__10_number__128_overlap__True_alg__hi_min_n__7_max_n__9_full_seq_embedding__False_random_color__False_shuffle_shapes__False.pkl",
"k__9_window__10_number__128_overlap__True_alg__what_random_color__False_shuffle_shapes__False.pkl"
]
A simple regular expression should do the trick:
import re
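# Keys may contain single underscores; values may not contain any.
# s[:-4] drops the trailing ".pkl" before matching.
data = [dict(re.findall(r"([^_].*?)__([^_]+)", s[:-4])) for s in input]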
print(data)
Result:
[{'number': '128',
'alg': 'hello',
'min_n': '7',
'max_n': '9',
'full_seq_embedding': 'True',
'random_color': 'False',
'shuffle_shapes': 'False'},
{'k': '9',
'window': '10',
'number': '128',
'overlap': 'True',
'alg': 'hi',
'min_n': '7',
'max_n': '9',
'full_seq_embedding': 'False',
'random_color': 'False',
'shuffle_shapes': 'False'},
{'k': '9',
'window': '10',
'number': '128',
'overlap': 'True',
'alg': 'what',
'random_color': 'False',
'shuffle_shapes': 'False'}]
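Note that every value in the result above is still a string. If typed values are wanted, a minimal sketch along the following lines could convert them while parsing. The helper names parse_name and convert are hypothetical (they are not part of the original answer), and the conversion rules (booleans first, then integers, otherwise keep the string) are an assumption about what the data contains.

```python
import re


def convert(value):
    """Best-effort conversion: booleans, then integers, else keep the string."""
    if value in ("True", "False"):
        return value == "True"
    try:
        return int(value)
    except ValueError:
        return value


def parse_name(name):
    """Parse 'key__value_key__value.pkl' into a dict with converted values.

    Assumes, as above, that values never contain an underscore.
    """
    # Drop the ".pkl" extension if present.
    stem = name[:-4] if name.endswith(".pkl") else name
    pairs = re.findall(r"([^_].*?)__([^_]+)", stem)
    return {key: convert(value) for key, value in pairs}


# Shortened, illustrative file name in the same format as the question.
row = parse_name("k__9_window__10_overlap__True_alg__what_random_color__False.pkl")
print(row)
```

Applying parse_name to each element of the list gives dictionaries that can be used in place of data.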
As a dataframe:
import pandas as pd
pd.DataFrame(data)
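When the dictionaries do not all share the same keys, pd.DataFrame takes the union of the keys as columns and fills the missing cells with NaN, which is what the question asks for. A small self-contained sketch of that behaviour follows; the two shortened file names are made up for illustration and are not the original data.

```python
import re

import pandas as pd

# Hypothetical, shortened file names in the same key__value format.
names = [
    "number__128_alg__hello_min_n__7.pkl",
    "k__9_number__128_alg__what.pkl",
]

# Same regular expression as above; n[:-4] strips the ".pkl" suffix.
rows = [dict(re.findall(r"([^_].*?)__([^_]+)", n[:-4])) for n in names]
df = pd.DataFrame(rows)

print(df)
# Count the missing cells per column: "k" is absent from the first name,
# "min_n" is absent from the second.
print(df.isna().sum())
```

The values remain strings here, so a column such as number may still need converting (for example with pd.to_numeric) before doing arithmetic on it.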
Upvotes: 2