Reputation: 191
i have a csv file in the following form.
id results_numbers results creation_time
9680 2 [(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')] 11/10/14 0:23
9690 3 [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')] 12/10/14 0:10
and i would like to read this into pandas and covert to the following:
id results_numbers new_id name creation_time
9680 2 9394 lesbyfaye 11/10/14 0:23
9680 3 999 Kayts & Koilsby 11/10/14 0:23
9690 3 5968 Jacobsonl 12/10/14 0:10
9690 3 47 SarHix 12/10/14 0:10
9690 3 8825 joy 12/10/14 0:10
Upvotes: 1
Views: 59
Reputation: 210832
you can try do it also without looping:
Original DF:
In [184]: df
Out[184]:
creation_time id results \
0 11/10/14 0:23 9680 [(9394, lesbyfaye), (999, Kayts & Koilsby)]
1 12/10/14 0:10 9690 [(5968, Jacobsonl), (47, SarHix), (8825, joy)]
results_number
0 2
1 3
Solution:
In [189]: tmp = (pd.DataFrame.from_dict(df.results.to_dict(), orient='index')
.....: .stack()
.....: .reset_index(level=1, drop=True)
.....: )
In [190]: idx = tmp.index
In [191]: new = (pd.DataFrame(tmp.tolist(), columns=['new_id','name'], index=idx)
.....: .join(df.drop(['results'], axis=1))
.....: )
Result:
In [192]: new
Out[192]:
new_id name creation_time id results_number
0 9394 lesbyfaye 11/10/14 0:23 9680 2
0 999 Kayts & Koilsby 11/10/14 0:23 9680 2
1 5968 Jacobsonl 12/10/14 0:10 9690 3
1 47 SarHix 12/10/14 0:10 9690 3
1 8825 joy 12/10/14 0:10 9690 3
Upvotes: 0
Reputation: 109528
assuming you can read the dataframe:
df = pd.DataFrame({'id': [9680, 9690], 'results_number': [2, 3], 'results': [[(9394, u'lesbyfaye'), (999, u'Kayts & Koilsby')], [(5968, u'Jacobsonl'), (47, u'SarHix'), (8825, u'joy')]], 'creation_time': ["11/10/14 0:23", "12/10/14 0:10"]})
>>>> pd.DataFrame([[row.id, row.results_number, tup[0], tup[1], row.creation_time]
for _, row in df.iterrows()
for tup in row.results],
columns=['id', 'results_numbers', 'new_id', 'name', 'creation_time'])
id results_numbers new_id name creation_time
0 9680 2 9394 lesbyfaye 11/10/14 0:23
1 9680 2 999 Kayts & Koilsby 11/10/14 0:23
2 9690 3 5968 Jacobsonl 12/10/14 0:10
3 9690 3 47 SarHix 12/10/14 0:10
4 9690 3 8825 joy 12/10/14 0:10
EDIT
If you have malformed data, try this:
good_data = []
bad_data = []
for _, row in df.iterrows():
for n, tup in enumerate(row.results):
if len(tup) == 2:
good_data.append([row.id, row.results_number, tup[0], tup[1], row.creation_time])
else:
bad_data.append(n, tup)
Upvotes: 1