Reputation: 469
so I have this CSV and I would like to do the following:
Original data:
Parsed Data:
So, to put in words, if a column has commas then I want to create a new column with only one value and delete the column which has multiple values.
For example: N2 has I1, I3 and I4. Hence the new data gets 3 columns, containing one value only.
I want to make it dynamic in such a way that all the permutations are reflected. Like in the case of N3 that has 2 places and 2 items.
I am trying to use python's pandas to do this. Some help would be appreciated.
Upvotes: 1
Views: 85
Reputation: 148880
You can avoid the use of pandas. If you want to stick with the standard csv
module, you simply have to split each field on comma (','
) and then iterate on the splitted elements.
Code could be assuming the input delimiter is a semicolon (;
) (I cannot know what it is except it cannot be a comma):
with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
rd = csv.DictReader(fd, delimiter=';')
wr = csv.writer(fdout)
_ = wr.writerow(rd.fieldnames)
for row in rd:
for i in row['Item'].split(','):
i = i.strip()
if len(i) != 0:
for p in row['Place'].split(','):
p = p.strip()
if len(p) != 0:
for n in row['Name'].split(','):
n = n.strip()
if len(n) != 0:
wr.writerow((n,p,i))
Output is:
Name,Place,Item
N1,P1,I1
N2,P2,I1
N2,P2,I3
N2,P2,I4
N3,P2,I2
N3,P3,I2
N3,P2,I5
N3,P3,I5
Upvotes: 0
Reputation: 186
Here is another option:
df['Place'] = df['Place'].str.split(',')
df['Item'] = df['Item'].str.split(',')
exploded = pd.DataFrame([
a + [p, t] for *a, P, T in df.values
for p in P for t in T
], columns=df.columns)
And the output:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1
2 N2 P2 I3
3 N2 P2 I4
4 N3 P2 I2
5 N3 P2 I5
6 N3 P3 I2
7 N3 P3 I5
Upvotes: 1
Reputation: 143
You can use iterrows()
:
df = pd.DataFrame({'Name': ['N1', 'N2', 'N3'], 'Place':['P1', 'P2','P2,P3'], 'Item':['I1,', 'I1,I3,I4', 'I2,I5']})
result = pd.DataFrame()
new_result = pd.DataFrame()
df['Place'] = df['Place'].apply(lambda x: x.strip(','))
df['Item'] = df['Item'].apply(lambda x: x.strip(','))
for a,b in df.iterrows():
curr_row = df.iloc[a]
temp = ((curr_row['Place'].split(',')))
for x in temp:
curr_row['Place'] = x
result = result.append(curr_row, ignore_index=True)
for a,b in result.iterrows():
curr_row = result.iloc[a]
temp = ((curr_row['Item'].split(',')))
for x in temp:
curr_row['Item'] = x
new_result = new_result.append(curr_row, ignore_index=True)
Output:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1
2 N2 P2 I3
3 N2 P2 I4
4 N3 P2 I2
5 N3 P2 I5
6 N3 P3 I2
7 N3 P3 I5
This is the simplest way you can achieve your desired output.
Upvotes: 0
Reputation: 18201
You are effectively attempting to take the Cartesian product of each row, then binding the result back into a DataFrame
. As such, you could use itertools
and do something like
from itertools import chain, product
df_lists = df.applymap(lambda s: s.split(','))
pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)), columns=df.columns)
With your example input:
In [334]: df
Out[334]:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1,I3,I4
2 N3 P2,P3 I2,I5
In [336]: df_lists = df.applymap(lambda s: s.split(','))
In [337]: pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)), columns=df.columns)
Out[337]:
Name Place Item
0 N1 P1 I1
1 N2 P2 I1
2 N2 P2 I3
3 N2 P2 I4
4 N3 P2 I2
5 N3 P2 I5
6 N3 P3 I2
7 N3 P3 I5
Upvotes: 0