Reputation: 181
I'm needing to separate a row into multiple columns, for a previous post was able to separate that, but some of the rows are empty and because of that, I get this error:
ValueError: Index contains duplicate entries, cannot reshape
here is a sample dataset to mock up this issue:
myData = [['Abc: 9.22 Mno: 6.90 IExplorer 0.00 OCa: 0.00 Foo: 0.00'],
['Abc: 0.61 Mno: 0.14'],
[''],
['MCheese: (37.20) dimes: (186.02) Feria: (1,586.02)'],
['Abc: 16.76 Mno: 4.25 OMG: 63.19'],
['yonka: 19.27'],
['Dome: (552.23)'],
['Fray: 2,584.96'],
['CC: (83.31)'],
[''],
['Abc: 307.34 Mno: 18.40 Feria: 509.67'],
['IExplorer: 26.28 OCa: 26.28 Foo: 730.68'],
['Abc: 122.66 Mno: 11.85 Feria: 213.24'],
[''],
['Wonka: (13.67) Fray: (1,922.48)'],
['Mno: 18.19 IExplorer: 0.00 OCa: 0.00 Foo: 0.00'],
['Abc: 74.06 Mno: 12.34 Feria: 124.42 MCheese: (4.07)'],
[''],
['Abc: 45.96 Mno: 18.98 IExplorer: 0.00 OCa: 0.00 Foo: 0.00'],
['IExplorer: 0.00 OCa: 0.00 Dome: (166.35) Foo: 0.00'],
['']]
df7 = pd.DataFrame(myData)
df7.columns = ['Original']
df7['Original'] = df7['Original'].str.replace(" ","")
df7['Original']
after separating the columns with a regex from a previous post I get results:
df8 = df7['Original'].str.extractall(r'^(.*?):([\(\)(\,)0-9.]+)').reset_index().fillna(0)
df8 = df8.pivot(index='level_0', columns=0, values=1).rename_axis(index=None, columns=None).fillna(0)
df8
this gives me this result:
Abc CC Dome Fries IExplorer MCheese Mno Wonka yonka
0 9.22 0 0 0 0 0 0 0 0
1 0.61 0 0 0 0 0 0 0 0
3 0 0 0 0 0 (37.20) 0 0 0
4 16.76 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 19.27
6 0 0 (552.23) 0 0 0 0 0 0
7 0 0 0 2,584.96 0 0 0 0 0
8 0 (83.31) 0 0 0 0 0 0 0
10 307.34 0 0 0 0 0 0 0 0
11 0 0 0 0 26.28 0 0 0 0
12 122.66 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 (13.67) 0
15 0 0 0 0 0 0 18.19 0 0
16 74.06 0 0 0 0 0 0 0 0
18 45.96 0 0 0 0 0 0 0 0
19 0 0 0 0 0.00 0 0 0 0
if I change the regex the number of columns increase but I do not get the entirety of the dataset. This second part for this particular sample gives me more columns with this last snippet code.
df8 = df7['Original'].str.extractall(r'(.*?):([\(\)(\,)0-9.]+)').reset_index().fillna(0)
df8 = df8.pivot(index='level_0', columns=0, values=1).rename_axis(index=None, columns=None).fillna(0)
df8
Although, in my particular case the first line gives me more columns than the second one. However none of them count the empty rows.
Is there any way I can count those empty rows within the dateset whenever it finds an empty row? in total there 21 rows, I can only get to 19 shown and count.
Upvotes: 1
Views: 46
Reputation: 71687
We can use str.findall
to find all the matching occurrences of regex pattern from each row, then map
the occurrences to dict
and create a new dataframe. This approach will avoid re-indexing the dataframe. Further you also have to fix your regex pattern to properly capture matching pairs.
s = df7['Original'].str.findall(r'([^:0-9]+):\(?([0-9.,]+)\)?')
df_out = pd.DataFrame(map(dict, s), index=s.index).fillna(0)
>>> df_out
Abc Mno OCa Foo MCheese dimes Feria OMG yonka Dome Fray CC IExplorer Wonka
0 9.22 6.90 0.00 0.00 0 0 0 0 0 0 0 0 0 0
1 0.61 0.14 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 37.20 186.02 1,586.02 0 0 0 0 0 0 0
4 16.76 4.25 0 0 0 0 0 63.19 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 19.27 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 552.23 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 2,584.96 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 83.31 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 307.34 18.40 0 0 0 0 509.67 0 0 0 0 0 0 0
11 0 0 26.28 730.68 0 0 0 0 0 0 0 0 26.28 0
12 122.66 11.85 0 0 0 0 213.24 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 1,922.48 0 0 13.67
15 0 18.19 0.00 0.00 0 0 0 0 0 0 0 0 0.00 0
16 74.06 12.34 0 0 4.07 0 124.42 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 45.96 18.98 0.00 0.00 0 0 0 0 0 0 0 0 0.00 0
19 0 0 0.00 0.00 0 0 0 0 0 166.35 0 0 0.00 0
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Upvotes: 1