Wendy Velasquez
Wendy Velasquez

Reputation: 181

re-indexing a dataset that has empty rows that are being transformed as columns using pivot

I'm needing to separate a row into multiple columns, for a previous post was able to separate that, but some of the rows are empty and because of that, I get this error:

ValueError: Index contains duplicate entries, cannot reshape

here is a sample dataset to mock up this issue:

myData = [['Abc: 9.22   Mno: 6.90     IExplorer 0.00  OCa: 0.00  Foo: 0.00'],
          ['Abc: 0.61   Mno: 0.14'], 
          [''],
          ['MCheese: (37.20)  dimes: (186.02)  Feria: (1,586.02)'],
          ['Abc: 16.76   Mno: 4.25     OMG: 63.19'],
          ['yonka: 19.27'],
          ['Dome: (552.23)'],
          ['Fray: 2,584.96'],
          ['CC: (83.31)'],
          [''],
          ['Abc: 307.34   Mno: 18.40    Feria: 509.67'],
          ['IExplorer: 26.28  OCa: 26.28  Foo: 730.68'],
          ['Abc: 122.66   Mno: 11.85     Feria: 213.24'],
          [''],
          ['Wonka: (13.67)  Fray: (1,922.48)'],
          ['Mno: 18.19     IExplorer: 0.00  OCa: 0.00  Foo: 0.00'],
          ['Abc: 74.06   Mno: 12.34    Feria: 124.42   MCheese: (4.07)'],
          [''],
          ['Abc: 45.96   Mno: 18.98     IExplorer: 0.00  OCa: 0.00  Foo: 0.00'],
          ['IExplorer: 0.00  OCa: 0.00  Dome: (166.35)  Foo: 0.00'],
          ['']]

df7 = pd.DataFrame(myData)
df7.columns = ['Original']
df7['Original'] = df7['Original'].str.replace(" ","") 
df7['Original']

after separating the columns with a regex from a previous post I get results:

df8 = df7['Original'].str.extractall(r'^(.*?):([\(\)(\,)0-9.]+)').reset_index().fillna(0)   
df8 = df8.pivot(index='level_0', columns=0, values=1).rename_axis(index=None, columns=None).fillna(0)
df8

this gives me this result:

    Abc    CC    Dome    Fries    IExplorer MCheese Mno Wonka   yonka
0   9.22    0     0        0          0       0       0    0      0
1   0.61    0     0        0          0       0       0    0      0
3   0       0     0        0          0    (37.20)    0    0      0
4   16.76   0     0        0          0       0       0    0      0
5   0       0     0        0          0       0       0    0    19.27
6   0       0   (552.23)   0          0       0       0    0      0
7   0       0     0    2,584.96       0       0       0    0      0
8   0   (83.31)   0        0          0       0       0    0      0
10  307.34  0     0        0          0       0       0    0      0
11  0       0     0        0        26.28     0       0    0      0
12  122.66  0     0        0          0       0       0    0      0
14  0       0     0        0          0       0       0  (13.67)  0
15  0       0     0        0          0       0    18.19    0     0
16  74.06   0     0        0          0       0       0     0     0
18  45.96   0     0        0          0       0       0     0     0
19  0       0     0        0         0.00     0       0     0     0

if I change the regex the number of columns increase but I do not get the entirety of the dataset. This second part for this particular sample gives me more columns with this last snippet code.

df8 = df7['Original'].str.extractall(r'(.*?):([\(\)(\,)0-9.]+)').reset_index().fillna(0)   
df8 = df8.pivot(index='level_0', columns=0, values=1).rename_axis(index=None, columns=None).fillna(0)
df8 

Although, in my particular case the first line gives me more columns than the second one. However none of them count the empty rows.

Is there any way I can count those empty rows within the dateset whenever it finds an empty row? in total there 21 rows, I can only get to 19 shown and count.

Upvotes: 1

Views: 46

Answers (1)

Shubham Sharma
Shubham Sharma

Reputation: 71687

We can use str.findall to find all the matching occurrences of regex pattern from each row, then map the occurrences to dict and create a new dataframe. This approach will avoid re-indexing the dataframe. Further you also have to fix your regex pattern to properly capture matching pairs.

s = df7['Original'].str.findall(r'([^:0-9]+):\(?([0-9.,]+)\)?')
df_out = pd.DataFrame(map(dict, s), index=s.index).fillna(0)

>>> df_out

       Abc    Mno    OCa     Foo MCheese   dimes     Feria    OMG  yonka    Dome      Fray     CC IExplorer  Wonka
0     9.22   6.90   0.00    0.00       0       0         0      0      0       0         0      0         0      0
1     0.61   0.14      0       0       0       0         0      0      0       0         0      0         0      0
2        0      0      0       0       0       0         0      0      0       0         0      0         0      0
3        0      0      0       0   37.20  186.02  1,586.02      0      0       0         0      0         0      0
4    16.76   4.25      0       0       0       0         0  63.19      0       0         0      0         0      0
5        0      0      0       0       0       0         0      0  19.27       0         0      0         0      0
6        0      0      0       0       0       0         0      0      0  552.23         0      0         0      0
7        0      0      0       0       0       0         0      0      0       0  2,584.96      0         0      0
8        0      0      0       0       0       0         0      0      0       0         0  83.31         0      0
9        0      0      0       0       0       0         0      0      0       0         0      0         0      0
10  307.34  18.40      0       0       0       0    509.67      0      0       0         0      0         0      0
11       0      0  26.28  730.68       0       0         0      0      0       0         0      0     26.28      0
12  122.66  11.85      0       0       0       0    213.24      0      0       0         0      0         0      0
13       0      0      0       0       0       0         0      0      0       0         0      0         0      0
14       0      0      0       0       0       0         0      0      0       0  1,922.48      0         0  13.67
15       0  18.19   0.00    0.00       0       0         0      0      0       0         0      0      0.00      0
16   74.06  12.34      0       0    4.07       0    124.42      0      0       0         0      0         0      0
17       0      0      0       0       0       0         0      0      0       0         0      0         0      0
18   45.96  18.98   0.00    0.00       0       0         0      0      0       0         0      0      0.00      0
19       0      0   0.00    0.00       0       0         0      0      0  166.35         0      0      0.00      0
20       0      0      0       0       0       0         0      0      0       0         0      0         0      0

Upvotes: 1

Related Questions