axon
axon

Reputation: 45

Excel to pandas, whitespace as row delimeter

I have some badly formatted Excel data that looks like this:

aaaa
bbbb
cccc


aaaa
bbbb
cccc
dddd

Is there a decent way to, using the white space as a delimiter, turn each segment into a pandas row with a default value to fix raggedness?

I'd like the end result to be something like this:

aaaa bbbb cccc "" 
aaaa bbbb cccc dddd

Thanks!

Upvotes: 0

Views: 132

Answers (2)

scotscotmcc
scotscotmcc

Reputation: 3113

I think I was able to get something that works by bringing the column in as a list, and then trying to pull out each section of the list separated by any number of '' values. There may be a better way to do these things, but maybe the idea is helpful (and it does work for my example at least).

The sample data I used here is like yours, with another little chunk at the end. Hopefully I'm understanding it right.

l = ['aaaa',
'bbbb',
'cccc',
'',
'',
'',
'aaaa',
'bbbb',
'cccc',
'dddd',
'',
'asdf',
'badfd'
]

With that list l, the idea is to loop through the list for the number of '' items, and pull out the items between lastspace (the last blank-space value) and thisspace (the current blank-space value).

It needed some special handling for when there are multiple '' values in a row. That is the while thisspace == lastspace+1... piece, which if that happens it will work to skip it and go to the next '' value. Because of this, we want to increment our loop counter when we find these consecutives, and so our loop also has to be a while loop instead of a for loop (which can't handle manual increments).

l.index() gives a ValueError when it can't find the substring, and so when we have gone past the last '', we need this error-handling.

d = []
lastspace = -1
i = 0
while i <= l.count(''):
    try:
        thisspace = l.index('',lastspace+1)
        while thisspace == lastspace+1:
            lastspace = thisspace
            thisspace = l.index('',lastspace+1)
            i += 1    
        d.append(l[lastspace+1:thisspace])
        lastspace = thisspace
        i += 1
    except ValueError:
        d.append(l[lastspace+1:])
        i += 1
df = pd.DataFrame(d)

The dataframe I get in the end looks like this:

      0      1     2     3
0  aaaa   bbbb  cccc  None
1  aaaa   bbbb  cccc  dddd
2  asdf  badfd  None  None

Upvotes: 2

Adrian Ang
Adrian Ang

Reputation: 580

I have used the following text file test.txt to simulate your inputs

aaaa bbbb cccc  aaaa bbbb cccc dddd

Is this correct?

df = pd.read_csv('test.txt', delimiter = ' ', header = None)

    0       1       2       3       4       5       6       7
0   aaaa    bbbb    cccc    NaN     aaaa    bbbb    cccc    dddd

And to further process the dataframe

df = df.fillna('')
print(np.array(df.loc[0,:].to_list()).reshape(-1,4))

[['aaaa' 'bbbb' 'cccc' '']
 ['aaaa' 'bbbb' 'cccc' 'dddd']]

Upvotes: 1

Related Questions