Reputation: 45
I have some badly formatted Excel data that looks like this:
aaaa
bbbb
cccc
aaaa
bbbb
cccc
dddd
Is there a decent way to, using the white space as a delimiter, turn each segment into a pandas row with a default value to fix raggedness?
I'd like the end result to be something like this:
aaaa bbbb cccc ""
aaaa bbbb cccc dddd
Thanks!
Upvotes: 0
Views: 132
Reputation: 3113
I think I was able to get something that works by bringing the column in as a list, and then trying to pull out each section of the list separated by any number of ''
values. There may be a better way to do these things, but maybe the idea is helpful (and it does work for my example at least).
The sample data I used here is like yours, with another little chunk at the end. Hopefully I'm understanding it right.
l = ['aaaa',
'bbbb',
'cccc',
'',
'',
'',
'aaaa',
'bbbb',
'cccc',
'dddd',
'',
'asdf',
'badfd'
]
With that list l
, the idea is to loop through the list for the number of ''
items, and pull out the items between lastspace
(the last blank-space value) and thisspace
(the current blank-space value).
It needed some special handling for when there are multiple ''
values in a row. That is the while thisspace == lastspace+1...
piece, which if that happens it will work to skip it and go to the next ''
value. Because of this, we want to increment our loop counter when we find these consecutives, and so our loop also has to be a while
loop instead of a for
loop (which can't handle manual increments).
l.index()
gives a ValueError
when it can't find the substring, and so when we have gone past the last ''
, we need this error-handling.
d = []
lastspace = -1
i = 0
while i <= l.count(''):
try:
thisspace = l.index('',lastspace+1)
while thisspace == lastspace+1:
lastspace = thisspace
thisspace = l.index('',lastspace+1)
i += 1
d.append(l[lastspace+1:thisspace])
lastspace = thisspace
i += 1
except ValueError:
d.append(l[lastspace+1:])
i += 1
df = pd.DataFrame(d)
The dataframe I get in the end looks like this:
0 1 2 3
0 aaaa bbbb cccc None
1 aaaa bbbb cccc dddd
2 asdf badfd None None
Upvotes: 2
Reputation: 580
I have used the following text file test.txt
to simulate your inputs
aaaa bbbb cccc aaaa bbbb cccc dddd
Is this correct?
df = pd.read_csv('test.txt', delimiter = ' ', header = None)
0 1 2 3 4 5 6 7
0 aaaa bbbb cccc NaN aaaa bbbb cccc dddd
And to further process the dataframe
df = df.fillna('')
print(np.array(df.loc[0,:].to_list()).reshape(-1,4))
[['aaaa' 'bbbb' 'cccc' '']
['aaaa' 'bbbb' 'cccc' 'dddd']]
Upvotes: 1