Reputation: 8362
I have a large CSV file that I want to group into rows. It's about a million lines, which will group into roughly 10,000 rows.
Each line of the file is either a comment or starts with a number, followed by a colon, followed by a value that may itself contain more colons.
Each line starting with 0: indicates the start of a new group; there is also a 0: as the last line.
Example:
# comment line
# comment line
0:
1:HELLO
2:WORLD
3:1.0
4:5.0
5:TRUE
0:
2:HEY
6:1
7:12
# COMMENT LINE
0:
1: FILE
3: 2.0
10: http://www.google.com
0:
I'm reading the file into a DataFrame like this (the separator is not perfect, but it works with the data I have):
import pandas as pd

df = pd.read_csv(FILENAME,
                 sep=r'(?<=\d):',  # split only at a colon preceded by a digit
                 comment='#',
                 names=['col', 'val'],
                 engine='python')
This results in
col val
0 0
1 1 HELLO
2 2 WORLD
3 3 1.0
4 4 5.0
5 5 TRUE
6 0
7 2 HEY
8 6 1
9 7 12
10 0
11 1 FILE
12 3 2.0
13 10 http://www.google.com
14 0
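The lookbehind in the separator matches only a colon that directly follows a digit, which is why the colons inside the URL survive. A quick check with re.split:

import re

# splits at the colon after "10", but not inside "http://",
# since those colons are not preceded by a digit
re.split(r'(?<=\d):', '10:http://www.google.com')
# -> ['10', 'http://www.google.com']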
This should be transformed into
pd.DataFrame([
{1: "HELLO", 2: "WORLD", 3: 1.0, 4: 5.0, 5: "TRUE"},
{2: "HEY", 6: 1, 7: 12},
{1: "FILE", 3: 2.0, 10: "http://www.google.com"}
])
which looks like this
       1      2    3    4     5    6     7                     10
0  HELLO  WORLD  1.0  5.0  TRUE  NaN   NaN                    NaN
1    NaN    HEY  NaN  NaN   NaN  1.0  12.0                    NaN
2   FILE    NaN  2.0  NaN   NaN  NaN   NaN  http://www.google.com
Any hints on how to do this grouping?
Can I use the read_csv C engine to split the lines on the first colon to speed things up?
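One shape I can imagine for that (an untested sketch; it assumes a separator byte, here \x01, that never occurs in the data) is to let the C engine read whole lines and do the splitting afterwards with vectorized string methods:

import pandas as pd

# read whole lines with the fast C engine by using a separator
# that never matches, then split each line on its first colon
raw = pd.read_csv(FILENAME, sep='\x01', comment='#',
                  names=['line'], engine='c')
df = raw['line'].str.split(':', n=1, expand=True)
df.columns = ['col', 'val']
df['col'] = df['col'].astype(int)

Whether that is actually faster than the python engine with the regex separator is something I'd have to profile.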
Upvotes: 1
Views: 97
Reputation: 1431
After reading your CSV data, try the following to get the desired output:
import numpy as np

# split the index at every 0: marker, transpose each block, and stack them
groups = np.split(df.index, np.where(df.col == 0)[0])[1:]
new = pd.concat([df.loc[i].set_index('col').T for i in groups]).reset_index()
new.columns = new.columns.rename('')
del new['index']
print(new)
Output:
     0      1      2    3    4     5    6    7                     10
0  NaN  HELLO  WORLD  1.0  5.0  TRUE  NaN  NaN                    NaN
1  NaN    NaN    HEY  NaN  NaN   NaN    1   12                    NaN
2  NaN   FILE    NaN  2.0  NaN   NaN  NaN  NaN  http://www.google.com
Update: This might be marginally faster, since it removes the need for .loc:

blocks = np.split(df.set_index('col'), np.where(df.col == 0)[0])[1:]
pd.concat([i.T for i in blocks]).reset_index()
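A vectorized sketch along the same lines, in case concatenating thousands of small frames turns out to be slow (this assumes each field number appears at most once per group, since pivot raises on duplicates):

# number the groups with a running count of the 0: marker lines,
# drop the markers, and pivot the field numbers into columns
row = (df.col == 0).cumsum()
body = df[df.col != 0].assign(row=row)
new = body.pivot(index='row', columns='col', values='val').reset_index(drop=True)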
Upvotes: 1