Python Pandas: Grouping a key : value csv import into rows

Question

I have a large csv file that I want to group into rows It's about a million lines each which will group to 10000 rows.

Each line of the file is either a comment or starts with a number, followed by a colon, followed by the value that may include more colons.

Each line starting with 0: indicates the start of a new group, there's a 0: as the last line, too

Example:

# comment line
# comment line
0:
1:HELLO
2:WORLD
3:1.0
4:5.0
5:TRUE
0:
2:HEY
6:1
7:12
# COMMENT LINE
0: 
1: FILE
3: 2.0
10: http://www.google.com
0:

I'm reading the file into a DataFrame like this. (The separator is not perfect, but works with the data I have)

df = pd.read_csv(FILENAME, 
                 sep='(?<=\d):', 
                 comment='#', 
                 names=['col', 'val'], 
                 engine='python')

This results in

    col val
0   0   
1   1   HELLO
2   2   WORLD
3   3   1.0
4   4   5.0
5   5   TRUE
6   0   
7   2   HEY
8   6   1
9   7   12
10  0   
11  1    FILE
12  3    2.0
13  10   http://www.google.com
14  0

This should be transformed into

pd.DataFrame([
    {1: "HELLO", 2: "WORLD", 3: 1.0, 4: 5.0, 5: "TRUE"},
    {2: "HEY", 6: 1, 7: 12},
    {1: "FILE", 3: 2.0, 10: "http://www.google.com"}
])

which looks like this

    1   2   3   4   5   6   7   10
0   HELLO   WORLD   1.0 5.0 TRUE            
1       HEY             1.0 12.0    
2   FILE        2.0                 http://www.google.com

Any hints on how to do this grouping?

Can I use the read_csv c-engine to split the lines by the first colon to speed things up?

Python Pandas: Grouping a key : value csv import into rows

Answers (1)

Related Questions