576i
576i

Reputation: 8362

Python Pandas: Grouping a key : value csv import into rows

I have a large csv file that I want to group into rows It's about a million lines each which will group to 10000 rows.

Each line of the file is either a comment or starts with a number, followed by a colon, followed by the value that may include more colons.

Each line starting with 0: indicates the start of a new group, there's a 0: as the last line, too

Example:

# comment line
# comment line
0:
1:HELLO
2:WORLD
3:1.0
4:5.0
5:TRUE
0:
2:HEY
6:1
7:12
# COMMENT LINE
0: 
1: FILE
3: 2.0
10: http://www.google.com
0:

I'm reading the file into a DataFrame like this. (The separator is not perfect, but works with the data I have)

df = pd.read_csv(FILENAME, 
                 sep='(?<=\d):', 
                 comment='#', 
                 names=['col', 'val'], 
                 engine='python')

This results in

    col val
0   0   
1   1   HELLO
2   2   WORLD
3   3   1.0
4   4   5.0
5   5   TRUE
6   0   
7   2   HEY
8   6   1
9   7   12
10  0   
11  1    FILE
12  3    2.0
13  10   http://www.google.com
14  0

This should be transformed into

pd.DataFrame([
    {1: "HELLO", 2: "WORLD", 3: 1.0, 4: 5.0, 5: "TRUE"},
    {2: "HEY", 6: 1, 7: 12},
    {1: "FILE", 3: 2.0, 10: "http://www.google.com"}
])

which looks like this

    1   2   3   4   5   6   7   10
0   HELLO   WORLD   1.0 5.0 TRUE            
1       HEY             1.0 12.0    
2   FILE        2.0                 http://www.google.com

Any hints on how to do this grouping?

Can I use the read_csv c-engine to split the lines by the first colon to speed things up?

Upvotes: 1

Views: 97

Answers (1)

gyx-hh
gyx-hh

Reputation: 1431

After reading your csv data try the following to get the desired output:

new = pd.concat([df.loc[i].set_index('col').T for i in np.split(df.index, np.where(df.col==0)[0])[1:]]).reset_index()
new.columns = new.columns.rename('')
del new['index']
print(new)

Output:

    0   1       2       3     4     5    6      7    10
0   NaN HELLO   WORLD   1.0   5.0   TRUE NaN    NaN  NaN
1   NaN NaN     HEY     NaN   NaN   NaN  1      12   NaN
2   NaN FILE    NaN     2.0   NaN   NaN  NaN    NaN  http://www.google.com

Update This might be marginally faster by removing the need of using .loc

pd.concat([i.T for i in np.split(df.set_index('col'), np.where(df.col == 0)[0])[1:]]).reset_index()

Upvotes: 1

Related Questions