Learner
Learner

Reputation: 5302

How to read unstructured csv in pandas

I have got a messy csv file (just extension is csv). But when i open this file in ms excel with ; delimited it looks like as below(dummy sample)-

I investigated this file and found following:

  1. Some column has name and others does not have.
  2. The length of row is variable but contains newline char to trigger next line start.

Question:

How can i read this table in pandas whereas all existing columns(headers) remain and blank columns are filled with consecutive numbers caring variable length of rows.

In fact i want to take 8 cell-value again and again until any row exhausts. from the header-less columns for analysis.

N.B. I have tried usecols,names,skiprows,sep etc in read_csv but with no success

data

EDIT

Added sample input and expected output (formatting is worse but pandas.read_clipboard() should work)

INPUT

car_id   car_type    entry_gate  entry_time(ms)  exit_gate   exit_time(ms)   traveled_dist(m)    avg_speed(m/s)  trajectory(x[m]    y[m]    speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2] timestamp[ms]   )                                           
24   Bus    25  4300    26  48520   118.47  2.678999    509552.78   5039855.59  10.074  0.429   0.2012  0   509552.97   5039855.57  10.0821 0.3853  0.2183  20                      
25   Car    25  20  26  45900   113.91  2.482746    509583.7    5039848.78  4.5344  -0.1649 0.2398  0   509583.77   5039848.71                                      
26   Car     -   -   -   -  109.68  8.859805    509572.75   5039862.75  4.0734  -0.7164 -0.1066 0   509572.67   5039862.76  4.0593  -0.7021 -0.1141 20  509553.17   5039855.55  10.0886 0.2636  0.2356  40
27   Car     -   -   -   -  119.84  3.075936    509582.73   5039862.78  1.191   0.5247  0.0005  0   509582.71   5039862.78  1.2015  0.5322                              
28   Car     -   -   -   -  129.64  4.347466    509591.07   5039862.9   1.6473  0.1987  -0.0033 0   509591.04   5039862.89  1.6513  0.2015  -0.0036 20  

Expected OUTPUT(dataframe)

car_id   car_type    entry_gate  entry_time(ms)  exit_gate   exit_time(ms)   traveled_dist(m)    avg_speed(m/s)  trajectory(x[m]    y[m]    speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2] timestamp[ms]   1   2   3   4   5   6   7   8   9   10  11  12
24   Bus    25  4300    26  48520   118.47  2.678999    509552.78   5039855.59  10.074  0.429   0.2012  0   509552.97   5039855.57  10.0821 0.3853  0.2183  20                      
25   Car    25  20  26  45900   113.91  2.482746    509583.7    5039848.78  4.5344  -0.1649 0.2398  0   509583.77   5039848.71                                      
26   Car     -   -   -   -  109.68  8.859805    509572.75   5039862.75  4.0734  -0.7164 -0.1066 0   509572.67   5039862.76  4.0593  -0.7021 -0.1141 20  509553.17   5039855.55  10.0886 0.2636  0.2356  40
27   Car     -   -   -   -  119.84  3.075936    509582.73   5039862.78  1.191   0.5247  0.0005  0   509582.71   5039862.78  1.2015  0.5322                              
28   Car     -   -   -   -  129.64  4.347466    509591.07   5039862.9   1.6473  0.1987  -0.0033 0   509591.04   5039862.89  1.6513  0.2015  -0.0036 20      

Upvotes: 4

Views: 5542

Answers (1)

jezrael
jezrael

Reputation: 863166

Preprocessing

Function get_names() open file, check max length of splitted rows. Then I read first row and add missing values from max length.

Last value of first row is ), so I remove it by firstline[:-1] and then I add to range missing columns by +1 rng = range(1, m - lenfirstline + 2). +2 is because range starts from value 1.

Then you can use function read_csv, skipp first line and as names use output from get_names().

import pandas as pd
import csv
 
#preprocessing
def get_names():
    with open('test/file.txt', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            if i ==0:
                firstline = ''.join(row).split()
                lenfirstline = len(firstline)
                #print firstline, lenfirstline
            num.append(len(''.join(row).split()))
        m = max(num)
        rng = range(1, m - lenfirstline + 2)
        #remove )
        rng = firstline[:-1] + rng
        return rng

#names is list return from function
df = pd.read_csv('test/file.txt', sep="\s+", names=get_names(), index_col=[0], skiprows=1)
#temporaly display 10 rows and 30 columns
with pd.option_context('display.max_rows', 10, 'display.max_columns', 30):
    print df
        
       car_type entry_gate entry_time(ms) exit_gate exit_time(ms)  \
car_id                                                              
24          Bus         25           4300        26         48520   
25          Car         25             20        26         45900   
26          Car          -              -         -             -   
27          Car          -              -         -             -   
28          Car          -              -         -             -   

        traveled_dist(m)  avg_speed(m/s)  trajectory(x[m]        y[m]  \
car_id                                                                  
24                118.47        2.678999        509552.78  5039855.59   
25                113.91        2.482746        509583.70  5039848.78   
26                109.68        8.859805        509572.75  5039862.75   
27                119.84        3.075936        509582.73  5039862.78   
28                129.64        4.347466        509591.07  5039862.90   

        speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2]  timestamp[ms]  \
car_id                                                                   
24         10.0740              0.4290           0.2012              0   
25          4.5344             -0.1649           0.2398              0   
26          4.0734             -0.7164          -0.1066              0   
27          1.1910              0.5247           0.0005              0   
28          1.6473              0.1987          -0.0033              0   

                1           2        3       4       5   6          7  \
car_id                                                                  
24      509552.97  5039855.57  10.0821  0.3853  0.2183  20        NaN   
25      509583.77  5039848.71      NaN     NaN     NaN NaN        NaN   
26      509572.67  5039862.76   4.0593 -0.7021 -0.1141  20  509553.17   
27      509582.71  5039862.78   1.2015  0.5322     NaN NaN        NaN   
28      509591.04  5039862.89   1.6513  0.2015 -0.0036  20        NaN   

                 8        9      10      11  12  
car_id                                           
24             NaN      NaN     NaN     NaN NaN  
25             NaN      NaN     NaN     NaN NaN  
26      5039855.55  10.0886  0.2636  0.2356  40  
27             NaN      NaN     NaN     NaN NaN  
28             NaN      NaN     NaN     NaN NaN  

Postprocessing

First you have to estimate max number of columns N. I know their real number is 26, so I estimate to N = 30. Function read_csv with parameter name = range(N) return NaN columns, what are difference between estimated and real length of columns.

After dropping you can select first row with columns names, where are not NaN (I remove last column ) by [:-1] ) - df1.loc[0].dropna()[:-1]. Then you can append new Series with range from 1 to length of NaN values in first row. Last first row is removed by subset of df.

#set more as estimated number of columns
N = 30
    
df1 = pd.read_csv('test/file.txt', sep="\s+", names=range(N))

df1 = df1.dropna(axis=1, how='all')  #drop columns with all NaN

df1.columns = df1.loc[0].dropna()[:-1].append(pd.Series(range(1, len(df1.columns) - len(df1.loc[0].dropna()[:-1]) + 1 )))

#remove first line with uncomplete column names
df1 = df1.ix[1:]
print df1.head()

Upvotes: 2

Related Questions