Alex

Reputation: 1375

Handle an unknown number of columns when reading a CSV with pandas

I get data from stdin that looks like this:

x
x
x    y
x
x    y    z
...

and I want to create a pandas DataFrame from that input:

df = pd.read_csv(sys.stdin, sep='\t', header=None)

The problem is that the third line of my data has more values than the first, so I get:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

The question is: how can I handle this error when I don't know in advance the length of the longest row (the number of values separated by \t)?

Upvotes: 0

Views: 1291

Answers (1)

Valdi_Bo

Reputation: 30971

The whole task can be performed in a single instruction, without counting the elements in each row.

I prepared an example that reads from a string, using io.StringIO:

import io
import pandas as pd

df = pd.DataFrame([ln.rstrip().split('\t') for ln in io.StringIO(txt)]).fillna('')

The list comprehension splits each source line into a list of tab-separated fragments.

The resulting list of rows is then passed as the data parameter to pd.DataFrame. Note that such a list can contain rows of different lengths.
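As a minimal sketch of that behaviour, assuming a hand-written list of rows (the `rows` variable is only for illustration and is not part of the original answer), the frame ends up as wide as the longest row, and the gaps in shorter rows are filled with missing values:

```python
import pandas as pd

# Hypothetical ragged input: rows of length 1, 2 and 3.
rows = [['x'], ['x', 'y'], ['x', 'y', 'z']]

df = pd.DataFrame(rows)

# The frame is as wide as the longest row.
print(df.shape)

# Count the missing values pandas inserted in each column.
print(df.isna().sum().tolist())
```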

I also added fillna('') to convert each missing value into an empty string (you are free to delete it if you wish).

To run the test, I defined the source variable as:

txt = '''x
x
x\ty
x
x\ty\tz
x
x\ty\tz\tv'''

and executed the above code, getting:

   0  1  2  3
0  x         
1  x         
2  x  y      
3  x         
4  x  y  z   
5  x         
6  x  y  z  v

In the target version, replace reading from a string with reading from stdin.
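A possible shape for that stdin version is sketched below. The helper name `read_ragged_tsv` is hypothetical, and the sketch uses an io.StringIO object as a stand-in for stdin so that it can be run without piping anything in. It relies only on the stream being iterable line by line, which holds for both io.StringIO and sys.stdin.

```python
import io
import sys

import pandas as pd


def read_ragged_tsv(stream):
    """Build a DataFrame from tab-separated lines of varying length.

    Hypothetical helper: `stream` is any text stream that yields lines,
    such as sys.stdin or io.StringIO.
    """
    # Same approach as above: split each line on tabs, let pandas pad
    # the shorter rows, then replace missing values with ''.
    rows = [line.rstrip().split('\t') for line in stream]
    return pd.DataFrame(rows).fillna('')


# Stand-in for real input; in the target version pass sys.stdin instead:
#     df = read_ragged_tsv(sys.stdin)
sample = io.StringIO('x\nx\ty\nx\ty\tz\n')
df = read_ragged_tsv(sample)
print(df)
```

As in the original one-liner, rstrip() with no argument also removes trailing tabs, so empty fields at the end of a line are dropped before splitting.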

Upvotes: 1
