Reputation: 1375
I get data from stdin that looks like
x
x
x y
x
x y z
...
and I want to create a pandas DataFrame from that input with
df = pd.read_csv(sys.stdin, sep='\t', header=None)
but the third line of my data has more values than the first one, so I get
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
The question is: how can I handle this error when I don't know in advance the length of the longest row (elements separated by \t)?
Upvotes: 0
Views: 1291
Reputation: 30971
The whole task can be performed in a single instruction, without any counting of elements in each row.
I prepared such an example reading from a string, using io.StringIO:
import io
import pandas as pd

df = pd.DataFrame([ln.rstrip().split('\t') for ln in
                   io.StringIO(txt).readlines()]).fillna('')
The list comprehension converts each source line into a list of fragments (the pieces separated by tabs).
This list is then passed as the data parameter to pd.DataFrame. Note that such a list of rows can contain rows of different lengths; shorter rows are padded with missing values.
I also added fillna('') to convert each missing value into an empty string (you are free to delete it if you wish).
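To illustrate the padding described above, here is a minimal sketch with three hand-written rows (the `rows` list is made up for the illustration and is not part of the original data). It counts the missing values per column before `fillna('')` is applied and checks that none remain afterwards.

```python
import pandas as pd

# Hypothetical rows of different lengths, as split() would produce them
rows = [['x'], ['x', 'y'], ['x', 'y', 'z']]

# pd.DataFrame pads the shorter rows with missing values
df = pd.DataFrame(rows)
missing_per_column = df.isna().sum().tolist()
print(missing_per_column)

# fillna('') replaces every missing value with an empty string
filled = df.fillna('')
print(filled.isna().sum().sum())
```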
To run the test, I defined the source variable as:
txt = '''x
x
x y
x
x y z
x
x y z v'''
and executed the above code, getting:
   0  1  2  3
0  x
1  x
2  x  y
3  x
4  x  y  z
5  x
6  x  y  z  v
In the target version, replace reading from a string with reading from stdin.
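A possible shape for that target version is sketched below. The helper name `read_ragged_tsv` is hypothetical, and the sample input is a stand-in so the snippet runs without piped data. It iterates over the stream directly, so any file-like object works, and it strips only the line ending, which (unlike a bare `rstrip()`) keeps trailing empty fields.

```python
import io

import pandas as pd


def read_ragged_tsv(stream):
    """Build a DataFrame from tab-separated lines of varying length.

    `stream` is any iterable of text lines, e.g. sys.stdin or io.StringIO.
    """
    # Strip only the line ending, then split each line on tabs
    rows = [line.rstrip('\r\n').split('\t') for line in stream]
    # Shorter rows are padded with missing values; turn them into ''
    return pd.DataFrame(rows).fillna('')


# Stand-in for stdin so the example is self-contained
sample = io.StringIO("x\nx\ty\nx\ty\tz\n")
df = read_ragged_tsv(sample)
print(df)
```

In the real script, the call would presumably become `df = read_ragged_tsv(sys.stdin)` after `import sys`.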
Upvotes: 1