Reputation: 31073
In a previous post, I found out that the pandas read_table() function can handle variable-length whitespace as a delimiter if you use the read_table('datafile', sep=r'\s*') construction. While this works great for many of my files, it fails on others that look highly similar.
EDIT: The examples I originally posted did not reproduce the problem when others tried them, so I am posting links to the original files for AY907538 and AY942707 and leaving the error message that I cannot manage to resolve.
## filename: AY942707
# this will load with no problem
data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
## filename: AY907538
data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
which will generate the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-131d10d1fb1d> in <module>()
2
3 #temp = get_dataset('AY907538.hmmdomtblout')
----> 4 data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
5 #data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_table(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)
282 kwds['encoding'] = None
283
--> 284 return _read(TextParser, filepath_or_buffer, kwds)
285
286 @Appender(_read_fwf_doc)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
189 return parser
190
--> 191 return parser.get_chunk()
192
193 @Appender(_read_csv_doc)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
779 msg = ('Expecting %d columns, got %d in row %d' %
780 (col_len, zip_len, row_num))
--> 781 raise ValueError(msg)
782
783 data = dict((k, v) for k, v in izip(self.columns, zipped_content))
ValueError: Expecting 26 columns, got 28 in row 6
Upvotes: 0
Views: 772
Reputation: 69266
The last field, description of target, holds multiple words in both files. Since whitespace is used as the separator, read_table does not treat description of target as a single column; each word in that field lands in a separate column. In AY942707, the description of target on the first line holds more words than on any other line; this is not the case in AY907538. read_table determines the number of columns from the first line, and every following line must have the same number of columns or fewer.
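One possible workaround, as a minimal sketch rather than a tested fix for these exact files: split each line yourself with a cap on the number of splits, so everything after the fixed columns stays together as one description field. The helper name split_fixed_fields and the n_fixed parameter are hypothetical, and the sample rows are made up for illustration; n_fixed has to match the number of fixed columns listed in your file's header.

```python
def split_fixed_fields(lines, n_fixed, comment="#"):
    """Split whitespace-delimited lines into n_fixed fields plus one free-text field.

    Hypothetical helper: n_fixed is the number of leading columns that never
    contain spaces; the remainder of each line is kept as a single last field.
    """
    rows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith(comment):
            continue
        # At most n_fixed splits, so the result has at most n_fixed + 1 parts.
        parts = line.split(None, n_fixed)
        # Pad lines that have no trailing description so every row has equal width.
        parts += [""] * (n_fixed + 1 - len(parts))
        rows.append(parts)
    return rows


# Made-up sample data: three fixed columns followed by a free-text description.
sample = [
    "# comment line",
    "a1 b1 c1 first description with many words",
    "a2 b2 c2 short",
    "a3 b3 c3",
]
rows = split_fixed_fields(sample, n_fixed=3)
```

Because every row now has the same width, the list can be handed to pandas.DataFrame(rows) without the column-count error. For a real file, pass the open file handle as lines instead of the sample list.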
Upvotes: 1