bli

Reputation: 8194

usecols in pandas read_table results in "list index out of range"

I would like to select just 2 columns when parsing some data with pandas.

The help of pd.read_table mentions a usecols option that seems to be exactly what I want:

usecols : array-like, default None
    Return a subset of the columns. All elements in this array must either
    be positional (i.e. integer indices into the document columns) or strings
    that correspond to column names provided either by the user in `names` or
    inferred from the document header row(s). For example, a valid `usecols`
    parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
    results in much faster parsing time and lower memory usage.

My data, once read, appear to have columns numbered from 0 to 6:

In [338]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None)[:3]
Out[338]: 
                0      1      2  4  5    6
3                                         
WBGene00022277  I   4118  10230  -  .   83
WBGene00022276  I  10412  16842  +  .  230
WBGene00022278  I  17482  26781  -  .  303

But when I try to keep only the index (column 3) and the last one (column 6), I get the following error:

In [339]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None, usecols=(3, 6))[:3]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-339-279bef505f16> in <module>()
----> 1 pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None, usecols=(3, 6))[:3]

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    644                     delim_whitespace=delim_whitespace,
    645                     as_recarray=as_recarray,
--> 646                     warn_bad_lines=warn_bad_lines,
    647                     error_bad_lines=error_bad_lines,
    648                     low_memory=low_memory,

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    387         kwds['encoding'] = encoding
    388 
--> 389     compression = kwds.get('compression')
    390     compression = _infer_compression(filepath_or_buffer, compression)
    391     filepath_or_buffer, _, compression = get_filepath_or_buffer(

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    728 
    729                 if dialect_val != provided:
--> 730                     conflict_msgs.append((
    731                         "Conflicting values for '{param}': '{val}' was "
    732                         "provided, but the dialect specifies '{diaval}'. "

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    921         for arg in _deprecated_args:
    922             parser_default = _c_parser_defaults[arg]
--> 923             msg = ("The '{arg}' argument has been deprecated "
    924                    "and will be removed in a future version."
    925                    .format(arg=arg))

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1445                 cast_type = dtypes
   1446 
-> 1447             if self.na_filter:
   1448                 col_na_values, col_na_fvalues = _get_na_values(
   1449                     c, na_values, na_fvalues)

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _clean_index_names(columns, index_col)
   2812                 msg = ('Expected %d fields in line %d, saw %d' %
   2813                        (col_len, row_num + 1, actual_len))
-> 2814                 if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
   2815                     # see gh-13374
   2816                     reason = ('Error could possibly be due to quotes being '

IndexError: list index out of range

I had successfully used the usecols option in another case, but with some headers kept from the original file.

What is causing the problem here?

Edit: header=None is apparently not the problem

I can parse a differently-formatted file, without keeping headers, and the usecols option works:

In [361]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/feature_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", skiprows=2, index_col=0, header=None, usecols=[0, 6])[:3]
Out[361]: 
                  6
0                  
WBGene00022277   72
WBGene00022276  222
WBGene00022278  302

Upvotes: 2

Views: 4069

Answers (1)

Jan Zeiseweis

Reputation: 3738

It looks like it has to do with index_col.

Try setting the index after reading the file:

import pandas as pd

path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
# Read only columns 3 and 6, then make column 3 the index
df = pd.read_table(path, header=None, usecols=(3, 6)).set_index(3)[:3]

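Apparently index_col is applied after the columns have been reduced by usecols. You're selecting two columns and then trying to use position 3 (which would be a fourth column) as the index, so the lookup runs past the end of the list. Alternatively, give index_col as a position within the selected columns: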

path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
df = pd.read_table(path, header=None, usecols=(3, 6), index_col=0)[:3]
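
For anyone who wants to try this without the original file, here is a minimal sketch of the first approach. It assumes an in-memory, tab-separated stand-in for the data (the three rows shown in the question) and uses pd.read_csv with sep="\t" in place of pd.read_table; the names data and counts are only illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tab-separated file in the question
# (7 columns, no header row); the rows are the three shown there.
data = (
    "I\t4118\t10230\tWBGene00022277\t-\t.\t83\n"
    "I\t10412\t16842\tWBGene00022276\t+\t.\t230\n"
    "I\t17482\t26781\tWBGene00022278\t-\t.\t303\n"
)

# Read only columns 3 and 6. With header=None the kept columns
# are labelled by their original positions, 3 and 6.
df = pd.read_csv(io.StringIO(data), sep="\t", header=None, usecols=[3, 6])

# Set the index afterwards, by label, so nothing depends on how
# index_col is resolved relative to usecols.
counts = df.set_index(3)
print(counts)
```

Setting the index after the read refers to the column by its label (3) rather than by a position, so it should behave the same whether or not index_col is interpreted relative to the reduced column list.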

Upvotes: 1
