I would like to select just 2 columns when parsing some data with pandas.
The help of pd.read_table mentions a usecols option that seems to be exactly what I want:
usecols : array-like, default None
    Return a subset of the columns. All elements in this array must either
    be positional (i.e. integer indices into the document columns) or strings
    that correspond to column names provided either by the user in `names` or
    inferred from the document header row(s). For example, a valid `usecols`
    parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
    results in much faster parsing time and lower memory usage.
My data, once read, appear to have columns numbered from 0 to 6:
In [338]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None)[:3]
Out[338]:
                0      1      2  4  5    6
3
WBGene00022277  I   4118  10230  -  .   83
WBGene00022276  I  10412  16842  +  .  230
WBGene00022278  I  17482  26781  -  .  303
But when I try to keep only the index (column 3) and the last one (column 6), I get the following error:
In [339]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None, usecols=(3, 6))[:3]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-339-279bef505f16> in <module>()
----> 1 pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None, usecols=(3, 6))[:3]
/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
644 delim_whitespace=delim_whitespace,
645 as_recarray=as_recarray,
--> 646 warn_bad_lines=warn_bad_lines,
647 error_bad_lines=error_bad_lines,
648 low_memory=low_memory,
/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
387 kwds['encoding'] = encoding
388
--> 389 compression = kwds.get('compression')
390 compression = _infer_compression(filepath_or_buffer, compression)
391 filepath_or_buffer, _, compression = get_filepath_or_buffer(
/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
728
729 if dialect_val != provided:
--> 730 conflict_msgs.append((
731 "Conflicting values for '{param}': '{val}' was "
732 "provided, but the dialect specifies '{diaval}'. "
/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
921 for arg in _deprecated_args:
922 parser_default = _c_parser_defaults[arg]
--> 923 msg = ("The '{arg}' argument has been deprecated "
924 "and will be removed in a future version."
925 .format(arg=arg))
/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1445 cast_type = dtypes
1446
-> 1447 if self.na_filter:
1448 col_na_values, col_na_fvalues = _get_na_values(
1449 c, na_values, na_fvalues)
/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _clean_index_names(columns, index_col)
2812 msg = ('Expected %d fields in line %d, saw %d' %
2813 (col_len, row_num + 1, actual_len))
-> 2814 if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
2815 # see gh-13374
2816 reason = ('Error could possibly be due to quotes being '
IndexError: list index out of range
I had successfully used the usecols option in another case, but with some headers kept from the original file.
What is causing the problem here?
header=None is apparently not the problem: I can parse a differently formatted file, without keeping headers, and the usecols option works:
In [361]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/feature_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", skiprows=2, index_col=0, header=None, usecols=[0, 6])[:3]
Out[361]:
                  6
0
WBGene00022277   72
WBGene00022276  222
WBGene00022278  302
Upvotes: 2
It looks like it has to do with index_col.
Try setting the index after reading the file:
path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
df = pd.read_table(path, header=None, usecols=(3, 6)).set_index(3)[:3]
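To try the set_index approach without the original file, here is a minimal sketch on an in-memory sample. The three rows are rebuilt from the output shown in the question, and the file is assumed to be tab-separated (the default for pd.read_table). The variable names sample, df and counts are only illustrative.

```python
import io

import pandas as pd

# Three rows rebuilt from the question's output, assumed tab-separated.
# Column 3 holds the gene ID and column 6 the count.
sample = (
    "I\t4118\t10230\tWBGene00022277\t-\t.\t83\n"
    "I\t10412\t16842\tWBGene00022276\t+\t.\t230\n"
    "I\t17482\t26781\tWBGene00022278\t-\t.\t303\n"
)

# Read only columns 3 and 6. With header=None they keep the labels 3 and 6.
df = pd.read_table(io.StringIO(sample), header=None, usecols=[3, 6])

# Set the index by label after parsing, so no positional index_col is needed.
counts = df.set_index(3)
print(counts)
```

Because set_index works on the column label rather than on a position, it does not depend on how the parser counts columns once usecols has been applied.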
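Apparently index_col is applied after the columns have been reduced. You're selecting two columns and then trying to use position 3 as the index, which no longer exists among the two that remain. Giving the index's position within the reduced selection should work: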
path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
df = pd.read_table(path, header=None, usecols=(3, 6), index_col=0)[:3]
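Another way to sidestep the positional ambiguity is to name the columns and refer to them by name in both usecols and index_col. This is only a sketch: the column names below (chrom, start, end, gene, strand, score, count) are hypothetical labels chosen for illustration and do not come from the file, and the sample rows are rebuilt from the question's output, assumed tab-separated.

```python
import io

import pandas as pd

# Three rows rebuilt from the question's output, assumed tab-separated.
sample = (
    "I\t4118\t10230\tWBGene00022277\t-\t.\t83\n"
    "I\t10412\t16842\tWBGene00022276\t+\t.\t230\n"
    "I\t17482\t26781\tWBGene00022278\t-\t.\t303\n"
)

# Hypothetical column names, one per column in the file.
col_names = ["chrom", "start", "end", "gene", "strand", "score", "count"]

# Select and index by name, so neither option depends on a column position.
counts = pd.read_table(
    io.StringIO(sample),
    header=None,
    names=col_names,
    usecols=["gene", "count"],
    index_col="gene",
)
print(counts)
```

With names, the selection no longer depends on whether index_col is interpreted before or after the columns are reduced.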
Upvotes: 1