Reputation: 1282
I have large .csv files in which ~50k rows containing 2,000 columns of data are followed by ~200k rows containing 6,000 columns.
I'm importing multiple files with:
pd.concat((pd.read_csv(f,index_col='Unnamed: 0', error_bad_lines=False) for f in file_list))
Without error_bad_lines=False this raises an error on the first 6,000-column row, since pandas expects only 2,000 columns; with error_bad_lines=False it instead skips over all of the 6,000-column rows.
I'm happy in this situation to just discard all of the 2,000-column rows if necessary. Is there some way to do this with pandas?
Upvotes: 1
Views: 1234
Reputation: 1004
As you've found, read_csv takes its intended width from the first row and treats only over-long rows as "bad"; rows with too few columns are padded with NA. Depending on the exact circumstances, there are a few solutions that may work for you:
Skip over the short lines.
If you know how many 2,000-column rows you will have, and especially if it's the same for each of your files, use skiprows to begin reading at the start of your 6,000-column section, as in the sketch below.
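For example, if each file begins with the same number of short rows, something like this would work; n_skip and the header handling are assumptions about your files:

import pandas as pd

n_skip = 50_000  # assumed count of 2,000-column rows (plus header) at the top of each file

# header=None because the only header row belongs to the 2,000-column section being skipped
df = pd.concat(
    pd.read_csv(f, header=None, index_col=0, skiprows=n_skip)
    for f in file_list
)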
Read everything and then drop the bad rows.
This requires a bit of effort since, as you've found, pandas uses the first line of the file to determine the number of expected fields; you'd need to add a header with 6,000 columns or pass names with the correct length to read_csv. This will result in your 2,000-column lines being populated with a large number of NA values, after which you can use dropna with an appropriate threshold to drop rows with more than 4,000 NA values. See the sketch below.
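A minimal sketch of that approach; the generated names (one index column plus 6,000 data columns) and the dropna threshold are assumptions to adjust to your actual files:

import pandas as pd

names = ['idx'] + [f'c{i}' for i in range(6000)]  # assumed width: 1 index + 6,000 data columns

df = pd.concat(
    # skiprows=1 skips each file's original 2,000-column header row
    pd.read_csv(f, header=None, names=names, index_col='idx', skiprows=1)
    for f in file_list
)

# short rows were padded with NA; any non-NA threshold between the two widths works
df = df.dropna(thresh=3000)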
Do some pre-processing on the files beforehand.
If this is a one-time situation where you only need to process this current set of files and won't encounter new ones in the future, and you have command-line access where the files reside, it's one line of awk to pull out lines of the desired length:
awk -F, 'NF==6000' file.csv > fixed_file.csv
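Once each file has been filtered this way, the fixed copies can be read without any special handling. A hypothetical follow-up, assuming a fixed_ copy of each file and no header row surviving the filter:

import pandas as pd

fixed_files = ['fixed_' + f for f in file_list]  # assumed naming from the awk step
df = pd.concat(
    pd.read_csv(f, header=None, index_col=0)
    for f in fixed_files
)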
Upvotes: 2