V. Déhaye

Reputation: 505

Pandas dataframe take raises indices out-of-bounds

I am using pandas.DataFrame.take to keep only certain rows of a dataframe (the ones whose value in one column matches a certain regex pattern).

To do so, I build a list of the indices to keep in a loop that checks whether each row matches the pattern:

import re

indices_to_keep = []
for index, row in combined_csv.iterrows():
    if re.match(regex_files_to_keep, row['commit_file']):
        indices_to_keep.append(index)

The index value is thus returned by pandas.DataFrame.iterrows.

My dataset is stored as a CSV file. It is too big to be read at once, so I am using the chunksize argument of pandas.read_csv.

Calling take on the first chunk works without any problem. However, from the second chunk on, it raises the following error:

IndexError: indices are out-of-bounds

I printed the list values and the indices of the first and last element of the data frame (using combined_csv.index[0] and combined_csv.index[-1]). All the values in the indices_to_keep list are within the boundaries defined by the indices of the first and last element of the data frame.

Why am I getting this error, then?

Upvotes: 0

Views: 736

Answers (1)

V. Déhaye

Reputation: 505

The answer is that the pandas.DataFrame.take method takes as its argument the positions of the rows to keep in the current dataframe, not their index labels. The confusion comes from the argument name, which is indices, but the documentation explicitly states:

An array of ints indicating which positions to take

Let me explain the difference with an example.

Say you have a chunksize of 40000. The first index of your data frame built from your second chunk will then be 40000. However, the position of this row is 0, and that's the position value that take is expecting.
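The difference can be seen on a small scale. The following is only a minimal sketch: the in-memory CSV, the file names, and the chunk size of 4 (instead of 40000) are made up for illustration, and it assumes the default integer index that read_csv produces.

```python
import io

import pandas as pd

# Hypothetical 8-row CSV, read in chunks of 4 rows instead of 40000.
csv_text = "commit_file\n" + "\n".join(f"file_{i}.py" for i in range(8)) + "\n"
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=4))
second = chunks[1]

# The index labels continue from the first chunk...
labels = list(second.index)

# ...but take works with positions, which restart at 0 in every chunk.
first_row_by_position = second.take([0])

# Passing the label 4 as a position fails: the chunk only has positions 0 to 3.
try:
    second.take([4])
    raised = False
except IndexError:
    raised = True
```

Here the second chunk carries the labels 4 to 7, while its valid positions are 0 to 3, which is why passing a label to take goes out of bounds.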

That's why you need to subtract the number of rows you have already gone through (chunk_size * (chunk_number - 1), with chunk_number starting at 1) from your indices. My corresponding line of code is:

indices_to_keep = [x - (chunk_size * (chunk_number - 1)) for x in indices_to_keep]

Now you have a list of the positions of the rows to keep, and you can use take as expected.
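Put together, the whole loop could look roughly like the sketch below. The sample data, the regex, and the chunk size of 4 are assumptions made for the example; only the names combined_csv, indices_to_keep, chunk_size and chunk_number come from the code above, and chunk_number is assumed to start at 1.

```python
import io
import re

import pandas as pd

# Hypothetical data: every third row is a Python file under src/.
csv_text = "commit_file\n" + "\n".join(
    f"src/a{i}.py" if i % 3 == 0 else f"docs/b{i}.md" for i in range(10)
) + "\n"
regex_files_to_keep = r"src/.*\.py$"
chunk_size = 4

kept_chunks = []
reader = pd.read_csv(io.StringIO(csv_text), chunksize=chunk_size)
for chunk_number, combined_csv in enumerate(reader, start=1):
    # iterrows yields index labels, which keep growing across chunks.
    indices_to_keep = [
        index
        for index, row in combined_csv.iterrows()
        if re.match(regex_files_to_keep, row["commit_file"])
    ]
    # Convert labels to positions inside the current chunk.
    offset = chunk_size * (chunk_number - 1)
    positions = [x - offset for x in indices_to_keep]
    kept_chunks.append(combined_csv.take(positions))

result = pd.concat(kept_chunks)
```

If the labels themselves are what you have, selecting by label with combined_csv.loc[indices_to_keep] should avoid the conversion altogether, since loc works with index labels rather than positions.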

Please let me know if the vocabulary (position and index) is not appropriate so that I can correct it. I am not a native English speaker and the meaning of these words is very important in this problem.

Upvotes: 1
