Mark Pampuch

Reputation: 33

Why do indexing and slicing seem to change the way the dataframe looks in pandas?

I'm trying to extract information from certain rows in this big dataframe.

When I use slicing to subset the table (e.g. blast_output_scored.iloc[10:11,:]), the output looks like this:

qseqid sseqid %_identity alignment_length mismatch gapopen qstart qend sstart send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1 1 100.0 1073 0 0 1 1073 7704 6632 0.0 1982.0 minus 10 5360.0

When I check the number of rows in this table, I get the correct number with slicing:

len(blast_output_scored.iloc[10:11,:].index)

#Output
1

However, when I use indexing (blast_output_scored.iloc[10,:]), the output looks completely different, even though the index refers to the same row as the slice.

sseqid                   1
%_identity           100.0
alignment_length      1073
mismatch                 0
gapopen                  0
qstart                   1
qend                  1073
sstart                7704
send                  6632
evalue                 0.0
bitscore            1982.0
subject_strand       minus
line_in_og_BLAST        10
Needle_score        5360.0
Name: IDgene.1, dtype: object

The number of rows in the table now seems to change to the number of columns minus the first column (the first column is also set as the index for selecting rows by name):

len(blast_output_scored.iloc[10,:].index)

#Output
14

My biggest problem is that I'm using the names in that first column (the row labels) to index, and I have to check which names subset the table to a length of 1, so I can't just use the slicing method to bypass this.

e.g. blast_output_scored.loc["IDgene.1"] outputs

sseqid                   1
%_identity           100.0
alignment_length      1073
mismatch                 0
gapopen                  0
qstart                   1
qend                  1073
sstart                7704
send                  6632
evalue                 0.0
bitscore            1982.0
subject_strand       minus
line_in_og_BLAST        10
Needle_score        5360.0
Name: IDgene.1, dtype: object

and will say I have 14 rows when I should only have 1.

Is there any way to ensure the output looks like the slicing output in pandas?

Upvotes: 1

Views: 885

Answers (2)

BeRT2me

Reputation: 13242

I believe this is why df.squeeze() is a thing. That way you can easily force the result to a Series and design your program to always expect a Series.

Example:

df.iloc[0,:].squeeze()
df.iloc[0:1,:].squeeze()

# Both output:

sseqid                   1
%_identity           100.0
alignment_length      1073
mismatch                 0
gapopen                  0
qstart                   1
qend                  1073
sstart                7704
send                  6632
evalue                 0.0
bitscore            1982.0
subject_strand       minus
line_in_og_BLAST        10
Needle_score        5360.0
Name: IDgene.1, dtype: object

If we want a dataframe, we can force that as well, but it's a bit more complicated:

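import pandas as pd

x = df.iloc[0, :].squeeze()    # already a Series; squeeze() leaves it unchanged
y = df.iloc[0:1, :].squeeze()  # one-row DataFrame squeezed to a Series
for d in [x, y]:
    # DataFrame(d) makes a single column; .T turns it into a single row
    print(pd.DataFrame(d).T)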

# output:

         sseqid %_identity alignment_length mismatch gapopen qstart  qend sstart  send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1      1      100.0             1073        0       0      1  1073   7704  6632    0.0   1982.0          minus               10       5360.0
         sseqid %_identity alignment_length mismatch gapopen qstart  qend sstart  send evalue bitscore subject_strand line_in_og_BLAST Needle_score
IDgene.1      1      100.0             1073        0       0      1  1073   7704  6632    0.0   1982.0          minus               10       5360.0
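
For anyone who wants to run this without the original data, here is a minimal sketch of both conversions. The two-row frame, its values and the second label `IDgene.2` are made up for illustration and are not the asker's data. `Series.to_frame().T` is used as an alternative to `pd.DataFrame(d).T`.

```python
import pandas as pd

# Hypothetical miniature of the asker's table; values are illustrative only.
df = pd.DataFrame(
    {
        "sseqid": [1, 2],
        "bitscore": [1982.0, 1500.0],
        "subject_strand": ["minus", "plus"],
    },
    index=pd.Index(["IDgene.1", "IDgene.2"], name="qseqid"),
)

row_by_index = df.iloc[0, :]    # already a Series
row_by_slice = df.iloc[0:1, :]  # a one-row DataFrame

# squeeze() collapses the length-1 row axis, so both end up as a Series.
s1 = row_by_index.squeeze()
s2 = row_by_slice.squeeze()

# Going the other way: turn a Series back into a one-row DataFrame.
one_row = s1.to_frame().T
print(one_row)
```

Note that a row pulled out of a frame with mixed column types is stored with dtype object, so the rebuilt one-row frame will not necessarily keep the original column dtypes.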

Upvotes: 1

creanion

Reputation: 2743

When you slice using an interval, you get a DataFrame back because the result (formally, at least) has multiple rows and multiple columns.

Take a look at

type(blast_output_scored.iloc[10:11,:])

It's a pandas.DataFrame.

Now let's look at:

type(blast_output_scored.iloc[10,:])

It's a pandas.Series.
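
The same comparison can be tried on a small stand-in frame. This is only a sketch: the frame below is hypothetical and its values are made up, but the types and shapes follow the same rule as in the question.

```python
import pandas as pd

# Hypothetical stand-in for blast_output_scored; the values are made up.
df = pd.DataFrame(
    {"sseqid": [1, 2, 3], "mismatch": [0, 4, 7]},
    index=["IDgene.1", "IDgene.2", "IDgene.3"],
)

sliced = df.iloc[1:2, :]  # slice of positions -> DataFrame (2-D)
indexed = df.iloc[1, :]   # single position   -> Series (1-D)

print(type(sliced).__name__, sliced.shape)    # DataFrame (1, 2)
print(type(indexed).__name__, indexed.shape)  # Series (2,)

# len(obj.index) counts entries along the first axis of whatever you hold:
# rows for the DataFrame, column labels for the Series.
print(len(sliced.index))   # 1
print(len(indexed.index))  # 2
```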

A DataFrame and a Series are displayed quite differently in a notebook. The two objects are closely related but not the same thing, so the different display is a useful reminder of which one you are holding.

When indexing with 10, you get the single row at position 10 as a Series. It's a one-dimensional data structure that has an index and a sequence of values. Here the index of that Series is made up of the original column names, which is why len() reports 14 instead of 1.

Since it has an index, it can work very similarly to a DataFrame but with fewer degrees of freedom. Since it has one index and one value per unit of length, it's also vaguely similar to a dictionary or a mapping if you squint: keys (index) and values.
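
To make the dictionary analogy concrete, here is a small sketch. The row is built by hand and only borrows a few of the column names from the question; it is not the asker's actual data.

```python
import pandas as pd

# Hypothetical row; labels mirror a few of the asker's columns.
row = pd.Series(
    {"sseqid": 1, "bitscore": 1982.0, "subject_strand": "minus"},
    name="IDgene.1",
)

# Label lookup works like a dictionary key lookup.
strand = row["subject_strand"]

# The index plays the role of the keys, and `in` tests against it.
keys = list(row.index)
has_bitscore = "bitscore" in row

# A Series can be converted to a plain dict when that is more convenient.
as_dict = row.to_dict()
print(strand, keys, has_bitscore, as_dict)
```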


There are exceptions and maybe you'd be happier if you didn't know about them.

If the index of the original dataframe is non-unique and you index using .loc[], several rows might share the same index label 10(!). In that case, indexing with 10 gives you a DataFrame instead of a Series, since the result suddenly has two dimensions: multiple rows and multiple columns.
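
A sketch of that exception, and of one way to get a consistent result type when checking how many rows a label matches. The frame and its repeated label are made up for illustration. Passing the label inside a list to .loc[] returns a DataFrame whether the label matches one row or several, which may help with the original problem of finding labels that match exactly one row.

```python
import pandas as pd

# Hypothetical frame with a repeated row label; the data are made up.
df = pd.DataFrame(
    {"sseqid": [1, 2, 3], "bitscore": [1982.0, 1500.0, 900.0]},
    index=["IDgene.1", "IDgene.2", "IDgene.2"],
)

unique_hit = df.loc["IDgene.1"]    # label occurs once  -> Series
repeated_hit = df.loc["IDgene.2"]  # label occurs twice -> DataFrame

# Wrapping the label in a list returns a DataFrame in both cases,
# so len() consistently counts matching rows.
counts = {label: len(df.loc[[label]]) for label in df.index.unique()}
print(counts)

# Labels that match exactly one row, without building any sub-frames.
single_labels = df.index[~df.index.duplicated(keep=False)].tolist()
print(single_labels)
```

If all you need is the count per label, `df.index.value_counts()` is another option that avoids selecting rows at all.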

Upvotes: 2
