ryszard eggink
ryszard eggink

Reputation: 335

Finding the mean for cells, which satisfy a condition

We have cells in format 'aa-bb-cccccccc/yyyy/[email protected]', where xx are the captions in format (A0-Z9) and yyyy are the numbers in format (0001-9999). We want to find the mean of yyyy for cells, which have xx='X1'. The table is huge, so iterating over each cell seems problematic. Can I use here somehow function filter(like='X1'), I have troubles finding a way to extract yyyy where this condition is satisfied.

import pandas as pd
df = pd.read_csv("dane.csv", sep=';')
shape = list(df.shape)
for i in range(0,shape[1]):
    if df.iloc[:,i].filter(like='X1')

Example of input, the first row are column names.

1: Brandenburg (Post-Panamax)               2: Acheron (Feeder)                        
ES-NL-10633096/1938/[email protected]/6749   DE-JP-20438082/2066/[email protected]/68849
BE-BR-61613986/3551/[email protected]/39927         NL-LK-02275406/2136/[email protected]/73198
PH-SA-39552610/2436/[email protected]/80578
PA-AE-59814691/4881/[email protected]/25247  OM-PH-31303222/3671/[email protected]/52408

Upvotes: 0

Views: 34

Answers (1)

Valdi_Bo
Valdi_Bo

Reputation: 30991

As I understand, you want to compute one mean value, for all cells meeting the criterion, regardless of the column where it is.

Start from import re.

Then compile the regex:

pat = re.compile(r'(?:[A-Z]{2}-){2}\d+/(?P<n>\d+)/X1@[A-Z]+\.[A-Z]{2}',
    flags=re.I)

And the result can be computed with the following one-liner:

pd.Series(df.values.reshape(-1, 1).squeeze()).str.extract(pat)\
    .dropna().astype(int).n.mean()

For your sample data the result is 3409.5.

To get acquainted with the way the computation is done, execute each step as a separate instruction and watch the results.

Upvotes: 1

Related Questions