Finding the mean for cells, which satisfy a condition

Question

We have cells in format 'aa-bb-cccccccc/yyyy/xx@ddddddddd.ee', where xx are the captions in format (A0-Z9) and yyyy are the numbers in format (0001-9999). We want to find the mean of yyyy for cells, which have xx='X1'. The table is huge, so iterating over each cell seems problematic. Can I use here somehow function filter(like='X1'), I have troubles finding a way to extract yyyy where this condition is satisfied.

import pandas as pd
df = pd.read_csv("dane.csv", sep=';')
shape = list(df.shape)
for i in range(0,shape[1]):
    if df.iloc[:,i].filter(like='X1')

Example of input, the first row are column names.

1: Brandenburg (Post-Panamax)               2: Acheron (Feeder)                        
ES-NL-10633096/1938/X1@hkzydbezon.dk/6749   DE-JP-20438082/2066/A2@qwinfhcaer.cu/68849
BE-BR-61613986/3551/B1@oqk.bf/39927         NL-LK-02275406/2136/A1@ozmmfdpfts.de/73198
PH-SA-39552610/2436/A1@venagi.hr/80578
PA-AE-59814691/4881/X1@zhicvzvksl.cl/25247  OM-PH-31303222/3671/Z1@jtqy.ml/52408

Valdi_Bo · Accepted Answer

As I understand, you want to compute one mean value, for all cells meeting the criterion, regardless of the column where it is.

Start from import re.

Then compile the regex:

pat = re.compile(r'(?:[A-Z]{2}-){2}\d+/(?P\d+)/X1@[A-Z]+\.[A-Z]{2}',
    flags=re.I)

And the result can be computed with the following one-liner:

pd.Series(df.values.reshape(-1, 1).squeeze()).str.extract(pat)\
    .dropna().astype(int).n.mean()

For your sample data the result is 3409.5.

To get acquainted with the way the computation is done, execute each step as a separate instruction and watch the results.

Finding the mean for cells, which satisfy a condition

Answers (1)

Related Questions