Chris Linke
Chris Linke

Reputation: 31

Matching values in two columns of inconsistent length

I am trying to retrive the results to a lab from a provided excel. Unfortunately, the excel data is structured as such:

df = pd.DataFrame({'Tests':['a\nb\nc', 'b\nc\na'], 'Results':['1.2\n3.4\n4', '2.3\n1\n1']})
df
   Tests       Results
0 a\nb\nc  1.2\n3.4\n4
1 b\nc\na  2.3\n1\n1

I want to be able to return the result that matches test 'a':

    Tests      Results   a result
0 a\nb\nc  1.2\n3.4\n4        1.2
1 b\nc\na    2.3\n1\n1          1

I know I can separate the different lines in the test and/or result column using

tests = df['Tests'].str.split('\n', expand=True)

When I try to apply that into a for loop (I hate for-loops too but can't think of a better method for this) I am not successful:

for row in df:
tests = df['Tests'].str.split('\n', expand=True)
if tests[0] == 'a':
    df['result'] = tests[0]
    #...and on through each possible row of col of tests expanded

I'm confident a better way exists, any direction is greatly appreciated.

Upvotes: 0

Views: 66

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35686

We can try concat after expanding all columns in df then do some reshaping with pivot (assuming each test value appears at most once per cell) to create a reasonable lookup table:

lookup_df = (
    pd.concat(
        [df[col].str.split('\n', expand=True) for col in df],
        keys=df.columns,
        axis=1
    )
        .rename_axis('index')
        .stack()
        .reset_index()
        .pivot(index='index', columns='Tests', values='Results')
        .add_prefix('results_')  # Optional Column Prefix
        .astype('float64')  # If wanting Numeric values
)
Tests results_a results_b results_c
index                              
0           1.2       3.4         4
1             1       2.3         1

This can be join back to df:

df = df.join(lookup_df)

df:

     Tests      Results results_a results_b results_c
0  a\nb\nc  1.2\n3.4\n4       1.2       3.4         4
1  b\nc\na    2.3\n1\n1         1       2.3         1

Alternatively, assuming the expanded tests and results are guaranteed to have the same dimensions when split and expanded. And we're only looking for one of the test results we can use numpy indexing to find the Results and create the column:

tests = df['Tests'].str.split('\n', expand=True)
results = df['Results'].str.split('\n', expand=True)
df['results_a'] = results.to_numpy()[tests.eq('a').to_numpy()]

df:

     Tests      Results results_a
0  a\nb\nc  1.2\n3.4\n4       1.2
1  b\nc\na    2.3\n1\n1         1

Upvotes: 1

Related Questions