Coldchain9

Reputation: 1745

Fastest way to iterate over a Pandas Dataframe while concatenating values from multiple columns

I am wondering if there is a more performant way to iterate through a pandas dataframe and concatenate values in different columns.

For example, the code below works:

import pandas as pd
from pathlib import Path

data = {'subdir': ['tom', 'phil', 'ava'],
        'filename':['9.wav', '8.wav', '7.wav'],
        'text':['Pizza','Strawberries and yogurt', 'potato']}

df = pd.DataFrame(data, columns = ['subdir', 'filename', 'text'])

df.head()

example_path = Path(r"C:\Hello\World")
for index, row in df.iterrows():
    full_path = example_path.joinpath(row['subdir'], row['filename'])
    print(full_path)
    text = row['text']
    print(text)

Output:

C:\Hello\World\tom\9.wav
Pizza
C:\Hello\World\phil\8.wav
Strawberries and yogurt
C:\Hello\World\ava\7.wav
potato

However, I have a large number of rows and I would like to do this as fast as possible. What is the best way to do this? I am taking pieces of a path (the subdirectory and the base file name) and concatenating them as I iterate through the dataframe.

I will also likely grab data from adjacent columns (like 'text' in the example) as I iterate, so I'd like to do this all in one pass. Once I have gathered the data in list- or Series-like structures, I will output a dictionary or dataframe object.

Thank you.

Upvotes: 0

Views: 326

Answers (2)

Phillyclause89

Reputation: 680

You can always make a path column in your df using the .apply method:

import pandas as pd
import pathlib

data = {'subdir': ['tom', 'phil', 'ava'],
        'filename':['9.wav', '8.wav', '7.wav'],
        'text':['Pizza','Strawberries and yogurt', 'potato']}

df = pd.DataFrame(data, columns = ['subdir', 'filename', 'text'])



df["path"] = df[['subdir','filename']].apply(
    lambda x:pathlib.Path(
        r"C:\Hello\World\{}\{}".format(
            x['subdir'],x['filename']
        )
    ),
    axis=1
)

print(df[['path','text']])

Out:

                        path                     text
0   C:\Hello\World\tom\9.wav                    Pizza
1  C:\Hello\World\phil\8.wav  Strawberries and yogurt
2   C:\Hello\World\ava\7.wav                   potato
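
As a possible alternative, here is a minimal sketch that builds the same column with zip and a list comprehension. This avoids constructing a row Series on every iteration, which is what iterrows() and apply(axis=1) do, so it is often faster on large frames, though that should be timed on real data. PureWindowsPath and the records name are my own assumptions, used so the output is the same on any OS; the question uses Path.

```python
import pandas as pd
from pathlib import PureWindowsPath

df = pd.DataFrame({
    'subdir': ['tom', 'phil', 'ava'],
    'filename': ['9.wav', '8.wav', '7.wav'],
    'text': ['Pizza', 'Strawberries and yogurt', 'potato'],
})

# Assumption: PureWindowsPath keeps backslash separators on every platform.
example_path = PureWindowsPath(r"C:\Hello\World")

# Walk the two columns in lockstep; no per-row Series is created.
df['path'] = [
    example_path.joinpath(subdir, filename)
    for subdir, filename in zip(df['subdir'], df['filename'])
]

# Gather the path and the adjacent 'text' value in the same style of pass.
# 'records' is a hypothetical name for the collected result.
records = {str(path): text for path, text in zip(df['path'], df['text'])}
```

The records dictionary maps each full path string to its text, which can then be turned into a dataframe or used directly.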

Upvotes: 1

Quang Hoang

Reputation: 150785

Since you are using Path, you can just do:

 example_path/df.filename

Output (my system is Linux):

0    C:\Hello\World/9.wav
1    C:\Hello\World/8.wav
2    C:\Hello\World/7.wav
Name: filename, dtype: object

Note that string operations are usually not vectorized. The code above may well be just a wrapper around a Python for loop.
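
The snippet above joins only the filename, so the subdir from the question is missing from the result. A minimal sketch that includes both columns using plain string concatenation on the Series is shown below. The hard-coded backslash separator and the base and sep names are assumptions made for illustration, and the result holds plain strings rather than Path objects.

```python
import pandas as pd

df = pd.DataFrame({
    'subdir': ['tom', 'phil', 'ava'],
    'filename': ['9.wav', '8.wav', '7.wav'],
    'text': ['Pizza', 'Strawberries and yogurt', 'potato'],
})

# Assumption: the base directory and separator are fixed Windows-style strings.
base = r"C:\Hello\World"
sep = "\\"

# Element-wise concatenation: each scalar is combined with every row.
df['path'] = base + sep + df['subdir'] + sep + df['filename']

print(df[['path', 'text']])
```

If Path objects are needed later, each string can be wrapped individually at the point of use.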

Upvotes: 1
