I have a DataFrame with the column of file paths. I want to change it to only the file name. My DataFrame looks like: df = pd.DataFrame({ 'Sr No': [18, 19, 20], 'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'], 'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv'] }) Sr No Email filename 18 Test@test.com C:/Users\Test.csv 19 Test@test.com C:/Users\Test1.csv 20 Test@test.com C:/Users\Test1.csv filename should be only Test and Test1 Just need to write Test@test.com at twice i.e. once for Test.csv and another for Test1.csv. In short, my output should look like: df = pd.DataFrame({ 'Sr No': [18, 19, 20], 'Email': ['Test@test.com', 'Test@test.com', 'Test@test.com'], 'filename': ['Test', 'Test1', 'Test1'] }) Sr No Email filename 18 Test@test.com Test 19 Test@test.com Test1 20 Test@test.com Test1 I want to do it using python and pandas DataFrame. I have 100 of rows in the 'filename' column. I tried using: import os import glob myfile = os.path.basename('C:/Users/Test.csv') os.path.splitext(myfile) print(os.path.splitext(myfile)[0]) But it is only useful for one path, how to apply it to entire column?

Reputation: 312

How to get the base file name from a column of paths

I have a DataFrame with the column of file paths.

I want to change it to only the file name.

My DataFrame looks like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv',
                 r'C:/Users\Test1.csv']
})

Sr No	Email	filename
18	[email protected]	C:/Users\Test.csv
19	[email protected]	C:/Users\Test1.csv
20	[email protected]	C:/Users\Test1.csv

filename should be only Test and Test1
Just need to write [email protected] at twice i.e. once for Test.csv and another for Test1.csv.

In short, my output should look like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': ['Test', 'Test1', 'Test1']
})

Sr No	Email	filename
18	[email protected]	Test
19	[email protected]	Test1
20	[email protected]	Test1

I want to do it using python and pandas DataFrame.

I have 100 of rows in the 'filename' column.

I tried using:

import os

import glob

myfile = os.path.basename('C:/Users/Test.csv')
os.path.splitext(myfile)
print(os.path.splitext(myfile)[0])

But it is only useful for one path, how to apply it to entire column?

Upvotes: 4

Answers (5)

Pavithran Ramachandran

Reputation: 993

Use pandas.Series.apply to iterate through the column, and assign the result to new column.

df["filename"] = df["filename"].apply(os.path.basename)

df["filename"] = df["filename"].apply(lambda path: os.path.basename(path))

Example:

>>> df
   Sr No          Email            filename
0     18  [email protected]   C:/Users\Test.csv
1     19  [email protected]  C:/Users\Test1.csv
2     20  [email protected]  C:/Users\Test1.csv

>>> df["filename"] = df["filename"].apply(os.path.basename)
>>> df
   Sr No          Email   filename
0     18  [email protected]   Test.csv
1     19  [email protected]  Test1.csv
2     20  [email protected]  Test1.csv

There is also an option using Path('C:/Users\Test.csv').name from the pathlib module, but this is slower than os.path.basename because pathlib converts the string to a pathlib object.

Providing the slash prior to the file name is consistent, the fastest option is with pandas.Series.str.split (e.g. df['filename'].str.split('\\', expand=True).iloc[:, -1]).

Tested in python 3.11.2 and pandas 2.0.0

`%timeit` testing

import pandas as pd
import os
from pathlib import Path

# sample dataframe with 30000 rows
df = pd.DataFrame({'Sr No': [18, 19, 20],
                   'Email': ['[email protected]', '[email protected]', '[email protected]'],
                   'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv']})
df = pd.concat([df] * 10000, ignore_index=True)

# timeit tests
%timeit df["filename"].apply(lambda path: Path(path).name)
%timeit df["filename"].apply(os.path.basename)
%timeit df["filename"].apply(lambda path: os.path.basename(path))
%timeit df['filename'].str.split('\\', expand=True).iloc[:, -1]

result

67.4 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 9

pgg08927

Reputation: 175

Assuming you have fixed length of directory in filename column as given:

# created dataframe for example
df = pd.DataFrame({'Email':['[email protected]','[email protected]','[email protected]'],
                   'filename':['c:/users\test.csv','c:/users\test1.csv','c:/users\test1.csv']} )   # dataframe

# will create new column with file name only
df['only_filename'] = [(path.encode('unicode_escape')[9:]).decode("utf-8") for path in df['filename']]

Upvotes: 0

GhandiFloss

Reputation: 384

In addition to the above answers you could also use the string methods:

df['filename'] = df['filename'].str.split('/')[-1]

Not sure which is fastest.

Upvotes: -1

Shrayani Mondal

Reputation: 170

Read the excel file into a pandas dataframe using

import pandas as pd
df = pd.read_excel("your excel file location")

Then use the apply function to perform one operation on the entire column as follows

def get_filename(path):
    temp_str = path.split('/')
    return temp_str[-1]

df["filename"] = df["filename"].apply(get_filename)

Upvotes: 0

Celius Stingher

Reputation: 18367

You would need to modify the existing column by redifining it. First read it with pandas:

import pandas as pd
df = pd.read_csv('file_path\file_name.csv')
df['filename'] = df['filename'].map(lambda x: x.split('\\')[-1][:-4])
df = df.drop_duplicates()

This yields the expect result as a dataframe, so all you are missing is saving it back to csv/excel:

df.to_excel('file_path\new_file_name.xlsx')

Or csv:

df.to_csv('file_path\new_file_name.csv')

Upvotes: 0

How to get the base file name from a column of paths

Answers (5)

%timeit testing

result

Related Questions

`%timeit` testing