ADS KUL
ADS KUL

Reputation: 312

How to get the base file name from a column of paths

I have a DataFrame with the column of file paths.

I want to change it to only the file name.

My DataFrame looks like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv',
                 r'C:/Users\Test1.csv']
})
Sr No Email filename
18 [email protected] C:/Users\Test.csv
19 [email protected] C:/Users\Test1.csv
20 [email protected] C:/Users\Test1.csv
  1. filename should be only Test and Test1
  2. Just need to write [email protected] at twice i.e. once for Test.csv and another for Test1.csv.

In short, my output should look like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': ['Test', 'Test1', 'Test1']
})
Sr No Email filename
18 [email protected] Test
19 [email protected] Test1
20 [email protected] Test1

I want to do it using python and pandas DataFrame.

I have 100 of rows in the 'filename' column.

I tried using:

import os

import glob

myfile = os.path.basename('C:/Users/Test.csv')
os.path.splitext(myfile)
print(os.path.splitext(myfile)[0])

But it is only useful for one path, how to apply it to entire column?

Upvotes: 4

Views: 4303

Answers (5)

Pavithran Ramachandran
Pavithran Ramachandran

Reputation: 993

Use pandas.Series.apply to iterate through the column, and assign the result to new column.

df["filename"] = df["filename"].apply(os.path.basename)

or

df["filename"] = df["filename"].apply(lambda path: os.path.basename(path))

Example:

>>> df
   Sr No          Email            filename
0     18  [email protected]   C:/Users\Test.csv
1     19  [email protected]  C:/Users\Test1.csv
2     20  [email protected]  C:/Users\Test1.csv

>>> df["filename"] = df["filename"].apply(os.path.basename)
>>> df
   Sr No          Email   filename
0     18  [email protected]   Test.csv
1     19  [email protected]  Test1.csv
2     20  [email protected]  Test1.csv

There is also an option using Path('C:/Users\Test.csv').name from the pathlib module, but this is slower than os.path.basename because pathlib converts the string to a pathlib object.

Providing the slash prior to the file name is consistent, the fastest option is with pandas.Series.str.split (e.g. df['filename'].str.split('\\', expand=True).iloc[:, -1]).

Tested in python 3.11.2 and pandas 2.0.0

%timeit testing

import pandas as pd
import os
from pathlib import Path

# sample dataframe with 30000 rows
df = pd.DataFrame({'Sr No': [18, 19, 20],
                   'Email': ['[email protected]', '[email protected]', '[email protected]'],
                   'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv']})
df = pd.concat([df] * 10000, ignore_index=True)

# timeit tests
%timeit df["filename"].apply(lambda path: Path(path).name)
%timeit df["filename"].apply(os.path.basename)
%timeit df["filename"].apply(lambda path: os.path.basename(path))
%timeit df['filename'].str.split('\\', expand=True).iloc[:, -1]

result

67.4 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 9

pgg08927
pgg08927

Reputation: 175

Assuming you have fixed length of directory in filename column as given:

# created dataframe for example
df = pd.DataFrame({'Email':['[email protected]','[email protected]','[email protected]'],
                   'filename':['c:/users\test.csv','c:/users\test1.csv','c:/users\test1.csv']} )   # dataframe

# will create new column with file name only
df['only_filename'] = [(path.encode('unicode_escape')[9:]).decode("utf-8") for path in df['filename']]
 

Upvotes: 0

GhandiFloss
GhandiFloss

Reputation: 384

In addition to the above answers you could also use the string methods:

df['filename'] = df['filename'].str.split('/')[-1]

Not sure which is fastest.

Upvotes: -1

Shrayani Mondal
Shrayani Mondal

Reputation: 170

Read the excel file into a pandas dataframe using

import pandas as pd
df = pd.read_excel("your excel file location")

Then use the apply function to perform one operation on the entire column as follows

def get_filename(path):
    temp_str = path.split('/')
    return temp_str[-1]

df["filename"] = df["filename"].apply(get_filename)

Upvotes: 0

Celius Stingher
Celius Stingher

Reputation: 18367

You would need to modify the existing column by redifining it. First read it with pandas:

import pandas as pd
df = pd.read_csv('file_path\file_name.csv')
df['filename'] = df['filename'].map(lambda x: x.split('\\')[-1][:-4])
df = df.drop_duplicates()

This yields the expect result as a dataframe, so all you are missing is saving it back to csv/excel:

df.to_excel('file_path\new_file_name.xlsx')

Or csv:

df.to_csv('file_path\new_file_name.csv')

Upvotes: 0

Related Questions