Reputation: 312
I have a DataFrame with the column of file paths.
I want to change it to only the file name.
My DataFrame looks like:
df = pd.DataFrame({
'Sr No': [18, 19, 20],
'Email': ['[email protected]', '[email protected]', '[email protected]'],
'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv',
r'C:/Users\Test1.csv']
})
Sr No | filename | |
---|---|---|
18 | [email protected] | C:/Users\Test.csv |
19 | [email protected] | C:/Users\Test1.csv |
20 | [email protected] | C:/Users\Test1.csv |
In short, my output should look like:
df = pd.DataFrame({
'Sr No': [18, 19, 20],
'Email': ['[email protected]', '[email protected]', '[email protected]'],
'filename': ['Test', 'Test1', 'Test1']
})
Sr No | filename | |
---|---|---|
18 | [email protected] | Test |
19 | [email protected] | Test1 |
20 | [email protected] | Test1 |
I want to do it using python and pandas DataFrame.
I have 100 of rows in the 'filename' column.
I tried using:
import os
import glob
myfile = os.path.basename('C:/Users/Test.csv')
os.path.splitext(myfile)
print(os.path.splitext(myfile)[0])
But it is only useful for one path, how to apply it to entire column?
Upvotes: 4
Views: 4303
Reputation: 993
Use pandas.Series.apply
to iterate through the column, and assign the result to new column.
df["filename"] = df["filename"].apply(os.path.basename)
or
df["filename"] = df["filename"].apply(lambda path: os.path.basename(path))
Example:
>>> df
Sr No Email filename
0 18 [email protected] C:/Users\Test.csv
1 19 [email protected] C:/Users\Test1.csv
2 20 [email protected] C:/Users\Test1.csv
>>> df["filename"] = df["filename"].apply(os.path.basename)
>>> df
Sr No Email filename
0 18 [email protected] Test.csv
1 19 [email protected] Test1.csv
2 20 [email protected] Test1.csv
There is also an option using Path('C:/Users\Test.csv').name
from the pathlib
module, but this is slower than os.path.basename
because pathlib
converts the string
to a pathlib
object.
Providing the slash prior to the file name is consistent, the fastest option is with pandas.Series.str.split
(e.g. df['filename'].str.split('\\', expand=True).iloc[:, -1]
).
Tested in python 3.11.2
and pandas 2.0.0
%timeit
testingimport pandas as pd
import os
from pathlib import Path
# sample dataframe with 30000 rows
df = pd.DataFrame({'Sr No': [18, 19, 20],
'Email': ['[email protected]', '[email protected]', '[email protected]'],
'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv']})
df = pd.concat([df] * 10000, ignore_index=True)
# timeit tests
%timeit df["filename"].apply(lambda path: Path(path).name)
%timeit df["filename"].apply(os.path.basename)
%timeit df["filename"].apply(lambda path: os.path.basename(path))
%timeit df['filename'].str.split('\\', expand=True).iloc[:, -1]
67.4 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 9
Reputation: 175
Assuming you have fixed length of directory in filename column as given:
# created dataframe for example
df = pd.DataFrame({'Email':['[email protected]','[email protected]','[email protected]'],
'filename':['c:/users\test.csv','c:/users\test1.csv','c:/users\test1.csv']} ) # dataframe
# will create new column with file name only
df['only_filename'] = [(path.encode('unicode_escape')[9:]).decode("utf-8") for path in df['filename']]
Upvotes: 0
Reputation: 384
In addition to the above answers you could also use the string methods:
df['filename'] = df['filename'].str.split('/')[-1]
Not sure which is fastest.
Upvotes: -1
Reputation: 170
Read the excel file into a pandas dataframe using
import pandas as pd
df = pd.read_excel("your excel file location")
Then use the apply function to perform one operation on the entire column as follows
def get_filename(path):
temp_str = path.split('/')
return temp_str[-1]
df["filename"] = df["filename"].apply(get_filename)
Upvotes: 0
Reputation: 18367
You would need to modify the existing column by redifining it. First read it with pandas:
import pandas as pd
df = pd.read_csv('file_path\file_name.csv')
df['filename'] = df['filename'].map(lambda x: x.split('\\')[-1][:-4])
df = df.drop_duplicates()
This yields the expect result as a dataframe, so all you are missing is saving it back to csv/excel:
df.to_excel('file_path\new_file_name.xlsx')
Or csv:
df.to_csv('file_path\new_file_name.csv')
Upvotes: 0