Reputation: 625

Add column to pandas dataframe with partial file name while importing many files

I use this to make a giant dataframe from many files in a directory:

path = r'C:\\Users\\me\\data\\'              
all_files = glob.glob(os.path.join(path, "*"))

df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

The files in that path have names like

AAA.etc.etc.
AAA.etc.etc
BBB.etc.etc.

As I import each file, I want to add a column to the dataframe that has AAA or BBB next to all the rows imported from that file, like this:

col1  col2  col3
data1 data2 AAA
data3 data4 AAA
data1 data2 AAA
data3 data4 AAA
data1 data2 BBB
data3 data4 BBB

Upvotes: 0

Answers (3)

BENY

Reputation: 323226

You may check with keys + reset_index

key=[PureWindowsPath(i).name.split('.', 1)[0] for i in all_files]
concatdf = pd.concat(df_from_each_file, ignore_index=True,keys=key).reset_index(level=0)

Upvotes: 1

Joe Patten

Reputation: 1704

I usually change the current working directory to the path:

import os
os.chdir(path)

You can assign col3 to be the part of the filename you wish by using assign.

df_from_each_file = (pd.read_csv(f, sep='\t').assign(col3=f.split('.')[0]) for f in all_files)

So your code would look like:

os.chdir(path)
all_files = glob.glob('*')

df_from_each_file = (pd.read_csv(f).assign(col3=f.split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

If you don't want to change the current working directory, then you could use os.path.basename(path) to get the filenames in the path. so your code would look like:

all_files = glob.glob('*')
df_from_each_file = (pd.read_csv(f).assign(col3=os.path.basename(f).split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

Upvotes: 0

gold_cy

Reputation: 14216

This is one way to do it:

from pathlib import PureWindowsPath

def fn_helper(fn):
    df = pd.read_csv(fn, sep='\t')
    p = PureWindowsPath(fn)
    part = p.name.split('.')[0]
    df['col3'] = part
    return df

df_from_each_file = (fn_helper(f) for f in all_files)
...

Or as other people are showing with one-liners:

(pd.read_csv(f, sep='\t').assign(col3=PureWindowsPath(f).name.split('.')[0]) for f in all_files)

Upvotes: 2

Add column to pandas dataframe with partial file name while importing many files

Answers (3)

Related Questions