Selecting specific columns from multiple data-frame if the column name is same as the dataframe name

Question

I have multiple files, I need to select a specific column from each data frame and merge them finally. Here are my data frame examples,

>> df1.head()

ID     df1    fox    mnd
ADF   49.0   34.0   55.7
XCF   89.7   32.8   21.7

And the second and third data frames are,

>> df2.head()

ID    lat2     df2     sap
ADF  67.00   84.00   95.70
XCF  59.70   62.80   11.70
BHG  89.00   54.89    0.34

>> df3.head()

ID     df1     df2     df3
ADF  56.00   84.00   95.70
XCF  59.70   62.80   11.70
CXD  89.90    0.90    1.56

And so on, I have 37 such data frames with different dimensions. What I am looking for are:

First, I need to select only columns which are same as file name. For example, from df1 I need only column df1.
Finally, concatenate them all together as a single data frame.

The problem here is: for some of the data frames I have column names of with all filenames in the path, but for few data frames I do only have columns names with the filename and additional columns with other names. so my following piece of code is selecting all the columns from each data frames. Which is not what I am looking for.

path = 'usr/fils/data'
all_files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]
filenames = [os.path.basename(os.path.normpath(files)) for files in all_files]
main = []

for files in all_files:
    for samids in filenames:
        dfs = pd.DataFrame.from_csv(files, sep="	")
        dfs.reset_index(drop=False, inplace=True)
        if samids in dfs.columns:
            pc_matrix = dfs[[samids]]
            main.append(pc_matrix)
merged = pd.concat(main, axis=1)

For example, here in this case merged data frame consists of all columns from three data frames. However, this is not what I want.

In the end, I need my data frame should look like,

>> df_final

ID     df1     df2     df3
ADF  49.00   84.00   95.70
XCF  89.70   62.80   11.70
BHG     NA   89.00      NA
CXD     NA      NA    1.56

Any suggestions are much appreciated. Thank you!

jezrael · Accepted Answer

Use:

#https://www.dropbox.com/sh/mytlp1t6bro1yly/AAAofCoHrwZTtnn04NFYGSb1a?dl=0
all_files = glob.glob('path/*')
main = []
for files in all_files:
    c = os.path.basename(os.path.normpath(files))
    try:
        df = pd.read_csv(files, usecols=[c] + ['ID'], index_col=['ID'], sep='	')
        #if possible duplicated ID column - use mean or sum for unique values
        #df = df.mean(level=0)
        #df = df.sum(level=0)
        print (df)
        main.append(df)
    except:
        pass


merged = pd.concat(main,axis=1, sort=True)
print (merged)
      df1    df2    df3
ADF  49.0  84.00  95.70
BHG   NaN  54.89    NaN
CXD   NaN    NaN   1.56
XCF  89.7  62.80  11.70

Selecting specific columns from multiple data-frame if the column name is same as the dataframe name

Answers (1)

Related Questions