Reputation: 2080
I have multiple files, I need to select a specific column from each data frame and merge them finally. Here are my data frame examples,
>> df1.head()
ID df1 fox mnd
ADF 49.0 34.0 55.7
XCF 89.7 32.8 21.7
And the second and third data frames are,
>> df2.head()
ID lat2 df2 sap
ADF 67.00 84.00 95.70
XCF 59.70 62.80 11.70
BHG 89.00 54.89 0.34
>> df3.head()
ID df1 df2 df3
ADF 56.00 84.00 95.70
XCF 59.70 62.80 11.70
CXD 89.90 0.90 1.56
And so on, I have 37 such data frames with different dimensions. What I am looking for are:
First, I need to select only columns which are same as file name. For example, from df1
I need only column df1
.
Finally, concatenate them all together as a single data frame.
The problem here is: for some of the data frames I have column names of with all filenames in the path, but for few data frames I do only have columns names with the filename and additional columns with other names. so my following piece of code is selecting all the columns from each data frames. Which is not what I am looking for.
path = 'usr/fils/data'
all_files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]
filenames = [os.path.basename(os.path.normpath(files)) for files in all_files]
main = []
for files in all_files:
for samids in filenames:
dfs = pd.DataFrame.from_csv(files, sep="\t")
dfs.reset_index(drop=False, inplace=True)
if samids in dfs.columns:
pc_matrix = dfs[[samids]]
main.append(pc_matrix)
merged = pd.concat(main, axis=1)
For example, here in this case merged
data frame consists of all columns from three data frames. However, this is not what I want.
In the end, I need my data frame should look like,
>> df_final
ID df1 df2 df3
ADF 49.00 84.00 95.70
XCF 89.70 62.80 11.70
BHG NA 89.00 NA
CXD NA NA 1.56
Any suggestions are much appreciated. Thank you!
Upvotes: 1
Views: 517
Reputation: 862521
Use:
#https://www.dropbox.com/sh/mytlp1t6bro1yly/AAAofCoHrwZTtnn04NFYGSb1a?dl=0
all_files = glob.glob('path/*')
main = []
for files in all_files:
c = os.path.basename(os.path.normpath(files))
try:
df = pd.read_csv(files, usecols=[c] + ['ID'], index_col=['ID'], sep='\t')
#if possible duplicated ID column - use mean or sum for unique values
#df = df.mean(level=0)
#df = df.sum(level=0)
print (df)
main.append(df)
except:
pass
merged = pd.concat(main,axis=1, sort=True)
print (merged)
df1 df2 df3
ADF 49.0 84.00 95.70
BHG NaN 54.89 NaN
CXD NaN NaN 1.56
XCF 89.7 62.80 11.70
Upvotes: 1