Reputation: 2080
I have 25 data frames which I need to merge and find recurrently occurring rows from all 25 data frames, For example, my data frame looks like following,
df1
chr start end name
1 12334 12334 AAA
1 2342 2342 SAP
2 3456 3456 SOS
3 4537 4537 ABR
df2
chr start end name
1 12334 12334 DSF
1 3421 3421 KSF
2 7689 7689 LUF
df3
chr start end name
1 12334 12334 DSF
1 3421 3421 KSF
2 4537 4537 LUF
3 8976 8976 BAR
4 6789 6789 AIN
And In the end, I am aiming to have an output data frame like following,
chr start end name Sample
1 12334 12334 AAA df1
1 12334 12334 AAA df2
1 12334 12334 AAA df3
I can get there with the following solution, By dictionary which adds all these three data frames into one bigger data frame dfs
dfs = {'df1': df1, 'df2': df2}
Then further,
common_tups = set.intersection(*[set(df[['chr', 'start', 'end']].drop_duplicates().apply(tuple, axis=1).values) for df in dfs.values()])
pd.concat([df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)].assign(Sample=name) for (name, df) in dfs.items()])
This gives out the resulting data frame with matching rows from all three data frames, but I have 25 data frames which I am calling as list from the directory as following,
path = 'Fltered_vcfs/'
files = os.listdir(path)
results = [os.path.join(path,i) for i in files if i.startswith('vcf_filtered')]
And so how can I show the list 'results' in the dictionary and proceed further to get the desired output. Any help or suggestions are greatly appreciated.
Thank you
Upvotes: 1
Views: 224
Reputation: 76346
Using the glob
module, you can use
import os
from glob import glob
path = 'Fltered_vcfs'
f_names = glob(os.path.join(path, 'vcf_filtered*.*'))
Then, your dictionary can be created with dictionary comprehension using
import pandas as pd
{os.path.splitext(os.path.split(f_name)[1])[0]: pd.read_csv(f_name,sep='\t') for f_name in f_names}
Upvotes: 1