Ling Guo
Ling Guo

Reputation: 592

Intersection of rows from text files using python

I have a number of text files which have text and numerical entries in them. More specifically, the first two rows and the first column of all these files are text, and all other entries are comprised of numbers. As an example, let's consider three separate text files (files A, B and C) in the formats as shown below.

File A:

Type    A1  A2  A3  A4  A5
Tag TagA1   TagA1   TagA1   TagA2   TagA2
object1 1.1 2.1 3.1 4.1 5.1
object2 1.2 2.2 3.2 4.2 5.2
object4 1.4 2.4 3.4 4.4 5.4
object7 1.7 2.7 3.7 4.7 5.7
object8 1.8 2.8 3.8 4.8 5.8
object9 1.9 2.9 3.9 4.9 5.9

File B:

Type    B1  B2  
Tag TagB1   TagB2   
object1 11.1    12.1    
object3 11.3    12.3    
object4 11.4    12.4    
object5 11.5    12.5    
object7 11.7    12.7    
object9 11.9    12.9    

File C:

Type    C1  C2  C3  C4  
Tag TagC1   TagC1   TagC2   TagC2   
object1 21.1    22.1    23.1    24.1    
object4 21.4    22.4    23.4    24.4    
object5 21.5    22.5    23.5    24.5    
object6 21.6    22.6    23.6    24.6    
object7 21.7    22.7    23.7    24.7    
object9 21.9    22.9    23.9    24.9    

I want to use python to create a separate text file which contains only those rows (starting from row 3) which are present in all these three files (i.e. intersection of rows). This final text file is expected to keep relevant columns from all these three files. Following would be the final file that I want:

Type    A1  A2  A3  A4  A5  B1  B2  C1  C2  C3  C4
Tag TagA1   TagA1   TagA1   TagA2   TagA2   TagB1   TagB2   TagC1   TagC1   TagC2   TagC2
object1 1.1 2.1 3.1 4.1 5.1 11.1    12.1    21.1    22.1    23.1    24.1
object4 1.4 2.4 3.4 4.4 5.4 11.4    12.4    21.4    22.4    23.4    24.4
object7 1.7 2.7 3.7 4.7 5.7 11.7    12.7    21.7    22.7    23.7    24.7
object9 1.9 2.9 3.9 4.9 5.9 11.9    12.9    21.9    22.9    23.9    24.9

I realize that I can use pandas to import each of the mentioned text files using a command like the following:

df = pd.read_csv('filename.txt',sep='\t', header=[0,1], index_col=0)

How can I use python to make this final file once I have files A, B and C? I would like to use python 2.7 for this.

Upvotes: 2

Views: 412

Answers (2)

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

As @Jamin commented you can directly use join='inner' to be on the safest side without dropping the nans i.e.

df_m = pd.concat([A,B,C],axis=1,join='inner')

Also you can use concat and dropna, dropna will remove the rows with missing values so you will be left with only rows that are found in all the files i.e

df_m = pd.concat([A,B,C],axis=1).dropna()

Type       A1    A2    A3    A4    A5    B1    B2    C1    C2    C3    C4
Tag     TagA1 TagA1 TagA1 TagA2 TagA2 TagB1 TagB2 TagC1 TagC1 TagC2 TagC2
object1   1.1   2.1   3.1   4.1   5.1  11.1  12.1  21.1  22.1  23.1  24.1
object4   1.4   2.4   3.4   4.4   5.4  11.4  12.4  21.4  22.4  23.4  24.4
object7   1.7   2.7   3.7   4.7   5.7  11.7  12.7  21.7  22.7  23.7  24.7
object9   1.9   2.9   3.9   4.9   5.9  11.9  12.9  21.9  22.9  23.9  24.9

This solution holds good if you confirm that there are no nans in each of the dataframes.

Upvotes: 4

Vivek Kalyanarangan
Vivek Kalyanarangan

Reputation: 9081

Use pd.merge() -

from functools import reduce
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on=None,left_index=True,right_index=True), dfs)

OR

df1.join(df2, how='inner').join(df3, how='inner')

These gives you a foolproof way of joining by keys -

   Type       A1    A2    A3    A4    A5    B1    B2    C1    C2    C3    C4
    Tag     TagA1 TagA1 TagA1 TagA2 TagA2 TagB1 TagB2 TagC1 TagC1 TagC2 TagC2
    object1   1.1   2.1   3.1   4.1   5.1  11.1  12.1  21.1  22.1  23.1  24.1
    object4   1.4   2.4   3.4   4.4   5.4  11.4  12.4  21.4  22.4  23.4  24.4
    object7   1.7   2.7   3.7   4.7   5.7  11.7  12.7  21.7  22.7  23.7  24.7
    object9   1.9   2.9   3.9   4.9   5.9  11.9  12.9  21.9  22.9  23.9  24.9

Timings

@Dark

2.63 ms ± 598 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Vivek

2.9 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Vivek (2nd - using join)

5.4 ms ± 662 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Since there is no hassle on going by keys, @Dark's solution is slightly more efficient. But since OP mentioned contains only those rows (starting from row 3) which are present in all these three files, I went by the merge way rather than the concat way

Upvotes: 6

Related Questions