Reputation: 27
I'm new to Python. I searched other questions here and didn't find the exact situation I'm running into.
I need to be able to read in the contents of File-A and pull out the matching lines from File-B.
I know how to do this in PowerShell, but on large files it is very slow and I am trying to learn Python.
File-A contains just loan numbers - An 8 to 10 digit number per line File-A can contain 1 to thousands of lines
File-B can contain 1 to thousands of lines and has more data in it but each line will start with the same 8 to 10 digit loan number.
I need to read in File-A and find the matching line in File-B and write out those matching lines to a new file (all are text files)
Example contents of File-A - no spaces - 1 loan per line
272991
272992
272993
Example contents of File-B
272991~20210129~\\Serv1\LOC7\675309\867530\016618\272991.pdf~0
272992~20210129~\\Serv1\LOC7\675309\867530\016618\272992.pdf~0
272993~20210129~\\Serv1\LOC7\675309\867530\016618\272993.pdf~0
Is someone able to assist, point me in the right direction or better yet, provide a working solution?
Here is what I have tried so far but all it does is create the new PulledLoans.txt file with nothing in it
import os
# os.system('cls')
os.chdir('C:\\Temp\\')
print(os.getcwd())
# read file
loanFile = 'Loans.txt'
SourceFile = 'Orig.txt'
NewFile = 'PulledLoans.txt'
with open(loanFile, 'r') as content, open(SourceFile, 'r') as source:
# print(content.readlines())
for loan in content:
# print(loan, end='')
if loan in source:
print('found loan')
with open(SourceFile) as dfile, open(loanFile) as ifile:
lines = "\n".join(set(dfile.read().splitlines()) & set(ifile.read().splitlines()))
print(lines)
with open(NewFile, 'w') as ofile:
ofile.write(lines)
Upvotes: 0
Views: 907
Reputation: 25489
First, read everything from fileB
into a dictionary where the keys are the identifier, and the values are the entire line
file_b_data = dict()
with open("fileB") as f_b:
for line in f_b:
line = line.strip() # Remove whitespace at start and end
if not line:
continue # If the line is blank, skip
row = line.split("~") # Split by ~
identifier = row[0] # First element is the identifier
file_b_data[identifier] = line # Set the value of the dictionary
Next, read the lines from fileA
and get the matching values from your dictionary
with open("fileA") as f_a, open("outfile", "w") as f_w:
for identifier in f_a:
identifier = identifier.strip()
if not identifier:
continue
if identifier in file_b_data: # Check that the identifier exists in previously read data
out_line = file_b_data[identifier] + "\n" # Get the value from the dict
f_w.write(out_line) # Write it to output file
Alternatively, you could use the pandas
module to read all of fielA
and fileB
into a dataframe, and then find the correct rows.
import pandas as pd
file_b_data = pd.read_csv("fileB.txt", sep="~", names=["identifier", "date", "path", "something"], index_col=0)
Which gives us this dataframe:
identifier date path something
272991 20210129 \\Serv1\LOC7\675309\867530\016618\272991.pdf 0
272992 20210129 \\Serv1\LOC7\675309\867530\016618\272992.pdf 0
272993 20210129 \\Serv1\LOC7\675309\867530\016618\272993.pdf 0
Same for fileA
: (I removed 272992 to illustrate that it actually works)
file_a_data = pd.read_csv("fileA.txt", names="identifier")
giving us
identifier
0 272991
1 272993
Then, look for these indices in file_b_data
:
wanted_ids = file_a_data['identifiers']
wanted_rows = file_b_data.loc[wanted_ids, :]
wanted_rows.to_csv("out_file.txt", sep="~",header=None)
which will write this file: (notice the 272992
row is missing because it wasn't in fileA
)
272991~20210129~\\Serv1\LOC7\675309\867530\016618\272991.pdf~0
272993~20210129~\\Serv1\LOC7\675309\867530\016618\272993.pdf~0
Upvotes: 1