Reputation: 47
I'm parsing two files which have data as shown below:
File1:
UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67
file2:
UID X Y A Z B
------ ---------- ---------- ---------- ---------- ---------
456 536 1 148 304 234
1071 908 1 128 243 12
1118 4 8 52 162 123
249 4 8 68 154 987
1072 296 416 68 114 45
118 180 528 68 67 6
I will be comparing two such files; however, the number of columns and the column names might vary. For every unique UID, I need to match the column names, compare the values, and find the differences.
Questions: 1. Is there a way to access columns by column name instead of by index? 2. Can I assign column names dynamically based on the file data?
I'm able to load the files into lists and compare using indexes, but that's not a proper solution.
Thanks in advance.
Upvotes: 0
Views: 1441
Reputation: 48317
This is a good use case for pandas; loading the data is as simple as:
import pandas as pd
from io import StringIO  # Python 3; on Python 2: from StringIO import StringIO
data = """ UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
df = pd.read_csv(StringIO(data), skiprows=[1], sep=r'\s+')  # skiprows=[1] drops the ------ row
Let's inspect results:
>>> df
UID A B C D
0 456 536 1 148 304
1 1071 908 1 128 243
2 1118 4 8 52 162
3 249 4 8 68 154
4 1072 296 416 68 114
5 118 180 528 68 67
After loading file2 into df2 in the same way, we can merge the results:
>>> df.merge(df2, on=['UID'])
UID A_x B_x C D X Y A_y Z B_y
0 456 536 1 148 304 536 1 148 304 234
1 1071 908 1 128 243 908 1 128 243 12
2 1118 4 8 52 162 4 8 52 162 123
3 249 4 8 68 154 4 8 68 154 987
4 1072 296 416 68 114 296 416 68 114 45
5 118 180 528 68 67 180 528 68 67 6
The resulting pandas.DataFrame has a rich API, and all SQL-like analysis operations such as joining, filtering, grouping, and aggregating are easy to perform. Look for examples on this site or in the documentation.
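Building on that merge, here is a sketch of the per-UID comparison the question asks for. The suffixes `_1`/`_2`, the `_diff` column names, and the two-row sample data are my own choices for illustration:

```python
import pandas as pd
from io import StringIO

file1 = """UID A B C D
456 536 1 148 304
1071 908 1 128 243"""
file2 = """UID X Y A Z B
456 536 1 148 304 234
1071 908 1 128 243 12"""

df1 = pd.read_csv(StringIO(file1), sep=r'\s+')
df2 = pd.read_csv(StringIO(file2), sep=r'\s+')

# Columns present in both files (besides UID) are the ones to compare.
common = [c for c in df1.columns if c in df2.columns and c != 'UID']
merged = df1.merge(df2, on='UID', suffixes=('_1', '_2'))
for col in common:
    merged[col + '_diff'] = merged[col + '_1'] - merged[col + '_2']
print(merged[['UID'] + [c + '_diff' for c in common]])
```

This answers both questions at once: the shared column names are discovered from the data, and every lookup is by name, never by index.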
Upvotes: 2
Reputation: 15854
You might consider using csv.DictReader. It lets you address columns by name, and it handles a different list of columns for each file opened. Consider removing the ------
separator row between the header and the data, as it would otherwise be read as a data row.
Example:
import csv
with open('File1', 'r', newline='') as f:
    # If you don't pass field names,
    # they are taken from the first row.
    reader = csv.DictReader(f)
    for line in reader:
        # `line` is a dict {'UID': val, 'A': val, ... }
        print(line)
If your input format has no clear delimiter (multiple whitespaces), you can wrap the file with a generator that will compress continous whitespaces into e.g. a comma:
import csv
import re
r = re.compile(r'[ ]+')

def trim_whitespaces(f):
    for line in f:
        # Strip the line first so leading spaces don't create an empty field.
        yield r.sub(',', line.strip())

with open('test.txt', 'r', newline='') as f:
    reader = csv.DictReader(trim_whitespaces(f))
    for line in reader:
        print(line)
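Putting the pieces together, here is a sketch of the full comparison with this approach. The sample data (two rows per file, separator row already removed) and the use of io.StringIO in place of real file handles are my own assumptions:

```python
import csv
import io
import re

FILE1 = """UID A B C D
456 536 1 148 304
1071 908 1 128 243"""
FILE2 = """UID X Y A Z B
456 536 1 148 304 234
1071 908 1 128 243 12"""

ws = re.compile(r'[ ]+')

def rows(f):
    # Compress runs of spaces into commas so csv.DictReader can parse them,
    # then key each row dict by its UID for easy lookup.
    reader = csv.DictReader(ws.sub(',', line.strip()) for line in f)
    return {row['UID']: row for row in reader}

a = rows(io.StringIO(FILE1))
b = rows(io.StringIO(FILE2))
for uid in a:
    if uid not in b:
        continue
    # Compare only the columns both files share.
    shared = (a[uid].keys() & b[uid].keys()) - {'UID'}
    for col in sorted(shared):
        if a[uid][col] != b[uid][col]:
            print(uid, col, a[uid][col], b[uid][col])
```

Note that DictReader yields all values as strings; convert with int() if you need numeric differences rather than equality checks.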
Upvotes: 2
Reputation: 113950
my_text = """UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
lines = my_text.splitlines()                  # split your text into lines
keys = lines[0].split()                       # the headers are the first line
table = [line.split() for line in lines[2:]]  # skip the ------ row; the rest is data
columns = list(zip(*table))                   # transpose the rows into columns
my_dict = dict(zip(keys, columns))            # map each header to its column
print(my_dict['A'])                           # access a column by name
Obviously you would need to adapt this to read from a file rather than a string.
Alternatively, this is the kind of task packages like pandas were made for:
import pandas
# For the whitespace-delimited format above: skip the ------ row, use UID as the index
table = pandas.read_csv('foo.csv', sep=r'\s+', skiprows=[1], index_col=0)
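A runnable sketch of that pandas route, with the sample data inlined via StringIO standing in for the file (my own abbreviation to two data rows):

```python
import pandas as pd
from io import StringIO

data = """UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243"""

# skiprows=[1] drops the dashed separator; index_col=0 makes UID the index
df = pd.read_csv(StringIO(data), sep=r'\s+', skiprows=[1], index_col=0)
print(df['A'])           # a whole column, by name
print(df.loc[456, 'A'])  # a single cell, by UID and column name
```

With UID as the index, row lookups by UID and column lookups by name both work directly, which is exactly what the question asks for.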
Upvotes: 1