Novajik
Novajik

Reputation: 47

2d list in python - accessing through column names

I'm parsing two files which has data as shown below

File1:

    UID       A        B            C           D   
    ------ ---------- ---------- ---------- ---------- 
    456          536         1       148       304 
    1071         908         1       128       243 
    1118           4         8        52       162 
    249            4         8        68       154 
    1072         296       416        68       114 
    118          180       528        68        67 

file2:

    UID       X         Y            A           Z         B   
    ------ ---------- ---------- ---------- ---------- ---------
    456          536         1       148       304        234
    1071         908         1       128       243        12
    1118           4         8        52       162        123
    249            4         8        68       154        987
    1072         296       416        68       114         45
    118          180       528        68        67          6

I will be comparing two such files, however the number of columns might vary and the columns names. For every unique UID, I need to match the column names, compare and find the difference.

Questions 1. Is there a way to access columns by column names instead of index? 2. Dynamically give column names based on the file data?

I'm able to load the file into list, and compare using indexes, but thats not a proper solutions.

Thanks in advance.

Upvotes: 0

Views: 1441

Answers (3)

alko
alko

Reputation: 48317

This is an good use case for pandas, loading data is as simple as:

import pandas as pd
from StringIO import StringIO

data = """    UID       A        B            C           D
    ------ ---------- ---------- ---------- ----------
    456          536         1       148       304
    1071         908         1       128       243
    1118           4         8        52       162
    249            4         8        68       154
    1072         296       416        68       114
    118          180       528        68        67 """

df = pd.read_csv(StringIO(data),skiprows=[1],delimiter=r'\s+')

Let's inspect results:

>>> df
    UID    A    B    C    D
0   456  536    1  148  304
1  1071  908    1  128  243
2  1118    4    8   52  162
3   249    4    8   68  154
4  1072  296  416   68  114
5   118  180  528   68   67

After obtaining df2 with similar means we can merge results:

>>> df.merge(df2, on=['UID'])
    UID  A_x  B_x    C    D    X    Y  A_y    Z  B_y
0   456  536    1  148  304  536    1  148  304  234
1  1071  908    1  128  243  908    1  128  243   12
2  1118    4    8   52  162    4    8   52  162  123
3   249    4    8   68  154    4    8   68  154  987
4  1072  296  416   68  114  296  416   68  114   45
5   118  180  528   68   67  180  528   68   67    6

Resulting pandas.DataFrame has a very profound API, and all SQL-like analisis operations such as joining, filtering, grouping, aggregating etc are easy to perform. Look for examples on this site or in the documentation.

Upvotes: 2

Maciej Gol
Maciej Gol

Reputation: 15854

You might consider using csv.DictReader. It allows you both to address columns by names, and a variable list of columns for each file opened. Consider removing the ------ separating header from actual data as it might be read wrong.

Example:

import csv
with open('File1', 'r', newline='') as f:
    # If you don't pass field names
    # they are taken from the first row.
    reader = csv.DictReader(f)
    for line in reader:
        # `line` is a dict {'UID': val, 'A': val, ... }
        print line

If your input format has no clear delimiter (multiple whitespaces), you can wrap the file with a generator that will compress continous whitespaces into e.g. a comma:

import csv
import re

r = re.compile(r'[ ]+')


def trim_whitespaces(f):
    for line in f:
        yield r.sub(',', line)

with open('test.txt', 'r', newline='') as f:
    reader = csv.DictReader(trim_whitespaces(f))
    for line in reader:
        print line

Upvotes: 2

Joran Beasley
Joran Beasley

Reputation: 113950

my_text = """UID       A        B            C           D   
    ------ ---------- ---------- ---------- ---------- 
    456          536         1       148       304 
    1071         908         1       128       243 
    1118           4         8        52       162 
    249            4         8        68       154 
    1072         296       416        68       114 
    118          180       528        68        67     """
lines = my_text.splitlines() #split your text into lines
keys= lines[0].split() #headers is your first line
table = [line.split() for line in lines[1:]] #the data is the rest
columns = zip(*table) #transpose the rows array to a columns array
my_dict = dict(zip(keys,columns)) #create a dict using your keys from earlier and matching them with columns

print my_dict['A'] #access

obviously you would need to change it if you had to read from a file say

alternatively this is what packages like pandas were made for

import pandas
table = pandas.read_csv('foo.csv', index_col=0)

Upvotes: 1

Related Questions