Reputation: 2886
I have a data set which contains 5,000+ attributes. The table looks like the one below:
id  attr1  attr2  attr3
a   0      1      0
a   1      0      0
a   0      0      0
a   0      0      1
I wish to represent each record on a single row, as in the table below, to make it more amenable to data mining via clustering:
id  attr1  attr2  attr3
a   1      1      1
I have tried a multitude of ways of doing this. I tried importing it into a MySQL database and taking the max value for each attribute (they can only be 1 or 0 for each ID), but a table can't hold 5,000+ columns.
I have tried using the pivot function in Excel and taking the max value per attribute, but the number of columns a pivot can handle is far less than the 5,000 I'm currently looking at.
I have tried importing it into Tableau, but that also can't handle so many records.
I just want to get Table 2 into either a text/CSV file or a database table.
Can anyone suggest anything at all, a piece of software or an approach I have not yet considered?
Upvotes: 0
Views: 81
Reputation: 43497
Here is a Python script which does what you ask for:
def merge_rows_by_id(path):
    rows = dict()
    with open(path) as in_file:
        header = in_file.readline().rstrip()
        for line in in_file:
            fields = line.split()
            row_id, attributes = fields[0], fields[1:]
            if row_id not in rows:
                rows[row_id] = attributes
            else:
                # Element-wise max works here because the values are the
                # strings '0' and '1', and '1' > '0' compares correctly.
                rows[row_id] = [max(x) for x in zip(rows[row_id], attributes)]
    # Emit the header in the same comma-separated form as the data rows.
    print(','.join(header.split()))
    for row_id in rows:
        print('{},{}'.format(row_id, ','.join(rows[row_id])))

merge_rows_by_id('my-data.txt')
It was written for clarity more than for maximum efficiency, although it's still reasonably efficient. However, this will still leave you with lines of 5,000 attributes, just fewer of them.
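If installing pandas is an option, the same collapse can be expressed in a few lines via groupby. A sketch, assuming the data is whitespace-delimited with the header shown in the question (here fed from a string for demonstration; you would pass your file path to read_csv instead):

import io
import pandas as pd

# Sample data in the question's layout (whitespace-delimited).
data = io.StringIO("""id attr1 attr2 attr3
a 0 1 0
a 1 0 0
a 0 0 0
a 0 0 1
""")

df = pd.read_csv(data, sep=r'\s+')

# One row per id, taking the max of every attribute column.
merged = df.groupby('id', as_index=False).max()

print(merged.to_csv(index=False))

pandas has no hard-coded column limit in the thousands, so the 5,000-attribute width is not a problem for it the way it is for MySQL or Excel pivots; merged.to_csv('out.csv', index=False) would write the result straight to a CSV file.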
I've seen this data "structure" used too often in bioinformatics, where the researchers just say "put everything we know about 'a' on one row", and then the set of "everything" doubles, and re-doubles, and so on. I've had to teach them about data normalization to make an RDBMS handle what they've got. Usually attr_1…n are from one trial, attr_n+1…m are from a second trial, and so on, which allows for a sensible normalization of the data.
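As a concrete illustration of that normalization: since the values are only 0 or 1, the wide table can be stored as (id, attribute) pairs, recording only the 1s, so the 5,000 columns never have to exist at all. A minimal sketch, using made-up sample data in the question's layout:

# Wide input in the question's layout (whitespace-delimited).
wide = """id attr1 attr2 attr3
a 0 1 0
a 1 0 0
a 0 0 1
"""

lines = wide.strip().splitlines()
header = lines[0].split()

# Normalized long form: one (id, attribute) pair per observed 1.
long_rows = []
for line in lines[1:]:
    fields = line.split()
    row_id, values = fields[0], fields[1:]
    for attr, value in zip(header[1:], values):
        if value == '1':
            long_rows.append((row_id, attr))

print(long_rows)

In a database this becomes a two-column table with a composite key on (id, attribute); the "one row per id with max per attribute" view is then just a query, and adding a new trial's attributes never widens any table.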
Upvotes: 3