Reputation: 107
I have a CSV like this:
sku, col1, col2, test, col3, col4, col5, col6
123,456,99,A,,,,
234,786,99,,,,,
345,678,99,A,,,,
123,,,A,18-123,9999,,
234,,,,18-786,9999,,
345,,,A,12-678,9999,,
123,,,,,,18-123,9999
234,,,A,,,18-786,9999
345,,,,,,12-678,9999
It's the result of combining 3 CSVs (they have different headers, and the same sku can appear in all 3 files with different columns filled in).
My goal is to "group by" the sku column so that each sku ends up as a single merged row.
I'm pretty new to Python, so please excuse me if this is a trivial question, but I'm lost.
Also, I don't know the column names in advance, except for the sku column, which is the "primary" key.
Thanks
Upvotes: 1
Views: 1091
Reputation: 71570
A workable way (note: this assumes every non-sku column holds exactly one non-empty value per sku, so `dropna()` leaves exactly as many values as there are unique skus; a column like `test` above, with a varying number of filled cells per sku, would raise a length-mismatch error):

newdf = pd.DataFrame()
newdf['sku'] = df['sku'][:3]          # the 3 unique skus
for col in df.columns:
    if col != 'sku':
        newdf[col] = df[col].dropna().tolist()
print(newdf)
Upvotes: 0
Reputation: 13255
Using pandas.groupby and first:

df.groupby(['sku'], as_index=False).first()
sku col1 col2 test col3 col4 col5 col6
0 123 456.0 99.0 A 18-123 9999.0 18-123 9999.0
1 234 786.0 99.0 A 18-786 9999.0 18-786 9999.0
2 345 678.0 99.0 A 12-678 9999.0 12-678 9999.0
To remove the trailing .0 and convert to int where possible, use astype:

df.groupby(['sku'], as_index=False).first().astype(int, errors='ignore')
sku col1 col2 test col3 col4 col5 col6
0 123 456 99 A 18-123 9999 18-123 9999
1 234 786 99 A 18-786 9999 18-786 9999
2 345 678 99 A 12-678 9999 12-678 9999
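For reference, here is a self-contained sketch of the groupby/first approach run on the sample data from the question (the header is cleaned of stray spaces, and the data is inlined rather than read from a file):

```python
import io

import pandas as pd

# Sample data from the question
# (in practice: df = pd.read_csv("combined.csv"))
csv_data = """sku,col1,col2,test,col3,col4,col5,col6
123,456,99,A,,,,
234,786,99,,,,,
345,678,99,A,,,,
123,,,A,18-123,9999,,
234,,,,18-786,9999,,
345,,,A,12-678,9999,,
123,,,,,,18-123,9999
234,,,A,,,18-786,9999
345,,,,,,12-678,9999
"""

df = pd.read_csv(io.StringIO(csv_data))

# first() takes the first non-null value per column within each sku group
merged = df.groupby("sku", as_index=False).first()
print(merged)
```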
Upvotes: 3
Reputation: 2417
A simple solution preserving the original row order (updated for Python 3: text-mode files opened with newline='', and items() instead of the Python 2 iteritems()):

import csv
from collections import OrderedDict

result = OrderedDict()
with open('data.csv', newline='') as csvfile, open('out.csv', 'w', newline='') as outfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    header = False
    for row in csvreader:
        if not header:
            header = True
            size = len(row)
        # The header row falls through and is stored under the key 'sku',
        # so it is written out first (OrderedDict preserves insertion order)
        sku = row[0]
        existing_val = result.setdefault(sku, [''] * size)
        # Keep the first non-empty value seen for each column
        for idx, each_val in enumerate(row):
            existing_val[idx] = existing_val[idx] or each_val
    csvwriter = csv.writer(outfile, delimiter=',', quotechar='"')
    for each_sku, val_list in result.items():
        csvwriter.writerow(val_list)
Upvotes: 1
Reputation: 82765
Using the built-in csv and collections modules.

Ex:

import csv
from collections import defaultdict

d = defaultdict(dict)
# Merge rows by sku, keeping the last non-empty value seen for each column
with open(filename, newline='') as infile:
    reader = csv.DictReader(infile)
    for line in reader:
        d[line["sku"]].update({k.strip(): v for k, v in line.items() if v})

with open(filename1, "w", newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["sku", "col1", "col2", "test", "col3", "col4", "col5", "col6"])
    writer.writeheader()
    for k, v in d.items():
        writer.writerow(v)
Upvotes: 0
Reputation: 1960
If you can influence the initial processing of the CSV files, what you want to look for is a join: read all three CSV files and join them on the "sku" column. How to do this depends on the library or backend you use. With something like Spark or pandas you can use joins directly. In plain Python you would need a library, or you would have to write the join yourself, as described here: Stackoverflow Article for Joining
Upvotes: 0