Reputation: 107
I have a CSV like this:
sku, col1, col2, test, col3, col4, col5, col6
123,456,99,A,,,,
234,786,99,,,,,
345,678,99,A,,,,
123,,,A,18-123,9999,,
234,,,,18-786,9999,,
345,,,A,12-678,9999,,
123,,,,,,18-123,9999
234,,,A,,,18-786,9999
345,,,,,,12-678,9999
It's the result of combining 3 CSVs (they have different headers, and the same sku can appear in all 3 files with different columns filled in).
My goal is to "group by" the sku column so that each sku ends up as a single merged row.
I'm pretty new to Python, so please excuse me if this is a trivial question, but I'm lost.
Also, I don't know the column names in advance, except for the sku column, which is the "primary" key.
Thanks
Upvotes: 1
Views: 1091
Reputation: 71570
A workable way (note: this assumes every non-sku column holds exactly one non-empty value per sku, so `dropna()` leaves exactly as many values as there are unique skus; a column like `test` above, with a varying number of filled cells per sku, would raise a length-mismatch error):

newdf = pd.DataFrame()
newdf['sku'] = df['sku'][:3]          # the 3 unique skus
for col in df.columns:
    if col != 'sku':
        newdf[col] = df[col].dropna().tolist()
print(newdf)
Upvotes: 0
Reputation: 13255
Using pandas.groupby and first:

df.groupby(['sku'], as_index=False).first()
sku col1 col2 test col3 col4 col5 col6
0 123 456.0 99.0 A 18-123 9999.0 18-123 9999.0
1 234 786.0 99.0 A 18-786 9999.0 18-786 9999.0
2 345 678.0 99.0 A 12-678 9999.0 12-678 9999.0
To remove the trailing .0 and convert to int where possible, use astype:

df.groupby(['sku'], as_index=False).first().astype(int, errors='ignore')
sku col1 col2 test col3 col4 col5 col6
0 123 456 99 A 18-123 9999 18-123 9999
1 234 786 99 A 18-786 9999 18-786 9999
2 345 678 99 A 12-678 9999 12-678 9999
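For reference, here is a self-contained sketch of the groupby/first approach run on the sample data from the question (the header is cleaned of stray spaces, and the data is inlined rather than read from a file):

```python
import io

import pandas as pd

# Sample data from the question
# (in practice: df = pd.read_csv("combined.csv"))
csv_data = """sku,col1,col2,test,col3,col4,col5,col6
123,456,99,A,,,,
234,786,99,,,,,
345,678,99,A,,,,
123,,,A,18-123,9999,,
234,,,,18-786,9999,,
345,,,A,12-678,9999,,
123,,,,,,18-123,9999
234,,,A,,,18-786,9999
345,,,,,,12-678,9999
"""

df = pd.read_csv(io.StringIO(csv_data))

# first() takes the first non-null value per column within each sku group
merged = df.groupby("sku", as_index=False).first()
print(merged)
```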
Upvotes: 3
Reputation: 2417
A simple solution preserving the original row order (updated for Python 3: text-mode files opened with newline='', and items() instead of the Python 2 iteritems()):

import csv
from collections import OrderedDict

result = OrderedDict()
with open('data.csv', newline='') as csvfile, open('out.csv', 'w', newline='') as outfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    header = False
    for row in csvreader:
        if not header:
            header = True
            size = len(row)
        # The header row falls through and is stored under the key 'sku',
        # so it is written out first (OrderedDict preserves insertion order)
        sku = row[0]
        existing_val = result.setdefault(sku, [''] * size)
        # Keep the first non-empty value seen for each column
        for idx, each_val in enumerate(row):
            existing_val[idx] = existing_val[idx] or each_val
    csvwriter = csv.writer(outfile, delimiter=',', quotechar='"')
    for each_sku, val_list in result.items():
        csvwriter.writerow(val_list)
Upvotes: 1
Reputation: 82765
Using the built-in csv and collections modules.

Ex:

import csv
from collections import defaultdict

d = defaultdict(dict)
# Merge rows by sku, keeping the last non-empty value seen for each column
with open(filename, newline='') as infile:
    reader = csv.DictReader(infile)
    for line in reader:
        d[line["sku"]].update({k.strip(): v for k, v in line.items() if v})

with open(filename1, "w", newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["sku", "col1", "col2", "test", "col3", "col4", "col5", "col6"])
    writer.writeheader()
    for k, v in d.items():
        writer.writerow(v)
Upvotes: 0
Reputation: 1960
If you can influence the initial processing of the CSV files, what you want to look for is a join: read all three CSV files and join them on the "sku" column. How to do this depends on the library or backend you use. With something like Spark or pandas you can use joins directly. In plain Python you would need a library, or you would have to write the join yourself, as described here: Stackoverflow Article for Joining
Upvotes: 0