Reputation: 23
For my research I have a CSV file in which each row stores a userId, a message, and a label (about the userId), as follows:
UserId txt label
1 This is a true
1 part of true
1 the whole true
1 message true
2 more false
2 text false
What I want to achieve is to combine, for example, two entries of every user into one row. For the sample above, I would want the following output in a CSV:
UserId txt label
1 This is a part of true
2 more text false
I don't know how to achieve this efficiently (with Python?), because the file contains 3 million rows for 20 thousand users. I would like to end up with a file that has only 20 thousand rows.
Upvotes: 1
Views: 76
Reputation: 1005
Here is an SQLite solution, which should be very fast.
import pandas as pd
import sqlite3 as db
path = 'path/to/some.csv'
df = pd.read_csv(path)
# Load the CSV into an SQLite table, replacing any previous copy
conn = db.connect('my_solution.db')
df.to_sql('table_from_df', conn, if_exists='replace', index=False)
# Concatenate all txt values per user; label is kept by grouping on it as well
sql_query = '''
select
    userid,
    group_concat(txt, ' ') as txt,
    label
from table_from_df
group by userid, label
order by userid
'''
out_df = pd.read_sql_query(sql_query, conn)
print(out_df)
conn.close()
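To check what group_concat does without touching the real file, the same idea can be tried on a few rows in an in-memory database. This is only a sketch: the table name messages and the sample rows are made up for illustration, and it uses the standard-library sqlite3 module directly instead of pandas. One caveat: SQLite has historically documented the order of the concatenated values within a group as arbitrary, even though in practice it tends to follow insertion order.

```python
import sqlite3

# Hypothetical sample rows mirroring the question's layout: (userid, txt, label)
rows = [
    (1, "This is a", "true"),
    (1, "part of", "true"),
    (2, "more", "false"),
    (2, "text", "false"),
]

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
conn.execute("create table messages (userid integer, txt text, label text)")
conn.executemany("insert into messages values (?, ?, ?)", rows)

# One output row per (userid, label); txt values are joined with a space
merged = conn.execute(
    """
    select userid, group_concat(txt, ' ') as txt, label
    from messages
    group by userid, label
    order by userid
    """
).fetchall()
conn.close()

for row in merged:
    print(row)
```

Grouping on both userid and label assumes the label never changes for a given user; if it did, that user would come out as more than one row.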
Upvotes: 0
Reputation: 75850
Here is a method using pandas, groupby in combination with join:
import pandas as pd
df = pd.read_csv(r'C:\YourDir\YourFile.csv',sep=',')
df = df.groupby(['UserId','label'])['txt'].apply(' '.join).reset_index()
print(df)
Result:
UserId label txt
0 1 True This is a part of the whole message
1 2 False more text
Note: Use the appropriate separator for the sep parameter. I have used a comma.
You can write this back (overwrite) to csv like:
df.to_csv(r'C:\YourDir\YourFile.csv', sep=',', index=False)
Upvotes: 1
Reputation: 10430
How to combine certain fields of different rows of csv file into one row
Try this (Assuming file is delimited by ",", i.e., it is CSV):
di = {}
with open("file.txt", "r") as fi:
    fi.readline()  # skip the header row
    for line in fi:
        # split on commas and collapse the whitespace inside each field
        l = [' '.join(i.split()) for i in line.split(',')]
        if l[0] in di:
            di[l[0]][0] += " " + l[1]
        else:
            di[l[0]] = [l[1], l[2]]
print(di)
with open("out.txt", "w") as fi:
    fi.write("UserId, txt, label\n")
    for k,v in di.items():
        fi.write("{},{},{}\n".format(k,v[0],v[1]))
Outputs:
{'1': ['This is a part of the whole message', 'true'], '2': ['more text', 'false']}
File: out.txt
UserId, txt, label
1,This is a part of the whole message,true
2,more text,false
File: file.txt:
UserId, txt, label
1, This is a, true
1, part of, true
1, the whole, true
1, message, true
2, more, false
2, text, false
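Splitting on "," by hand breaks as soon as a message itself contains a comma. A possible variant, sketched here with the standard csv module, lets the reader handle quoting. It also adds a hypothetical max_parts parameter (not part of the code above) to cover the "two entries per user" case from the question. The names merge_rows and SAMPLE are made up for this example.

```python
import csv
import io

SAMPLE = (
    "UserId, txt, label\n"
    "1, This is a, true\n"
    "1, part of, true\n"
    "1, the whole, true\n"
    "1, message, true\n"
    "2, more, false\n"
    "2, text, false\n"
)

def merge_rows(lines, max_parts=None):
    """Merge txt per UserId, keeping at most max_parts messages per user (None = all)."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    merged = {}  # user_id -> (list of txt parts, label of the first row seen)
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        user_id, txt, label = (field.strip() for field in row[:3])
        parts, _ = merged.setdefault(user_id, ([], label))
        if max_parts is None or len(parts) < max_parts:
            parts.append(txt)
    return [(uid, " ".join(parts), label) for uid, (parts, label) in merged.items()]

result = merge_rows(io.StringIO(SAMPLE), max_parts=2)
print(result)
# [('1', 'This is a part of', 'true'), ('2', 'more text', 'false')]

# csv.writer quotes fields that contain commas when writing the result back
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["UserId", "txt", "label"])
writer.writerows(result)
```

For the real file, pass an open file object (opened with newline="") instead of io.StringIO(SAMPLE), and leave max_parts at None to merge every message of a user.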
Upvotes: 0
Reputation: 98921
Your file doesn't seem to be comma-separated. If so, the following may help:
import re
user_dict = {}
with open("csv_merge.csv") as f:
    for l in f:
        # the header and any malformed lines do not match and are skipped
        for m in re.finditer(r"^(\d+)\s*(.*?)\s*(true|false)\s*$", l, re.IGNORECASE):
            user, txt, label = m.group(1), m.group(2), m.group(3)
            if user not in user_dict:
                user_dict[user] = {"txt": txt, "label": label}
            else:
                user_dict[user]["txt"] += " " + txt
                # as far as I could understand, label doesn't change
with open("csv_merge_new.csv", "w") as f:
    f.write("UserId,txt,label\n")  # comma separated
    for k, v in user_dict.items():
        f.write(f"{k},{v['txt']},{v['label']}\n")
Output (csv_merge_new.csv):
UserId,txt,label
1,This is a part of the whole message,true
2,more text,false
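If all rows of a user are already next to each other in the file (an assumption, it is not stated in the question), the merge can also be done as a stream, without keeping every user in a dictionary. A minimal sketch using itertools.groupby follows; parse, merge_sorted and SAMPLE are names invented for this example, and the parsing simply treats the first whitespace-separated token as the UserId and the last one as the label.

```python
from itertools import groupby

SAMPLE = (
    "UserId txt label\n"
    "1 This is a true\n"
    "1 part of true\n"
    "1 the whole true\n"
    "1 message true\n"
    "2 more false\n"
    "2 text false\n"
)

def parse(line):
    """Split 'UserId txt... label' on whitespace: first token is the id, last is the label."""
    tokens = line.split()
    return tokens[0], " ".join(tokens[1:-1]), tokens[-1]

def merge_sorted(lines):
    """Yield one (user_id, txt, label) per run of consecutive lines with the same UserId."""
    # Lines that are blank or do not start with a numeric id (e.g. the header) are skipped
    records = (parse(line) for line in lines
               if line.strip() and line.split()[0].isdigit())
    for user_id, group in groupby(records, key=lambda r: r[0]):
        group = list(group)  # only one user's rows are held in memory at a time
        yield user_id, " ".join(txt for _, txt, _ in group), group[0][2]

result = list(merge_sorted(SAMPLE.splitlines()))
print(result)
# [('1', 'This is a part of the whole message', 'true'), ('2', 'more text', 'false')]
```

With a real file, merge_sorted can be given the open file object directly. If the rows of a user are scattered through the file, groupby would emit that user several times, so the dictionary approach above is the safer choice in that case.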
Upvotes: 0