Cheyenne

Reputation: 23

How to combine certain fields of different rows of csv file into one row

For my research I have a CSV file in which each row stores a UserId, a message (txt), and a label (about the user), as follows:

     UserId   txt         label
     1        This is a   true
     1        part of     true
     1        the whole   true
     1        message     true
     2        more        false
     2        text        false

What I want to achieve is to combine the entries of every user (for example, two of them) into one row. For the sample above, I would want the following output in a CSV file:

     UserId   txt                label
     1        This is a part of  true
     2        more text          false

I don't know how to achieve this efficiently (with Python?), because the file contains 3 million rows for 20 thousand users. I would like to end up with a file that has only 20 thousand rows.

Upvotes: 1

Views: 76

Answers (4)

Ramin Melikov

Reputation: 1005

Here is an SQLite solution, which should be very fast.

import pandas as pd
import sqlite3 as db

path = 'path/to/some.csv'
df = pd.read_csv(path)

# Load the rows into an SQLite table, replacing any previous copy.
conn = db.connect('my_solution.db')
df.to_sql('table_from_df', conn, if_exists='replace', index=False)

# Concatenate every txt value per user, separated by a space.
sql_query = '''
    select
        userid,
        group_concat(txt, ' ') as txt
    from table_from_df
    group by 1
    order by 1
'''

out_df = pd.read_sql_query(sql_query, conn)
conn.close()

print(out_df)
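
If loading the whole file into a DataFrame first is a concern, the same group_concat query can probably be fed straight from the csv module. This is only a minimal sketch: it assumes a comma-separated file with a UserId,txt,label header and a label that is constant per user, and the table name messages, the in-memory database, and the inline sample are made up for illustration. Note that the SQLite documentation describes the order of elements in group_concat as arbitrary; in practice it tends to follow insertion order, but that is worth verifying on real data.

```python
import csv
import io
import sqlite3

# Inline sample standing in for the real file (assumed comma-separated).
sample = io.StringIO(
    "UserId,txt,label\n"
    "1,This is a,true\n"
    "1,part of,true\n"
    "1,the whole,true\n"
    "1,message,true\n"
    "2,more,false\n"
    "2,text,false\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row

conn = sqlite3.connect(":memory:")  # hypothetical: use a file path for large data
conn.execute("create table messages (userid integer, txt text, label text)")

# executemany consumes the reader lazily, one CSV row per insert.
conn.executemany("insert into messages values (?, ?, ?)", reader)

# min(label) just picks the label, assuming it never varies within a user.
query = """
    select userid, group_concat(txt, ' ') as txt, min(label) as label
    from messages
    group by userid
    order by userid
"""
combined = conn.execute(query).fetchall()
conn.close()

for row in combined:
    print(row)
```

For a real file, replace the StringIO with open(path, newline="") and write combined out with csv.writer.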

Upvotes: 0

JvdV

Reputation: 75850

Here is a method using pandas, groupby in combination with join:

import pandas as pd
df = pd.read_csv(r'C:\YourDir\YourFile.csv',sep=',')
df = df.groupby(['UserId','label'])['txt'].apply(' '.join).reset_index()
print(df)

Result:

   UserId  label                                  txt
0       1   True  This is a part of the whole message
1       2  False                            more text

Note: Use the appropriate separator for the sep parameter. I have used a comma.


You can write this back (overwrite) to csv like:

df.to_csv(r'C:\YourDir\YourFile.csv', sep=',', index=False)
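
The question's sample output joins only the first two entries per user, while the groupby above joins all of them. If a limit is wanted, one option is to keep the first n rows per user with groupby(...).head(n) before joining. A minimal sketch, assuming the same UserId, txt, and label columns; the helper name combine_first_n and the inline sample are illustrative, not part of the original answer.

```python
import io

import pandas as pd


def combine_first_n(df, n=2):
    """Join the txt of the first n rows of each user into a single row."""
    # head(n) keeps the first n rows of every group in their original order.
    first_n = df.groupby("UserId", sort=False).head(n)
    joined = (
        first_n.groupby(["UserId", "label"], sort=False)["txt"]
        .agg(" ".join)
        .reset_index()
    )
    # Restore the column order used in the question.
    return joined[["UserId", "txt", "label"]]


# Inline sample standing in for the real file (assumed comma-separated).
sample = io.StringIO(
    "UserId,txt,label\n"
    "1,This is a,true\n"
    "1,part of,true\n"
    "1,the whole,true\n"
    "1,message,true\n"
    "2,more,false\n"
    "2,text,false\n"
)

df = pd.read_csv(sample)
result = combine_first_n(df, n=2)
print(result)
```

For the sample, the txt column ends up as "This is a part of" for user 1 and "more text" for user 2, which matches the output asked for in the question.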

Upvotes: 1

abhiarora

Reputation: 10430


Try this (assuming the file is delimited by ",", i.e., it is a CSV):

di = {}
with open("file.txt", "r") as fi:
    fi.readline()
    for line in fi:
        l = [' '.join(i.split()) for i in line.split(',')]
        if l[0] in di:
            di[l[0]][0] += " " + l[1]
        else:
            di[l[0]] = [l[1], l[2]]
print(di)

with open("out.txt", "w") as fi:
    fi.write("UserId, txt, label\n")
    for k,v in di.items():
        fi.write("{},{},{}\n".format(k,v[0],v[1]))

Outputs:

{'1': ['This is a part of the whole message', 'true'], '2': ['more text', 'false']}

File: out.txt

UserId, txt, label
1,This is a part of the whole message,true
2,more text,false

File: file.txt

UserId, txt, label
1, This is a, true
1, part of, true
1, the whole, true
1, message, true
2, more, false
2, text, false
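
One caveat with splitting on "," by hand: a message that itself contains a comma would be cut into extra fields. A minimal sketch of the same dictionary approach using the standard csv module, which handles quoted fields on both reading and writing. It assumes a UserId,txt,label header and that the label does not change within a user; the helper name combine_rows and the inline sample are illustrative.

```python
import csv
import io


def combine_rows(reader):
    """Merge the txt fields of rows sharing a UserId, keeping the first label."""
    combined = {}
    for row in reader:
        user = row["UserId"]
        if user in combined:
            combined[user]["txt"] += " " + row["txt"]
        else:
            combined[user] = {"UserId": user, "txt": row["txt"], "label": row["label"]}
    # Dicts preserve insertion order, so users come out in first-seen order.
    return list(combined.values())


# Inline sample standing in for file.txt; the first message contains a comma.
source = io.StringIO(
    "UserId,txt,label\n"
    '1,"This is, a",true\n'
    "1,part of,true\n"
    "2,more,false\n"
    "2,text,false\n"
)

# skipinitialspace tolerates headers and values written as "a, b, c".
reader = csv.DictReader(source, skipinitialspace=True)
rows = combine_rows(reader)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["UserId", "txt", "label"])
writer.writeheader()
writer.writerows(rows)  # fields containing commas are quoted automatically
print(out.getvalue())
```

With real files, open the input and output with newline="" as the csv documentation recommends, and pass those file objects in place of the StringIO buffers.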

Upvotes: 0

Pedro Lobito

Reputation: 98921

Your file doesn't seem to be comma-separated. If that is the case, the following can help:

import re
user_dict = {}
with open("csv_merge.csv") as f:
    for l in f:
        for m in re.finditer(r"^(\d+)\s*(.*?)\s*(true|false)\s*$", l, re.IGNORECASE):
            user, txt, label = m.group(1), m.group(2), m.group(3)
            if user not in user_dict:
                user_dict[user] = {"txt": txt, "label": label}
            else:
                user_dict[user]["txt"] += " "+txt
                # as far as I could understand, label doesn't change

with open("csv_merge_new.csv", "w") as f:
    f.write("UserId,txt,label\n") # comma separated
    for k, v in user_dict.items():
        f.write(f"{k},{v['txt']},{v['label']}\n")

Output (csv_merge_new.csv):

UserId,txt,label
1,This is a part of the whole message,true
2,more text,false


Upvotes: 0
