SATYA NUGRAHA

Reputation: 23

Universal newline mode in csv.reader makes csv.writer write unwanted line breaks to the file

When I read a CSV file in universal newline mode ("rU") with csv.reader, the line breaks (\r\n) inside the fields are carried through and written out as new lines by csv.writer. Do you know how to make csv.writer ignore these line breaks? I had to use "rU" in the reader because my files contain newline characters inside fields.

This is the code I use:

import csv

dict={}
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            dict[line[2]].append(line[3])
        except:
            dict[line[2]]=[line[3]]

with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter='|',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])

The input is like this

username, some of tweet that
want to be processed
by machine , label

Because the field contains line breaks and universal newline mode is active, the data I read keeps those line breaks, and csv.writer writes them out unchanged.

The output I want looks like this:

username, some of tweet that want to be processed by machine , label

Should I remove all the line breaks from the CSV file first? The file is large, though: around 150 MB with 700 thousand rows. Is there another approach?

I have already experimented with reader options such as skipinitialspace and dialect, but they do not solve the problem.

Upvotes: 2

Views: 3019

Answers (2)

Mark Tolonen

Reputation: 177941

I think this is the result you are looking for. You didn't mention your Python version. This is Python 3. I used your sample data uploaded to Google Drive. The file parsed as UTF-8.

Key things to note:

  • csv has a DictReader to help select columns for processing.
  • CSV files should be opened so that the csv module handles line endings itself. In Python 2 that means binary mode ('rb' or 'wb'); in Python 3 it means text mode with newline='' and an explicit encoding passed to the open call.
  • line will be a dictionary of {'header':'value'} pairs.
  • extrasaction tells DictWriter to ignore extra fields in the dictionary not listed in fieldnames.
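As a quick way to see these points in isolation, here is a minimal sketch that uses in-memory streams instead of files. The column names and the sample row are made up for illustration, and io.StringIO(..., newline='') stands in for open(..., newline=''):

```python
import csv
import io

# Hypothetical sample: the "content" field of the data row spans two lines.
raw = 'user,content,label\r\nalice,"first line\r\nsecond line",pos\r\n'

# newline='' passes line endings to the csv module untouched,
# so the quoted field is parsed as one value with an embedded \r\n.
reader = csv.DictReader(io.StringIO(raw, newline=''))
rows = list(reader)

out = io.StringIO(newline='')
writer = csv.DictWriter(out, fieldnames=['user', 'content'],
                        extrasaction='ignore')  # silently drops "label"
writer.writeheader()
for row in rows:
    # Collapse every run of whitespace (including \r\n) into one space.
    row['content'] = ' '.join(row['content'].split())
    writer.writerow(row)

result = out.getvalue()
print(repr(result))
```

Note that ' '.join(text.split()) also collapses repeated spaces, which may or may not be what you want; replace('\n', ' ') leaves them alone.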

Sample data:

twitter.place.full_name,twitter.user.location,interaction.author.username,interaction.content,interaction.created_at
"Gunungsari, Lombok Barat",Indonesia,__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje,"Mon, 16 Jun 2014 15:32:54 +0000"
"Cakranegara, Kota Mataram",NULL,__Waone,Mataram,"Mon, 24 Mar 2014 13:13:46 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"perdana, my first nephew from my lil sibling sister,,,

*moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c","Sat, 04 Jan 2014 04:20:45 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari","Sat, 04 Jan 2014 06:15:52 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?,"Sat, 04 Jan 2014 05:55:04 +0000"
"Keruak, Lombok Timur",Jakarta,_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^  *kurang_apa_lagi","Thu, 02 Jan 2014 00:02:47 +0000"
"Pujut, Lombok Tengah",Jakarta,_5at,"Doäó»a bepergian keluar rumah:

""Bismillaahitawakkaltu äó»alallooh""

*pasrah-pasrah-pasrah;
*bandara_international_lombok","Sun, 05 Jan 2014 03:36:48 +0000"
"Sakra, Lombok Timur",Jakarta,_5at,"Time for riding with my lil bro:
Mataram - Senggigi - Gili Terawangan
*jenguk_ponakan_baru;
*very_early","Fri, 03 Jan 2014 22:09:26 +0000"
"Sukamulia, Lombok Timur",,1821922,Salam friend,"Sun, 20 Jul 2014 19:23:53 +0000"

Code:

import csv

# Python 2 version of opens
#with open('training_data.csv','rb') as inp:
#    with open('training_result.csv','wb') as outp:

with open('training_data.csv','r',newline='',encoding='utf8') as inp:
    with open('training_result.csv','w',newline='',encoding='utf8') as outp:
        reader = csv.DictReader(inp)
        writer = csv.DictWriter(outp,
                                fieldnames=['interaction.author.username','interaction.content'],
                                extrasaction='ignore')
        writer.writeheader()
        for line in reader:
            line['interaction.content'] = line['interaction.content'].replace('\n',' ')
            writer.writerow(line)

Result:

interaction.author.username,interaction.content
__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje
__Waone,Mataram
_5at,"perdana, my first nephew from my lil sibling sister,,,  *moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c"
_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari"
_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?
_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^  *kurang_apa_lagi"
_5at,"Doäó»a bepergian keluar rumah:  ""Bismillaahitawakkaltu äó»alallooh""  *pasrah-pasrah-pasrah; *bandara_international_lombok"
_5at,Time for riding with my lil bro: Mataram - Senggigi - Gili Terawangan *jenguk_ponakan_baru; *very_early
1821922,Salam friend
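The code in the question also collected all tweets per user. If that grouping is still wanted, something along these lines might work. It is only a sketch: group_content is a hypothetical helper, the ' | ' separator is my own choice, and the in-memory sample stands in for the real file (with real files, pass the objects returned by open(..., newline='', encoding='utf8') instead):

```python
import csv
import io
from collections import defaultdict

def group_content(inp, outp,
                  key='interaction.author.username',
                  value='interaction.content'):
    """Write one row per key, with that key's values flattened and joined."""
    grouped = defaultdict(list)
    for row in csv.DictReader(inp):
        # Flatten embedded line breaks before storing the text.
        grouped[row[key]].append(' '.join(row[value].split()))
    writer = csv.writer(outp)
    writer.writerow([key, value])
    for name, texts in grouped.items():
        writer.writerow([name, ' | '.join(texts)])
    return grouped

# Made-up sample in the same shape as the data above.
sample = io.StringIO(
    'interaction.author.username,interaction.content\r\n'
    '_5at,"Time for riding:\r\nMataram - Senggigi"\r\n'
    '__Waone,Mataram\r\n'
    '_5at,very early\r\n',
    newline='')
out = io.StringIO(newline='')
grouped = group_content(sample, out)
print(out.getvalue())
```

This keeps every user's texts in memory, like the dictionary in the question does, so memory use grows with the size of the content column.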

Upvotes: 1

Lalith J.

Reputation: 1391

We can achieve this by replacing the newlines inside each field with ", " and starting each appended entry on a new line. If you do not want any newlines in the output, drop the "\n" prefix from the append call:

dict[line[2]].append(line[3].replace("\n", ", "))

Here is the code

import csv

dict={}
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            dict[line[2]].append("\n"+line[3].replace("\n", ", "))
        except:
            dict[line[2]]=[line[3].replace("\n", ", ")]


with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter=',',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])
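One caveat: replace("\n", ", ") only catches "\n". That is fine under 'rU', which translates line endings to "\n", but if the file is opened without newline translation a field can still contain "\r\n" or "\r". A small sketch of a more tolerant replacement, where flatten is a hypothetical helper name and not part of the csv module:

```python
def flatten(text, sep=", "):
    """Join the non-empty lines of text with sep."""
    # str.splitlines() splits on \n, \r\n and \r (and a few rarer
    # line-boundary characters), so no line-ending variant is left behind.
    return sep.join(part for part in text.splitlines() if part.strip())

print(flatten("a\r\nb\rc\n\nd"))
```

With this helper, line[3].replace("\n", ", ") in the code above could be written as flatten(line[3]). Blank lines inside a field are dropped instead of producing an empty ", , " entry.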

Upvotes: 0
