Reputation: 23
When I read a CSV file with csv.reader in universal newline mode ("rU"), the line breaks inside fields come out as \r\n new lines when I write the data with csv.writer. Do you know how to ignore these new lines in csv.writer? I had to use "rU" in the reader because my files contain new-line characters.
This is the code I use:
import csv
dict={}
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            dict[line[2]].append(line[3])
        except:
            dict[line[2]]=[line[3]]
with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter='|',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])
The input looks like this:
username, some of tweet that
want to be processed
by machine , label
Because the field contains line breaks and universal newline mode is on, the data I read and then write with csv.writer keeps the same line breaks.
The output I want looks like this:
username, some of tweet that want to be processed by machine , label
Should I remove all the line breaks in the CSV file first? It is large: around 150 MB and 700 thousand rows. Is there another approach?
I have already played with reader options such as skipinitialspace and dialect, but still cannot solve the problem.
Upvotes: 2
Views: 3019
Reputation: 177941
I think this is the result you are looking for. You didn't mention your Python version. This is Python 3. I used your sample data uploaded to Google Drive. The file parsed as UTF-8.
Key things to note:
- csv has a DictReader to help select columns for processing.
- In Python 2, CSV files should be opened in binary mode, 'rb' or 'wb', but in Python 3 it means passing 'r' (or 'w'), newline='' and an encoding to the open call.
- line will be a dictionary of {'header':'value'} pairs.
- extrasaction tells DictWriter to ignore extra fields in the dictionary not listed in fieldnames.
Sample data:
twitter.place.full_name,twitter.user.location,interaction.author.username,interaction.content,interaction.created_at
"Gunungsari, Lombok Barat",Indonesia,__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje,"Mon, 16 Jun 2014 15:32:54 +0000"
"Cakranegara, Kota Mataram",NULL,__Waone,Mataram,"Mon, 24 Mar 2014 13:13:46 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"perdana, my first nephew from my lil sibling sister,,,
*moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c","Sat, 04 Jan 2014 04:20:45 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari","Sat, 04 Jan 2014 06:15:52 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?,"Sat, 04 Jan 2014 05:55:04 +0000"
"Keruak, Lombok Timur",Jakarta,_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^ *kurang_apa_lagi","Thu, 02 Jan 2014 00:02:47 +0000"
"Pujut, Lombok Tengah",Jakarta,_5at,"Doäó»a bepergian keluar rumah:
""Bismillaahitawakkaltu äó»alallooh""
*pasrah-pasrah-pasrah;
*bandara_international_lombok","Sun, 05 Jan 2014 03:36:48 +0000"
"Sakra, Lombok Timur",Jakarta,_5at,"Time for riding with my lil bro:
Mataram - Senggigi - Gili Terawangan
*jenguk_ponakan_baru;
*very_early","Fri, 03 Jan 2014 22:09:26 +0000"
"Sukamulia, Lombok Timur",,1821922,Salam friend,"Sun, 20 Jul 2014 19:23:53 +0000"
Code:
import csv
# Python 2 version of opens
#with open('training_data.csv','rb') as inp:
#    with open('training_result.csv','wb') as outp:
with open('training_data.csv','r',newline='',encoding='utf8') as inp:
    with open('training_result.csv','w',newline='',encoding='utf8') as outp:
        reader = csv.DictReader(inp)
        writer = csv.DictWriter(outp,
                                fieldnames=['interaction.author.username','interaction.content'],
                                extrasaction='ignore')
        writer.writeheader()
        for line in reader:
            line['interaction.content'] = line['interaction.content'].replace('\n',' ')
            writer.writerow(line)
Result:
interaction.author.username,interaction.content
__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje
__Waone,Mataram
_5at,"perdana, my first nephew from my lil sibling sister,,, *moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c"
_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari"
_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?
_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^ *kurang_apa_lagi"
_5at,"Doäó»a bepergian keluar rumah: ""Bismillaahitawakkaltu äó»alallooh"" *pasrah-pasrah-pasrah; *bandara_international_lombok"
_5at,Time for riding with my lil bro: Mataram - Senggigi - Gili Terawangan *jenguk_ponakan_baru; *very_early
1821922,Salam friend
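As a small self-contained check of the same idea, the sketch below runs a reader/writer pair on an in-memory string instead of a file. The helper names flatten_newlines and copy_flattened and the two-row sample are made up for illustration, and it assumes the multi-line fields are quoted, as in the sample data above. Rows are handled one at a time, so a large file never has to be held in memory.

```python
import csv
import io


def flatten_newlines(text):
    # Replace \r\n first so a Windows line break becomes one space, not two.
    return text.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")


def copy_flattened(inp, outp, column):
    """Stream rows from inp to outp, flattening line breaks in one column."""
    reader = csv.DictReader(inp)
    writer = csv.DictWriter(outp, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = flatten_newlines(row[column])
        writer.writerow(row)


# Hypothetical sample: one field with a \n break, one with a \r\n break.
sample = 'user,content\nalice,"first line\nsecond line"\nbob,"one\r\ntwo"\n'
source = io.StringIO(sample, newline="")  # newline="" keeps breaks untranslated
target = io.StringIO(newline="")
copy_flattened(source, target, "content")
result = target.getvalue()
print(result)
```

Handling \r\n and \r as well as \n matters when the file is opened with newline='', because the line breaks inside quoted fields then reach the code exactly as they are stored in the file.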
Upvotes: 1
Reputation: 1391
We can achieve this by replacing the new lines inside each field with ", " and adding a new line in front of each appended value. If you do not want any new lines in the output, you can remove the "\n" prefix from the append call:
dict[line[2]].append(line[3].replace("\n", ", "))
Here is the code
import csv
dict={}
# Note: the 'rU' mode was removed in Python 3.11; it works on older versions.
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            # Username already seen: start this value on a new line.
            dict[line[2]].append("\n"+line[3].replace("\n", ", "))
        except KeyError:
            # First value for this username.
            dict[line[2]]=[line[3].replace("\n", ", ")]
with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter=',',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])
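The try/except grouping above can also be written with collections.defaultdict. The sketch below is one possible variant, not the code from this answer: the function name group_by_user, the in-memory sample rows, and the '|' output delimiter (taken from the question) are assumptions for illustration, and the column positions 2 and 3 mirror line[2] and line[3] above.

```python
import csv
import io
from collections import defaultdict


def group_by_user(rows, key_index=2, text_index=3):
    """Map each username to a list of its tweets with line breaks flattened."""
    grouped = defaultdict(list)
    for row in rows:
        text = row[text_index].replace("\r\n", " ").replace("\n", " ")
        grouped[row[key_index]].append(text)
    return grouped


# Hypothetical rows: username in column 2, tweet text in column 3.
sample = (
    'place,location,alice,"hello\nworld",date\n'
    'place,location,bob,single line,date\n'
    'place,location,alice,again,date\n'
)
grouped = group_by_user(csv.reader(io.StringIO(sample, newline="")))

out = io.StringIO(newline="")
writer = csv.writer(out, delimiter="|", lineterminator="\n")
for user, tweets in grouped.items():
    # One output row per username, tweets joined with commas.
    writer.writerow([user, ",".join(tweets)])
print(out.getvalue())
```

Using defaultdict(list) avoids the exception path and also avoids naming a variable dict, which hides the built-in type.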
Upvotes: 0