Reputation: 3
I am having a problem parsing HTTP requests. My data is in a .txt file at this link:
https://drive.google.com/open?id=1RSyCYgxBCJnxAXDInyIs1cOp_3EoUyqG
I am trying to convert this data to CSV format, but special characters such as ';' split the data into new columns.
Example: the data in the "Accept" column should be: text/xml;q=0.6, application/rtf;q=0.7, image/*
but when I run the code, this column contains only text/xml, and q=0.6 spills into a new column.
One solution I found was to convert the single-quoted strings to double-quoted strings before storing them, but this didn't work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import urllib.parse
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import io
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
import os
import json
import csv
from itertools import islice
import numpy as np
import pandas as pd
fields = ['Start - Id', 'class', 'Method', 'Url', 'Protocol', 'Content- Length','Content-Language','Content-Encoding','Content-Location','Content-MD5','Content-Type','Expires','Last-Modified', 'Host', 'Connection', 'Accept', 'Accept-Charset', 'Accept-Encoding', 'Accept-Language', 'Cache-Control','Client-ip', 'Cookie', 'Cookie2', 'Date', 'ETag', 'Expect', 'From', 'If-Modified-Since', 'If-Unmodified-Since', 'If-Match', 'If-None-Match', 'If-Range','Max-Forwards', 'MIME-Version', 'Pragma', 'Proxy-Authorization', 'Authorization', 'Range', 'Referer', 'TE', 'Trailer', 'User-Agent', 'UA-CPU', 'UA-Disp', 'UA-OS', 'UA-Color', 'UA-Pixels', 'Via', 'Transfer-Encoding', 'Upgrade', 'Warning', 'X-Forwarded-For', 'X-Serial-Number', '~~~~~','----']
listofKeys = dict.fromkeys(fields)
def init(file_out):
    with open(file_out, 'w', encoding="utf-8") as csvfile:
        csvwriter = csv.writer(csvfile, delimiter="\t")
        csvwriter.writerow(fields)
def write(file_out, lines):
    with open(file_out, 'a', encoding="utf-8") as csvfile:
        csvwriter = csv.writer(csvfile, delimiter="\t")
        row = []
        N = len(lines)
        foundP = False
        for i in range(N-1):
            line = lines[i].strip()
            if len(line) > 0:
                if i == 2:
                    # Request line: method, URL and protocol separated by spaces
                    listofKeys['Method'] = line.split(" ")[0]
                    listofKeys['Url'] = line.split(" ")[1]
                    listofKeys['Protocol'] = line.split(" ")[2]
                    if line.startswith("PUT") or line.startswith("POST"):
                        foundP = True
                elif i == N-3:
                    # For PUT/POST requests, append the body line to the URL
                    if foundP:
                        listofKeys['Url'] += line
                else:
                    # Header line: split on the first ':' into key and value
                    index = line.find(':')
                    key = line[0:index].strip()
                    value = line[index+1:].strip()
                    listofKeys[key] = str(value)
        for keys in fields:
            row.append(listofKeys[keys])
        print(type(row))
        print(row)
        csvwriter.writerow(row)
def convertText2Csv(file_in, file_out):
    init(file_out)
    with open(file_in, 'r') as infile:
        lines = []
        count = 0
        for line in infile:
            if line.startswith("Start"):
                count += 1
                print("-------------------------------------------------------------------Request #",count," -------------------------------------------------------------------------")
                lines.append(line)
            elif line.startswith("End"):
                lines.append(line)
                write(file_out, lines)
                lines = []
            else:
                lines.append(line)

csvFile = 'test.csv'
textFile = 'test.txt'
convertText2Csv(textFile, csvFile)
The result I am getting is at this link: https://drive.google.com/open?id=1rLPdbuZkS6pcDQqHZZP6ck9H8XbnMPWM
I just want to convert the data into a CSV file in which each column contains its own value, including any special characters.
Upvotes: 0
Views: 298
Reputation: 149155
Your csv file is perfectly correct.
Here is the content of the Accept column when the file is loaded in LibreOffice Calc with "\t" specified as the only delimiter:
Accept
*/*
*/*
*/*
text/xml;q=0.6, application/rtf;q=0.7, image/*
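The same check can be done without any spreadsheet program by reading the data back with Python's csv module and the same tab delimiter. This is only a minimal sketch: the two-column sample row is a hypothetical stand-in for the real test.csv, and an in-memory buffer replaces the file.

```python
import csv
import io

# Hypothetical sample; the real file has many more columns.
header = ["Method", "Accept"]
row = ["GET", "text/xml;q=0.6, application/rtf;q=0.7, image/*"]

# Write with a tab delimiter, as the code in the question does.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(header)
writer.writerow(row)

# Read back with the same delimiter: ';' is ordinary data here.
buffer.seek(0)
records = list(csv.reader(buffer, delimiter="\t"))
accept_value = records[1][1]
print(accept_value)
```

If the printed value still contains the ;q=0.6 parts, the file itself is fine and only the viewer is splitting the field.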
Your real problem is that the program you use to open the csv file is too clever (err.. stupid in fact!): it assumes that users are too stupid to know what a delimiter is and tries to guess it. And here it guessed wrong, assuming that the ; was a delimiter too.
Long story short: you are just trying to display a correct csv file with a stupid worksheet program (could it be Excel?). Excel is a very nice tool, except when it comes to csv files, where it is just shit.
As was suggested in the comments, the quoting=csv.QUOTE_ALL option, which should be useless here, may be enough to tell that piece of crap that it should ignore possible delimiters inside fields...
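For reference, here is roughly what that option looks like when passed to csv.writer. It is a sketch with a hypothetical two-field row, not the full script from the question, and whether the quotes are enough for a given spreadsheet program is something you would have to try.

```python
import csv
import io

# Hypothetical two-field row standing in for one parsed request.
row = ["GET", "text/xml;q=0.6, application/rtf;q=0.7, image/*"]

# QUOTE_ALL wraps every field in double quotes, whether it needs it or not.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quoting=csv.QUOTE_ALL)
writer.writerow(row)
quoted_line = buffer.getvalue()
print(quoted_line)

# Reading it back with the same delimiter strips the quotes again.
parsed = next(csv.reader(io.StringIO(quoted_line), delimiter="\t"))
```

In the question's code this would mean adding quoting=csv.QUOTE_ALL to the csv.writer calls in both init and write, so that the header row and the data rows are quoted consistently.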
Upvotes: 2