Reputation: 15
I am currently dealing with 100,000+ lines in a csv file with a structure like this:
//f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11
1,false,false,0,"0","0","0","<color=asdf>d1, d2, d3<br>d4"d5", d6.</color>, 0, 1, false"
2,true,false,0,"0","0","0","<color=qwer>d1, "d2, d3"<br> d3<br>d4"d5" d6.</color>, 0, 1, false"
...
The top line has the structure of the file.
I have been trying to loop through the code each line, assign the corresponding value to each value (f1, f2...), and modify the texts using generators, but I am faced with the problem of splitting the code correctly.
If I simply split the lines using line.split(',')
, I will have:
['1', 'false', 'false', '0', '"0"', '"0"', '"0"', '"<color=asdf>d1', 'd2', 'd3<br>d4"d5"', 'd6.</color>, 0, 1, false"'
with list length of 14
when I want:
['1', 'false', 'false', '0', '"0"', '"0"', '"0"', '"<color=asdf>d1, d2, d3<br>d4"d5", d6.</color>, 0, 1, false"'
with list length of 11, so that the length of the list corresponds to the description line f1, f2... f11
It seems to me that commas within quotation marks ("") is causing this problem with .split(',')
method by splitting everything.
I have been trying to find a way I could merge the 'correct' items together, but to no avail. Can anyone help me with this issue?
Upvotes: 0
Views: 81
Reputation: 156
For this answer I will be assuming that the color tag corresponds to the column f8, and that the following integers and boolean correspond to the columns f9, f10, and f11 respectively. This does not match the placement of the quotation mark at the end of the boolean column, but it is the most obvious way I can think of to ensure that the example rows are indeed 14 and 11 columns long, respectively. If this was not the intended structure of the csv file, please clarify and I would be happy to edit the answer accordingly.
Instead of combining the strings yourself, you can use the parser from the standard library module csv to parse the file for you. For instance, a file named datafile.csv
could be displayed with:
import csv
with open("datafile.csv", 'r') as csvfile:
csvreader = csv.reader(csvfile)
for row in csvreader:
print(row)
That said, testing this on the rows you provided gives
['1', 'false', 'false', '0', '0', '0', '0', '<color=asdf>d1, d2, d3<br>d4d5"', ' d6.</color>', ' 0', ' 1', ' false"']
['2', 'true', 'false', '0', '0', '0', '0', '<color=qwer>d1, d2', ' d3"<br> d3<br>d4"d5" d6.</color>', ' 0', ' 1', ' false"']
which does not match the result you were looking for. This is because the color tag, which I assume was meant to be one column, contains unescaped quotation marks. Thus, the parser can't tell where the column is supposed to end. If you control the code producing the csv file it would probably be best to change it there, but if you can't change the generation of the data file you will have to solve this problem too.
If the column f8 always corresponds to a color tag, you could use this knowledge to split it off and then split the remainder of the line on the commas. For this we can use the regular expressions module. Instead of using line.split(',')
we split the line in 3 portions: the part before the color tag, the tag itself, and the part after. This can be done with
result = re.match(r"(.*),(\"\<color.*\</color\>),(.*)", line)
result
will contain a Match
object, which will have 'capture groups' corresponding to the places surrounded by parentheses. we can extract these capture groups from the object by calling the method Match.group
, with as an argument the position of the group. The first and third group can safely be split on commas, since we have already observed that they only contain simple data without commas inside, and the second group containing the color tag does not need to be split since it contains a single column. Thus, we obtain the final list with
result.group(1).split(',') + [result.group(2)] + result.group(3).split(',')
if the placement of the quotation mark is as in the provided data but you would like to change it to be close to the color tag this can be done by removing the quote from the end of the third capture group and adding it to the second. Thus, the expression above becomes
result.group(1).split(',') + [result.group(2) + '"'] + result.group(3)[:-1].split(',')
where [:-1]
is used to slice off the last character. Combining the initial regex match statement with the expression above and using them to replace line.split(',')
should yield the desired result. With this method, displaying all rows of the csv file would work like so:
import re
with open("testing.csv", 'r') as csvfile:
for line in csvfile.readlines():
print(line)
result = re.match(r"(.*),(\"\<color.*\</color\>),(.*)", line)
print(result.group(1).split(',') + [result.group(2) + '"'] + result.group(3)[:-1].split(','))
I hope that this has helped you solve your problem. If not, please feel free to ask.
Upvotes: 2