Reputation: 65
I have a large file from a proprietary archive format. Unzipping this archive gives a file that has no extension, but the data inside is comma-delimited. Adding a .csv
extension or simply opening the file with Excel will work.
I have about 375-400 of these files, and I'm trying to extract a chunk of rows (about 13,500 out of 1.2M+ rows) between a keyword "Point A" and another keyword "Point B".
I found some code on this site that I think is extracting the data correctly, but I'm getting an error:
AttributeError: 'list' object has no attribute 'rows'
when trying to save out the file. Can somebody help me get this data to save into a csv?
import re
import csv
import time
print(time.ctime())
file = open('C:/Users/User/Desktop/File with No Extension That\'s Very Similar to CSV', 'r')
data = file.read()
x = re.findall(r'Point A(.*?)Point B', data,re.DOTALL)
name = "C:/Users/User/Desktop/testoutput.csv"
with open(name, 'w', newline='') as file2:
savefile = csv.writer(file2)
for i in x.rows:
savefile.writerow([cell.value for cell in i])
print(time.ctime())
Thanks in advance, any help would be much appreciated.
Upvotes: 0
Views: 741
Reputation: 459
Updating the code so that @Martin Evans answer works on the latest Python version.
import re
import csv
import time
import io
print(time.ctime())
input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"
with open(input_name, 'r') as f_input, open(output_name, 'wt') as f_output:
# Read whole file in
all_input = f_input.read()
# Extract interesting lines
ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]
# Convert into a file object and parse using the CSV reader
fab_input = io.StringIO(ab_input)
csv_input = csv.reader(fab_input)
csv_output = csv.writer(f_output)
# Iterate a row at a time from the input
for input_row in csv_input:
# Skip any empty rows
if input_row:
# Write row at a time to the output
csv_output.writerow(input_row)
print(time.ctime())
Also, by using 'wt'
instead of 'wb'
one can avoid
"TypeError: a bytes-like object is required, not 'str'"
Upvotes: 0
Reputation: 46779
The following should work nicely. As mentioned, your regex usage was almost correct. It is possible to still use the Python CSV library to do the CSV processing by converting the found text into a StringIO object and passing that to the CSV reader:
import re
import csv
import time
import StringIO
print(time.ctime())
input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"
with open(input_name, 'r') as f_input, open(output_name, 'wb') as f_output:
# Read whole file in
all_input = f_input.read()
# Extract interesting lines
ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]
# Convert into a file object and parse using the CSV reader
fab_input = StringIO.StringIO(ab_input)
csv_input = csv.reader(fab_input)
csv_output = csv.writer(f_output)
# Iterate a row at a time from the input
for input_row in csv_input:
# Skip any empty rows
if input_row:
# Write row at a time to the output
csv_output.writerow(input_row)
print(time.ctime())
You have not given us an example from your CSV file, so if there are problems, you might need to configure the CSV 'dialect' to process it better.
Tested using Python 2.7
Upvotes: 3
Reputation: 1249
You have 2 problems here: the first is related to the regular expression and the other is about the list syntax.
Getting what you want
The way you are using the regular expression will return to you a list with a single value (all lines into an unique string).
Probably there is a better way of doing this but I would go now with something like this:
with open('bla', 'r') as input:
data = input.read()
x = re.findall(r'Point A(.*?)Point B', data, re.DOTALL)[0]
x = x.splitlines(False)[1:]
That's not pretty but will return a list with all values between those two points.
Working with lists
There is no rows
attribute inside lists. You just have to iterate over it:
for i in x:
do what you have to do
See, I'm not familiar to the csv
library but it looks that you will have to perform some manipulations to the i
value before adding it to the library.
IMHO, I would avoid using CSV format since it is kind of "locale dependent" so it may not work as expected depending the settings your end-users may have on OS.
Upvotes: 1