David
David

Reputation: 65

Extracting Rows of Data from a CSV-like File Using Python

I have a large file from a proprietary archive format. Unzipping this archive gives a file that has no extension, but the data inside is comma-delimited. Adding a .csv extension or simply opening the file with Excel will work.

I have about 375-400 of these files, and I'm trying to extract a chunk of rows (about 13,500 out of 1.2M+ rows) between a keyword "Point A" and another keyword "Point B".

I found some code on this site that I think is extracting the data correctly, but I'm getting an error:

AttributeError: 'list' object has no attribute 'rows' when trying to save out the file. Can somebody help me get this data to save into a csv?

import re
import csv
import time

print(time.ctime())

file = open('C:/Users/User/Desktop/File with No Extension That\'s Very Similar to CSV', 'r')
data = file.read()
x = re.findall(r'Point A(.*?)Point B', data,re.DOTALL)

name = "C:/Users/User/Desktop/testoutput.csv"
with open(name, 'w', newline='') as file2:
    savefile = csv.writer(file2)
    for i in x.rows:
        savefile.writerow([cell.value for cell in i])

print(time.ctime())

Thanks in advance, any help would be much appreciated.

Upvotes: 0

Views: 741

Answers (3)

Ashwaq
Ashwaq

Reputation: 459

Updating the code so that @Martin Evans answer works on the latest Python version.

import re
import csv
import time
import io

print(time.ctime())

input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"

with open(input_name, 'r') as f_input, open(output_name, 'wt') as f_output:
    # Read whole file in
    all_input = f_input.read()  

    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]

    # Convert into a file object and parse using the CSV reader
    fab_input = io.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)

    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)

print(time.ctime())

Also, by using 'wt' instead of 'wb' one can avoid

"TypeError: a bytes-like object is required, not 'str'"

Upvotes: 0

Martin Evans
Martin Evans

Reputation: 46779

The following should work nicely. As mentioned, your regex usage was almost correct. It is possible to still use the Python CSV library to do the CSV processing by converting the found text into a StringIO object and passing that to the CSV reader:

import re
import csv
import time
import StringIO

print(time.ctime())

input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"

with open(input_name, 'r') as f_input, open(output_name, 'wb') as f_output:
    # Read whole file in
    all_input = f_input.read()  

    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]

    # Convert into a file object and parse using the CSV reader
    fab_input = StringIO.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)

    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)

print(time.ctime())

You have not given us an example from your CSV file, so if there are problems, you might need to configure the CSV 'dialect' to process it better.

Tested using Python 2.7

Upvotes: 3

Fabio Menegazzo
Fabio Menegazzo

Reputation: 1249

You have 2 problems here: the first is related to the regular expression and the other is about the list syntax.

  1. Getting what you want

    The way you are using the regular expression will return to you a list with a single value (all lines into an unique string).

    Probably there is a better way of doing this but I would go now with something like this:

    with open('bla', 'r') as input:
        data = input.read()
        x = re.findall(r'Point A(.*?)Point B', data, re.DOTALL)[0]
        x = x.splitlines(False)[1:]
    

    That's not pretty but will return a list with all values between those two points.

  2. Working with lists

    There is no rows attribute inside lists. You just have to iterate over it:

    for i in x:
        do what you have to do
    

    See, I'm not familiar to the csv library but it looks that you will have to perform some manipulations to the i value before adding it to the library.

IMHO, I would avoid using CSV format since it is kind of "locale dependent" so it may not work as expected depending the settings your end-users may have on OS.

Upvotes: 1

Related Questions