python: import flat text file without delimiters

Question

I want to import into Python a text file extracted from some database. It is a flat text format, without end of line separators (but I know there are supposed to be a fixed number of columns). Each new line is identified with an incremented id ("0001", "0002", "0003" in the example below).

I tried different methods, eventually this one:

with open('url.txt', "rb") as f:
    df = f.read().decode(errors="replace")

But this gives me one gigantic string. I then tried a regex to split on the id in a loop and then sub-split on ",". The problem is that missing data is sometimes coded as \N without quotes, so the split never returns the same number of columns per row. Sample of data:

"0001","2015-01-01","doc","eab4e80fec7352a7","https://www.paypal.com/us","setRequestHeader(\"Content-Type\")","0002","2015-01-02","doc","0",\N,\N,"0003",etc.

The expected output is a pandas dataframe with the columns id, date, doctype, hash, url, code. Any idea how I can do that?

Axel Puig · Accepted Answer

To get your dataframe, you can do something like this:

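import pandas as pd

with open('url.txt', "rb") as f:
    text = f.read().decode(errors="replace")
# Replace the bare \N marker with an empty quoted field.
# The raw string is needed: a plain '\N' is a syntax error in Python 3.
text = text.replace(r'\N', '""')
text = text.strip()[1:-1]  # drop surrounding whitespace, then the first and last "
values = text.split('","')  # split into individual values

# Group the flat list of values into rows of 6 columns
rows = [values[i:i+6] for i in range(0, len(values), 6)]

df = pd.DataFrame(rows, columns=["id", "date", "doctype", "hash", "url", "code"])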
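
Note that splitting on '","' relies on that three-character sequence never appearing inside a value. As a possible alternative, here is a minimal sketch that tokenizes the string with a regular expression instead, assuming every value is either a double-quoted field (with backslash escapes such as \" inside) or the bare marker \N. The names COLUMNS, TOKEN and parse_rows are made up for this illustration, and anything that matches neither pattern is silently skipped.

```python
import re

# Column names taken from the question; adjust as needed.
COLUMNS = ["id", "date", "doctype", "hash", "url", "code"]

# A token is either a double-quoted field (backslash escapes allowed inside)
# or the bare marker \N that stands for missing data.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|\\N', re.DOTALL)
ESCAPE = re.compile(r'\\(.)', re.DOTALL)


def parse_rows(text, n_cols=len(COLUMNS)):
    """Split a flat string of fields into rows of n_cols values each."""
    values = []
    for match in TOKEN.finditer(text):
        field = match.group(1)
        if field is None:
            # The \N branch matched: record a missing value.
            values.append(None)
        else:
            # Turn escaped characters such as \" back into plain characters.
            values.append(ESCAPE.sub(r'\1', field))
    # Group the flat list of values into rows of n_cols.
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]
```

The resulting list of rows can then be passed to pd.DataFrame(rows, columns=COLUMNS). Missing values come out as None rather than empty strings, which pandas treats as missing data.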
