Reputation: 8377
I want to import into Python a text file extracted from some database. It is a flat text format, without end of line separators (but I know there are supposed to be a fixed number of columns).
Each new line is identified with an incremented id ("0001", "0002", "0003" in the example below).
I tried different methods and eventually settled on this one:
with open('url.txt', "rb") as f:
    df = f.read().decode(errors="replace")
But this gives me one gigantic string. I then tried a regex to split on the id in a loop and then sub-split on ",", but the problem is that missing data is sometimes coded as \N without quotes, so it never returns the same number of columns per row. Sample of data:
"0001","2015-01-01","doc","eab4e80fec7352a7","https://www.paypal.com/us","setRequestHeader(\"Content-Type\")","0002","2015-01-02","doc","0",\N,\N,"0003",etc.
The expected output is a pandas dataframe with columns: id, date, doctype, hash, url, code. Any idea how I can do that?
Upvotes: 0
Views: 1079
Reputation: 1334
To get your dataframe, you can do something like this:
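import pandas as pd

with open('testfloat', "rb") as f:
    raw = f.read().decode(errors="replace").strip()  # strip a possible trailing newline
raw = raw.replace('\\N', '""')  # Replace unquoted \N (missing data) by empty quoted strings
raw = raw[1:-1]  # Remove the first and last "
values = raw.split('","')  # Split into individual values
rows = [values[i:i+6] for i in range(0, len(values), 6)]  # Group the values into rows of 6 columns
df = pd.DataFrame(rows, columns=["id", "date", "doctype", "hash", "url", "code"])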
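Note that splitting on '","' assumes no value ever contains that exact sequence, and the \N replacement also touches any \N that happens to sit inside a quoted value. If that is a concern, a regex tokenizer might be more robust. The following is only a sketch under some assumptions: every field is either a double-quoted string with backslash escapes or a bare \N, and there are always 6 columns. The names `FIELD`, `COLUMNS` and `parse_flat_export` are made up for this example.

```python
import re

COLUMNS = ["id", "date", "doctype", "hash", "url", "code"]

# A field is either a double-quoted string (backslash escapes allowed) or a bare \N.
FIELD = re.compile(r'"((?:[^"\\]|\\.)*)"|(\\N)')

def parse_flat_export(text, n_columns=len(COLUMNS)):
    """Split a flat, comma-separated export into rows of n_columns fields."""
    fields = []
    for match in FIELD.finditer(text):
        if match.group(2) is not None:
            fields.append(None)  # bare \N marks missing data
        else:
            # Assumption: a backslash simply escapes the next character (\" -> ").
            fields.append(re.sub(r'\\(.)', r'\1', match.group(1)))
    # Group the flat list of fields into rows of n_columns values each.
    return [fields[i:i + n_columns] for i in range(0, len(fields), n_columns)]

sample = (
    r'"0001","2015-01-01","doc","eab4e80fec7352a7","https://www.paypal.com/us",'
    r'"setRequestHeader(\"Content-Type\")","0002","2015-01-02","doc","0",\N,\N'
)
rows = parse_flat_export(sample)
print(rows[0][5])  # setRequestHeader("Content-Type")
print(rows[1])     # ['0002', '2015-01-02', 'doc', '0', None, None]
```

From there, pd.DataFrame(rows, columns=COLUMNS) should give the dataframe with the six named columns, with missing values as None instead of empty strings.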
Upvotes: 2