Reputation: 57
I have the following data in a text file that I would like to load into Python:
    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob
The amount of whitespace between columns is not constant, and the last field (name) itself contains spaces, so I am having trouble parsing it. I tried the following:
pd.read_csv("test.csv",sep = "\s+", header=0, index_col=0)
But it gives an error:
CParserError: Error tokenizing data. C error: Expected 7 fields in line 5, saw 8
Upvotes: 3
Views: 785
Reputation: 49812
You can use pandas.read_fwf (fwf stands for fixed-width format) to do this:
Code:
df = pd.read_fwf(StringIO(data), header=1, index_col=0)
Test code:
from io import StringIO
import pandas as pd
data = u"""
    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob"""
df = pd.read_fwf(StringIO(data), header=1, index_col=0)
print(df)
Results:
    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob
Upvotes: 2
Reputation: 294498
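'\s+' matches one or more whitespace characters, so it also splits the final column on the single spaces inside each name. Instead, use a regex that requires two or more whitespace characters. This relies on the columns in the file being separated by at least two spaces: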
pd.read_csv("test.csv", sep=r"\s{2,}", header=0, index_col=0, engine='python')
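To make the difference between the two separators concrete, here is a small sketch that splits one sample row with both patterns using only the standard library. The exact spacing in `line` is an assumption (two or more spaces between columns, single spaces inside the name); the variable names are illustrative.

```python
import re

# One row of the sample data. The spacing is assumed: at least two spaces
# between columns, single spaces inside the name.
line = "4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"

# One or more whitespace characters: every word of the name becomes a field.
fields_any = re.split(r"\s+", line)

# Two or more whitespace characters: single spaces in the name are kept.
fields_wide = re.split(r"\s{2,}", line)

print(len(fields_any))   # number of fields depends on how long the name is
print(fields_wide)
```

With '\s+' the field count varies from row to row depending on the name, which is consistent with the "Expected 7 fields ... saw 8" tokenizing error in the question. With '\s{2,}' every row yields the same four fields.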
Entire Working Example
from io import StringIO
import pandas as pd
txt = """    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob
"""
pd.read_csv(StringIO(txt), sep=r"\s{2,}", header=0, index_col=0, engine='python')
    pclass  survived                                             name
0        1         1                    Allen, Miss. Elisabeth Walton
1        1         1                   Allison, Master. Hudson Trevor
2        1         0                     Allison, Miss. Helen Loraine
3        1         0             Allison, Mr. Hudson Joshua Creighton
4        1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1         1                              Anderson, Mr. Harry
6        1         1                Andrews, Miss. Kornelia Theodosia
7        1         0                           Andrews, Mr. Thomas Jr
8        1         1    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1         0                          Artagaveytia, Mr. Ramon
10       1         0                           Astor, Col. John Jacob
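If the file turns out to have only single spaces between columns, the two-or-more-spaces regex will not separate them. A possible fallback, sketched here under the assumption that every data row starts with exactly three whitespace-separated numeric fields (index, pclass, survived) followed by the name, is to split each line at most three times. The helper `parse_rows` and the shortened `sample` text are hypothetical, not part of pandas or the original post.

```python
def parse_rows(text):
    """Parse 'index pclass survived name...' lines into a list of dicts.

    Assumes the first non-blank line is a header and that every other
    line has three leading integer fields followed by the name.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows = []
    for ln in lines[1:]:  # skip the header line
        # maxsplit=3 leaves everything after the third field untouched.
        idx, pclass, survived, name = ln.split(maxsplit=3)
        rows.append({
            "index": int(idx),
            "pclass": int(pclass),
            "survived": int(survived),
            "name": name.strip(),
        })
    return rows


# Shortened, single-space version of the sample data (illustrative only).
sample = """pclass survived name
0 1 1 Allen, Miss. Elisabeth Walton
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
10 1 0 Astor, Col. John Jacob
"""

rows = parse_rows(sample)
print(rows[1]["name"])
```

The resulting list can then be handed to pandas, for example with pd.DataFrame.from_records(rows, index="index").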
Upvotes: 3