Reputation: 1396
I have an example file of the type of file I am working with.
Sometimes the delimiter is | and other times it's *.
The closest I have come to successfully reading this is:
data = pd.read_csv('filename',
                   skiprows=244, sep=r'\|',
                   header=None,
                   names=['A', 'B', 'C', 'D', 'E'])
new = data['E']
# Split the last column on whitespace into separate columns.
df = new.str.split(expand=True)
# convert_objects is deprecated; coerce each column to numeric instead.
df = df.apply(pd.to_numeric, errors='coerce')
But the problem with this is that I end up with rows of None wherever my delimiter was a *.
Is there a way to separate this file based on two different delimiters? I have thought about using a regular expression to match either | or *, but
data = pd.read_csv('filename',
skiprows=244, nrows=5, sep=r'(\| | \*)',
header=None,
names=['A','B','C','D'])
doesn't seem to work either. I am not the greatest at reg expressions though.
EDIT:
I have also tried using sep=r'\s*', but since the *'s move around in the file, some rows end up shifted and the columns no longer align.
Upvotes: 1
Views: 1071
Reputation: 78733
Those asterisks everywhere are really problematic because, as you mention, sometimes they replace the | delimiter and other times they're in the middle of a field, seemingly at random.
Luckily your file seems to be fixed width, which pandas knows how to handle with read_fwf (fixed-width file).
Let's take a little subset of this nasty data file which shows both asterisk positions:
| * | | 0.1108 0.0085 ( 1.883%) 0.1066 ( 0.504%)
| | | | 0.1112 0.0001 (20.851%) 0.1066 ( 0.504%)
| | | | 0.1116 0.0005 ( 7.536%) 0.1072 ( 0.502%)
| * | | | 0.1120 0.0059 ( 2.266%) 0.1130 ( 0.488%)
| | | | 0.1124 0.0001 (16.439%) 0.1131 ( 0.487%)
We can read this with some careful (but boring) counting of character positions:
In [9]: colspecs = [(64, 71), (76, 83), (85, 90), (97, 104), (106, 111)]
In [11]: pd.read_fwf('my_file.csv', colspecs=colspecs, header=None)
Out[11]:
0 1 2 3 4
0 0.1108 0.0085 1.883 0.1066 0.504
1 0.1112 0.0001 0.851 0.1066 0.504
2 0.1116 0.0005 7.536 0.1072 0.502
3 0.1120 0.0059 2.266 0.1130 0.488
4 0.1124 0.0001 6.439 0.1131 0.487
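As a self-contained toy version of the same idea (the column positions below are made up for this snippet, not the real file), each (start, stop) pair in colspecs is a half-open character slice applied to every line:

```python
import io

import pandas as pd

# Two toy lines with the numbers at known character offsets.
raw = (
    "| * |  0.1108  0.0085\n"
    "| | |  0.1112  0.0001\n"
)

# (7, 13) slices '0.1108', (15, 21) slices '0.0085', and so on per line.
colspecs = [(7, 13), (15, 21)]
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, header=None)
```

The same counting exercise against the real file produces the colspecs used above.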
Upvotes: 1
Reputation: 1156
with open('C:/example_file.txt') as f:
    content = [x.strip('\n') for x in f.readlines()]
# Drop the header line.
content.pop(0)
cleancontent = []
for item in content:
    # Keep only the data portion after the fixed-width prefix.
    cleancontent.append(item[62:])
cleancontent
Then turn this into the dataframe.
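That last step might look something like this (a sketch, assuming the cleaned lines are whitespace-separated numbers and inventing a couple of sample lines):

```python
import pandas as pd

# Stand-ins for the cleaned lines after slicing off the delimiter prefix.
cleancontent = [
    "0.1108 0.0085 0.1066",
    "0.1112 0.0001 0.1066",
]

# Split each line on whitespace, then coerce the strings to floats.
df = pd.DataFrame(
    [line.split() for line in cleancontent],
    columns=["A", "B", "C"],
).astype(float)
```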
Upvotes: 0
Reputation: 28683
You can go the old fashioned way, and build a list:
rows = []
for line in open('filename'):
    # Slice each field out of its fixed character range.
    var1 = float(line[62:70])
    ....
    rows.append([var1, ...])
data = pd.DataFrame(data=rows, columns=['A', 'B', 'C', 'D'])
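Filled in end to end, the loop looks like this (a sketch with made-up lines and slice offsets; the real offsets depend on the actual file layout):

```python
import pandas as pd

# Toy lines standing in for the real file.
lines = [
    "| * |  0.1108  0.0085",
    "| | |  0.1112  0.0001",
]

rows = []
for line in lines:
    # Slice the fixed character positions and convert each field to float.
    var1 = float(line[7:13])
    var2 = float(line[15:21])
    rows.append([var1, var2])

data = pd.DataFrame(data=rows, columns=["A", "B"])
```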
Upvotes: 0