Reputation: 25
I have to extract a dollar amount from one column of a csv. The dollar amount is surrounded by zeros.
1. 0000000000565400.60000000008289.116000
2. 0000000466175218.82000043978794.228000
3. 0000000000005720.77000000000198.431000
From these three lines, I'd be looking to pull 565400.60, 466175218.82, 5720.77. I don't want/care about the 8289.116, 43978794.228, 198.431.
The dtype is a non-null object. I've tried converting it to a string and extracting the first 20 characters. I've also tried replacing several zeros with nothing. Obviously I can't get rid of all the zeros because of numbers like 565400.60, where I need the zeros. Even if I could get it converted to a dtype I wanted, I would have no idea how to pull the differing amounts of zeros.
dollars = str(dollars)
dollars = dollars.str.replace('0000000000','')
grab = dollars['Dollars'].astype(str).str[0:20]
Expected:
Dollars
------------
1. 565400.60
2. 466175218.82
3. 5720.77
Some of my errors:
TypeError: string indices must be integers
AttributeError: 'str' object has no attribute 'str'
Upvotes: 2
Views: 225
Reputation: 134
You could use a regex with a pattern something like ^0+(\d+\.\d{2})0+ (assuming there are always two digits after the decimal point). You can test the pattern on https://regex101.com/. Here's the code:
import re
string = ['0000000000565400.60000000008289.116000',
          '0000000466175218.82000043978794.228000',
          '0000000000005720.77000000000198.431000']
# leading zeros, then the amount with two decimals (the dot is escaped so it only matches a literal '.')
pattern = r'0+(?P<first_number>\d+\.\d{2})0+'
for line in string:
    matches = re.search(pattern, str(line))
    print(matches['first_number'])
# 565400.60
# 466175218.82
# 5720.77
Also, if you're working with a block of text:
string = '''0000000000565400.60000000008289.116000
0000000466175218.82000043978794.228000
0000000000005720.77000000000198.431000
'''
pattern = r'0+(\d+\.\d{2})0+'
matches = re.findall(pattern, string)
print(matches)
# ['565400.60', '466175218.82', '5720.77']
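If each value is handled on its own rather than as one block of text, the same idea can be anchored to the start of the string so it cannot latch onto the second number. This is only a sketch, still assuming exactly two decimal places; the helper name first_amount is made up for illustration:

```python
import re

# Optional zero padding, then the amount with exactly two decimals (assumed).
# 0* rather than 0+ so that a value without padding still matches.
AMOUNT = re.compile(r'0*(\d+\.\d{2})')

def first_amount(value):
    """Return the first dollar amount as a string, or None if nothing matches."""
    m = AMOUNT.match(str(value))  # match() only tries at the start of the string
    return m.group(1) if m else None

print(first_amount('0000000000565400.60000000008289.116000'))  # 565400.60
print(first_amount('0000000000000000.50000000000198.431000'))  # 0.50
print(first_amount('n/a'))                                     # None
```

Returning None for rows that don't fit the pattern makes bad records easy to spot instead of raising halfway through a file.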
Upvotes: 0
Reputation: 150785
You can try some regex and .str.extract on the series:
df.Dollars.str.extract(r'0+(\d+\.\d\d)')
Output:
              0
0     565400.60
1  466175218.82
2       5720.77
Or, if you'd like the other part as well:
df.Dollars.str.extractall(r'0+(\d+\.\d\d)0+(\d+\.\d+)$')
Output:
                    0                1
  match
0 0         565400.60      8289.116000
1 0      466175218.82  43978794.228000
2 0           5720.77       198.431000
Upvotes: 1
Reputation: 27321
A pandas solution to your problem might look something like:
>>> import pandas as pd
>>> dollars = pd.Series(['0000000000565400.60000000008289.116000',
...                      '0000000466175218.82000043978794.228000',
...                      '0000000000005720.77000000000198.431000'])
>>> dollars.str[:19].astype(float).astype(str)
0        565400.6
1    466175218.82
2         5720.77
dtype: object
Upvotes: 1
Reputation: 77860
Since you know that the amounts are in dollars and cents, you know that you have two decimal places. Simply grab that much of the input line and strip off the extra zeroes.
line = "1. 0000000000565400.60000000008289.116000"
float_pair = line.split()[1]
dec_pt = float_pair.find('.')
amt_str = float_pair[:dec_pt+3] # one extra for the decimal point
amt_no_zero = amt_str.lstrip('0')
Result:
dec_pt = 16
amt_no_zero = '565400.60'
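One edge case worth checking: for an amount under one dollar, lstrip('0') also removes the zero in front of the decimal point. Below is a small sketch of a guard, with extract_amount as a hypothetical helper name and still assuming two decimal places:

```python
def extract_amount(field):
    """Keep everything up to two digits past the first '.', minus zero padding."""
    dec_pt = field.find('.')
    if dec_pt < 0:
        raise ValueError(f"no decimal point in {field!r}")
    amt = field[:dec_pt + 3].lstrip('0')
    # '0000.50' strips down to '.50', so put one leading zero back
    return '0' + amt if amt.startswith('.') else amt

print(extract_amount('0000000000565400.60000000008289.116000'))  # 565400.60
print(extract_amount('0000000000000000.50000000000198.431000'))  # 0.50
```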
Upvotes: 0
Reputation: 27321
Your data looks suspiciously like a fixed length record format, which means you can use string subscripts to get at the raw data:
>>> line = '0000000000565400.60000000008289.116000'
>>> first_number = line[:20] # or :19 if you only want 2 digits after the dot
>>> first_number
'0000000000565400.600'
Since you're working with money amounts, the decimal.Decimal class is good to use for extracting the amount:
>>> from decimal import Decimal
>>> Decimal(first_number)
Decimal('565400.600')
If you need it as a string, then it's simply:
>>> str(Decimal(first_number))
'565400.600'
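To tie this back to the CSV, here is a sketch using only the standard library. It assumes the first field is always 19 characters wide (16 digits, the point, 2 decimals), which matches the three sample rows but should be confirmed against the real file. The column name Dollars comes from the question, and the in-memory file just stands in for the real CSV:

```python
import csv
import io
from decimal import Decimal

# Stand-in for the real file; only the 'Dollars' column name is from the question.
sample = io.StringIO(
    "Dollars\n"
    "0000000000565400.60000000008289.116000\n"
    "0000000466175218.82000043978794.228000\n"
    "0000000000005720.77000000000198.431000\n"
)

FIELD_WIDTH = 19  # assumed: 16 integer digits + '.' + 2 decimals

# Slice off the first fixed-width field of each row and parse it exactly.
amounts = [Decimal(row["Dollars"][:FIELD_WIDTH]) for row in csv.DictReader(sample)]

for amount in amounts:
    print(amount)  # e.g. 565400.60 (Decimal keeps the trailing zero)
```

Unlike float, Decimal parses the text exactly and remembers the two decimal places, so str(amount) gives '565400.60' rather than '565400.6'.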
Upvotes: 0
Reputation: 346
You can use the built-in string 'find' function, then slice the string up to the position of the first decimal point plus three (one for the point itself and two for the cents). Then cast it as a float (or as an integer via int(float(...))), however you'd like:
s = '00000123000.0000123000.000'
pos = s.find('.')
number = float(s[:pos+3])
This should work so long as your objects are strings or can be cast as strings.
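Keep in mind that a float does not remember trailing zeros, so 565400.60 prints as 565400.6. If the two-decimal form matters for display, it can be formatted back. A brief sketch, using a slice that ends three positions past the point (the point plus two cents digits):

```python
s = '0000000000565400.60000000008289.116000'
pos = s.find('.')
number = float(s[:pos + 3])   # slice ends right after the two cents digits

print(number)           # 565400.6
print(f"{number:.2f}")  # 565400.60
```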
Upvotes: 0
Reputation: 51683
This should find all your numbers just fine. Caveat: only 2 decimals are possible for the 1st number, because it is cut at a fixed position after the '.':
text = """
0000000000565400.60000000008289.116000
0000000466175218.82000043978794.228000
0000000000005720.77000000000198.431000
"""
for line in text.split():
    a = line[:line.index(".")+3]  # find the first ".", take 2 digits after it
    b = float(line[len(a):])      # take the remainder after a and convert to float
    a = float(a)                  # convert a to float (it had to stay a string for len())
    print(a, b)
Output:
565400.6 8289.116
466175218.82 43978794.228
5720.77 198.431
Upvotes: 0
Reputation: 2992
Let's assume you want the first number up to the dot plus two digits after it. Then, for a single string grab:
i = grab.find('.')
if i >= 0:
    grab = grab[:i + 3].lstrip('0')
will produce what you want.
Upvotes: 0
Reputation: 612
Just do something along the lines of:
Search for first non-zero character index (i0).
Search for first . index (i1).
Number you want = string[i0: i1+3]
This is all plain Python slicing. The +3 is +1 because the end of a slice is exclusive, plus 2 digits for the cents, which seems to be what you want. This also assumes you always want the first number only.
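The steps above could be sketched roughly as follows. The function name slice_amount is invented for the example, and the adjustment for amounts under a dollar is an addition that the steps themselves don't cover:

```python
def slice_amount(s):
    # i0: index of the first character that is not '0'
    i0 = next((i for i, ch in enumerate(s) if ch != '0'), len(s))
    # i1: index of the first '.' (raises ValueError if there is none)
    i1 = s.index('.')
    # For an amount under a dollar i0 lands on the '.', so step back one
    # character to keep a single leading zero.
    if i0 == i1 and i0 > 0:
        i0 -= 1
    return s[i0:i1 + 3]

print(slice_amount('0000000466175218.82000043978794.228000'))  # 466175218.82
print(slice_amount('0000000000000000.50000000000198.431000'))  # 0.50
```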
Upvotes: 0