Reputation: 25

Is there a way I can extract numbers from a string surrounded by differing amounts of zeros on both sides?

I have to extract a dollar amount from one column of a csv. The dollar amount is surrounded by zeros.

Dollars

1.   0000000000565400.60000000008289.116000 
2.   0000000466175218.82000043978794.228000
3.   0000000000005720.77000000000198.431000

From these three lines, I'd be looking to pull 565400.60, 466175218.82, 5720.77. I don't want/care about the 8289.116, 43978794.228, 198.431.

The dtype is a non-null object. I've tried converting it to a string and extracting the first 20 characters. I've also tried replacing several zeros with nothing. Obviously I can't get rid of all the zeros because of numbers like 565400.60 where I need the zeros. Even if I could get it converted to a dtype I wanted, I would have no idea how to pull the differing amounts of zeros.

dollars = str(dollars)
dollars = dollars.str.replace('0000000000','')

grab = dollars['Dollars'].astype(str).str[0:20]

Expected:

Dollars
------------
1. 565400.60
2. 466175218.82
3. 5720.77

Some of my errors:

TypeError: string indices must be integers
AttributeError: 'str' object has no attribute 'str'

Upvotes: 2

Answers (10)

Ryan Stefan

Reputation: 134

You could use regex with a pattern something like ^0+(\d+\.\d{2})0+ (assuming it's always two digits after the decimal point). Note the escaped \. so that it matches only a literal dot rather than any character. You can test the pattern on https://regex101.com/. Here's the code:

import re

string = ['0000000000565400.60000000008289.116000',
          '0000000466175218.82000043978794.228000',
          '0000000000005720.77000000000198.431000']

pattern = r'0+(?P<first_number>\d+\.\d{2})0+'
for line in string:
    matches = re.search(pattern, str(line))
    print(matches['first_number'])

# 565400.60
# 466175218.82
# 5720.77

Also if you're working with a block of text:

string = '''0000000000565400.60000000008289.116000
0000000466175218.82000043978794.228000
0000000000005720.77000000000198.431000
'''

pattern = r'0+(\d+\.\d{2})0+'
matches = re.findall(pattern, string)
print(matches)

# ['565400.60', '466175218.82', '5720.77']
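One thing to watch for: re.search returns None when a row does not match (a blank cell, say), and indexing None raises a TypeError. A minimal sketch of a guarded helper follows; the name extract_amount and the anchored pattern are my own illustration, and it assumes every amount has exactly two decimals:

```python
import re

# Skip any leading zeros, then capture the amount with exactly two decimals.
# Assumption: the amount is always the first field of the string.
PATTERN = re.compile(r'^0*(\d+\.\d{2})')

def extract_amount(line):
    """Return the first amount as a string, or None if the line does not match."""
    match = PATTERN.search(str(line).strip())
    return match.group(1) if match else None

lines = ['0000000000565400.60000000008289.116000',
         '0000000466175218.82000043978794.228000',
         '0000000000005720.77000000000198.431000']
amounts = [extract_amount(line) for line in lines]
print(amounts)
```

Because the pattern backtracks over the leading zeros, an amount under one dollar should keep a single zero before the dot (0.60 rather than .60).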

Upvotes: 0

Quang Hoang

Reputation: 150785

You can try some regex and .str.extract on the series:

df.Dollars.str.extract(r'0+(\d+\.\d\d)')

Output:

    0
0   565400.60
1   466175218.82
2   5720.77

Or if you want the other part as well:

df.Dollars.str.extractall(r'0+(\d+\.\d\d)0+(\d+\.\d+)$')

output:

                    0                1
  match                               
0 0         565400.60      8289.116000
1 0      466175218.82  43978794.228000
2 0           5720.77       198.431000

Upvotes: 1

thebjorn

Reputation: 27321

A pandas solution to your problem might look something like:

>>> dollars = pd.Series(['0000000000565400.60000000008289.116000',
...                      '0000000466175218.82000043978794.228000',
...                      '0000000000005720.77000000000198.431000'])
>>> dollars.str[:19].astype(float).astype(str)
0        565400.6
1    466175218.82
2         5720.77
dtype: object
>>>

Upvotes: 1

Prune

Reputation: 77860

Since you know that the amounts are in dollars and cents, you know that you have two decimal places. Simply grab that much of the input line and strip off the extra zeroes.

line = "1.   0000000000565400.60000000008289.116000"
float_pair = line.split()[1]
dec_pt = float_pair.find('.')
amt_str = float_pair[:dec_pt+3]   # one extra for the decimal point
amt_no_zero = amt_str.lstrip('0')

Result:

dec_pt = 16
amt_no_zero = '565400.60'

Upvotes: 0

thebjorn

Reputation: 27321

Your data looks suspiciously like a fixed length record format, which means you can use string subscripts to get at the raw data:

>>> line = '0000000000565400.60000000008289.116000'
>>> first_number = line[:20]  # or :19 if you only want 2 digits after the dot
>>> first_number
'0000000000565400.600'

Since you're working with money amounts, the decimal.Decimal class is good to use for extracting the amount:

>>> from decimal import Decimal
>>> Decimal(first_number)
Decimal('565400.600')

If you need it as a string, then it's simply:

>>> str(Decimal(first_number))
'565400.600'
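As a rough sketch of the :19 variant mentioned above, assuming the first field always occupies the first 19 characters of every record (that width is inferred from the three sample lines, not from any format specification):

```python
from decimal import Decimal

lines = ['0000000000565400.60000000008289.116000',
         '0000000466175218.82000043978794.228000',
         '0000000000005720.77000000000198.431000']

# Assumption: fixed-width records, first field is characters 0..18.
# Decimal drops the leading zeros but keeps the two decimal places.
amounts = [Decimal(line[:19]) for line in lines]
total = sum(amounts)

print([str(a) for a in amounts])  # ['565400.60', '466175218.82', '5720.77']
print(total)                      # 466746340.19
```

Keeping the values as Decimal rather than float means the trailing zero in 565400.60 survives and sums stay exact to the cent.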

Upvotes: 0

Jeff

Reputation: 346

You can use the built-in string 'find' function, then slice the string using the position of the first decimal point (plus three positions: one for the decimal point itself and two for cents). Then convert it to a float:

s = '00000123000.0000123000.000'
pos = s.find('.')
number = float(s[:pos+3])  # slice end is exclusive, so +3 keeps two digits after the dot

This should work so long as your objects are strings or can be cast as strings.

Upvotes: 0

Patrick Artner

Reputation: 51683

This should find all your numbers just fine - caveat: only 2 decimals possible due to fixed position after . for 1st number:

text = """
0000000000565400.60000000008289.116000
0000000466175218.82000043978794.228000
0000000000005720.77000000000198.431000
"""

for line in text.split(): 
    a = (line[:line.index(".")+3])  # find first ., take 2 digits after
    b = float(line[len(a):])        # take remainder after first find and convert to float
    a = float(a)                    # convert a to float (needed string for len())
    print(a,b)

Output:

565400.6 8289.116
466175218.82 43978794.228
5720.77 198.431

Upvotes: 0

Radosław Cybulski

Reputation: 2992

Let's assume you want the first number up to the dot, plus two digits after it. Then:

i = grab.find('.')
if i >= 0:
    grab = grab[:i + 3].lstrip('0')

will produce what you want.

Upvotes: 0

Evan Mata

Reputation: 612

Just do something along the lines of:

Search for first non-zero character index (i0).

Search for first . index (i1).

Number you want = string[i0: i1+3]

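All of this maps directly onto Python string slicing. The +3 is there because the slice end is exclusive: +1 to step past the decimal point, and +2 for the two cents digits, which seems to be what you want. This also assumes you always want the first number only.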
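A minimal sketch of those three steps in Python; the function name first_amount is only for illustration, and it assumes the string contains at least one decimal point:

```python
def first_amount(s):
    # i0: index of the first character that is not '0'
    i0 = len(s) - len(s.lstrip('0'))
    # i1: index of the first decimal point (raises ValueError if there is none)
    i1 = s.index('.')
    # Slice from the first non-zero character through two digits after the dot
    return s[i0:i1 + 3]

print(first_amount('0000000000565400.60000000008289.116000'))  # 565400.60
print(first_amount('0000000000005720.77000000000198.431000'))  # 5720.77
```

Note that an amount under one dollar would come back without a leading zero (for example .60), since every zero before the dot counts as padding here.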

Upvotes: 0

yatu

Reputation: 88275

You could use str.index to find the position of the first ., slice up to two digits past it, and strip the leading zeros:

s = '0000000000565400.60000000008289.116000 '
s[:s.index('.')+3].lstrip('0')
# '565400.60'

Upvotes: 3
