Maestra

Reputation: 11

How to convert an integer into words only when it is followed by certain words?

I am trying to write a function that converts an integer into its word form only if the integer is followed by certain words. I want every number that comes directly before a word such as "hours", "hour", "day", "days", "minutes" or "times" to be spelled out, and every other number to be left unchanged.

For example, given "I am 45, I came here 4 times and I have been waiting for 6 hours.", the result should be "I am 45, I came here four times and I have been waiting for six hours."

I tried to write code for this, but I am stuck on two points:

  1. I get the expected result in the previous case, but with something like "I am 45, I came here 4 times and I have been waiting for 45 hours.", my code returns "I am forty-five , I came here 4 times and I have been waiting for forty-five hours.", and I don't want the first "45" to be changed.

  2. When I test my code on a single sentence it works, but when I apply it to an entire dataframe column with the map function, it fails. Here is my code and the error I get.

    
    import pandas as pd
    from num2words import num2words
    import re
    
    text = [[1, "I am writing some very basic english sentences"],
               [2, " i am 45 old and worked 3 times this week for 45 hours " ],
                [3, " i am 75 old and worked 6 times this week for 45 hours "]]
    
    Data = pd.DataFrame(text, columns=["index", "text"])
    Data
    
    def remove_numbers(text):
        m = re.findall('\d+\s(?=hour|day|days|hours|hrs|hr|minutes|min|time|times)', text)
        for i in range(len(m)):
            if m[i]:
                t = m[i]
                t2 = num2words(t) 
                clean = re.sub(t, t2+' ', text)
                text = clean
        return clean
    
    Data['text'] = pd.DataFrame(Data['text'].map(remove_numbers))
    Data['text']
    
    

The error I get:

    ---------------------------------------------------------------------------
    UnboundLocalError                         Traceback (most recent call last)
    <ipython-input-165-b46ce833010e> in <module>
         16     return clean
         17 
    ---> 18 Data['text'] = pd.DataFrame(Data['text'].map(remove_numbers))
         19 Data['text']
    
    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in map(self, arg, na_action)
       3907         dtype: object
       3908         """
    -> 3909         new_values = super()._map_values(arg, na_action=na_action)
       3910         return self._constructor(new_values, index=self.index).__finalize__(
       3911             self, method="map"
    
    ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
        935 
        936         # mapper is a function
    --> 937         new_values = map_f(values, mapper)
        938 
        939         return new_values
    
    pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
    
    <ipython-input-165-b46ce833010e> in remove_numbers(text)
         14             clean = re.sub(t, t2+' ', text)
         15             text = clean
    ---> 16     return clean
         17 
         18 Data['text'] = pd.DataFrame(Data['text'].map(remove_numbers))
    
    UnboundLocalError: local variable 'clean' referenced before assignment

Can someone please help me solve these two issues?

Upvotes: 0

Views: 44

Answers (1)

Ryan Rau

Reputation: 148

The last error is what's getting you. In your example, text[0][1] has no matches, so m is empty, the loop body never runs, and the function reaches return clean before clean has been assigned anything.

Try:

    def remove_numbers(text):
        m = re.findall(r'\d+\s(?=hour|day|days|hours|hrs|hr|minutes|min|time|times)', text)
        clean = text  # defined up front, so it exists even when m is empty
        for i in range(len(m)):
            if m[i]:
                t = m[i]
                t2 = num2words(t)
                clean = re.sub(t, t2+' ', text)
                text = clean
        return clean
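To see the cause in isolation, here is a minimal sketch of the same failure with nothing but a loop. The function names are made up for illustration and are not from the question.

```python
def last_item_broken(items):
    for item in items:
        result = item   # only assigned if the loop runs at least once
    return result       # raises UnboundLocalError when items is empty


def last_item_fixed(items, default=None):
    result = default    # assigned before the loop, so it is always defined
    for item in items:
        result = item
    return result


if __name__ == "__main__":
    print(last_item_fixed([]))        # no error, falls back to the default
    try:
        last_item_broken([])
    except UnboundLocalError as exc:
        print("broken version failed:", exc)
```

The first row of your dataframe plays the role of the empty list here: no number is found, so the assignment inside the loop never happens.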

UPDATE

I forgot about the first part of the question. You need to apply the regex again when substituting the new value: re.sub(t, ...) searches for the bare number, so in the case of text[1][1] it replaces both occurrences of 45, not only the one followed by "hours".

Try:

    def remove_numbers(text):
        units = r'(?=hour|day|days|hours|hrs|hr|minutes|min|time|times)'
        clean = text
        # Each match is the digits plus the trailing whitespace, e.g. '45 '.
        for t in set(re.findall(r'\d+\s' + units, text)):
            number = t.strip()
            t2 = num2words(int(number))
            # Only replace this exact number where one of the units follows it.
            pattern = r'\b' + number + r'\s' + units
            clean = re.sub(pattern, t2 + ' ', clean)
        return clean
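As an alternative, you could probably do the whole thing in one pass by giving re.sub a replacement function, so each number is converted exactly where it was matched and there is no second search. This is only a sketch under a few assumptions: small_number_to_words is a made-up stand-in for num2words that handles 0 to 99 so the example runs without the extra package, spell_out_numbers is a name I picked, and the unit list is my reading of your pattern with word boundaries added.

```python
import re

# Units that trigger the conversion (assumed from the pattern in the question).
UNITS = r"(?:hours?|hrs?|days?|minutes?|mins?|times?)"
# A whole number that is followed by whitespace and one of the units.
# The lookahead checks the unit without consuming it.
NUMBER_BEFORE_UNIT = re.compile(r"\b(\d+)(?=\s" + UNITS + r"\b)")

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def small_number_to_words(n):
    """Hypothetical stand-in for num2words; only valid for 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("stand-in converter only supports 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + "-" + ONES[ones]


def spell_out_numbers(text, to_words=small_number_to_words):
    """Spell out numbers that directly precede a unit; leave the rest alone."""
    return NUMBER_BEFORE_UNIT.sub(lambda m: to_words(int(m.group(1))), text)


if __name__ == "__main__":
    sample = "I am 45, I came here 4 times and I have been waiting for 45 hours."
    print(spell_out_numbers(sample))
```

With the real package you would pass to_words=num2words, and for the dataframe something like Data['text'] = Data['text'].map(spell_out_numbers) should work. Rows without any number simply come back unchanged, so the UnboundLocalError cannot occur.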

Upvotes: 1
