smartypants97

Reputation: 11

Trying to replace words in a dataset (DataFrame)

I am trying to substitute a placeholder for certain words in my dataset. However, my method doesn't seem to do anything. I do not get an error, but the words are not replaced either. What am I doing wrong here?

CODE:

wordlist_urls =['co','https','http', 'www']
wordlist_news = ['nrc','volkskrant','ad', 'telegraaf', 'dagblad','courant']
wordlist_socials = ['twitter','instagram','linkedin', 'blog', 'twitteraccount']
wordlist_links = ['GroenLinks','sp','bij1', 'pvda', 'pvdd', 'DENK']
wordlist_rechts = ['FvD','VVD','PvdA', 'CDA', 'ja21', 'CU', 'SGP', 'Volt', 'bvnl']
wordlist_uni = ['uva','vu','rug', 'university', 'universiteit', 'Utrecht University', 'Leiden university', 'UU']

written_news['placeholders'] = written_news['user_description_clean'].replace(wordlist_urls,'URL')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_news,'NEWSPAPERS')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_socials,'SOCIALS')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_links,'POL_L')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_rechts,'POL_R')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_uni,'UNI')

written_news['placeholders']

I tried using the replace() method and expected the words in each wordlist to show up in the data as the corresponding placeholder. However, the words are still unchanged in the dataset.

Upvotes: 1

Views: 205

Answers (2)

Thomas Frissen

Reputation: 61

It is hard to provide a solution if you don't tell us how your data are formatted.

Looking at your other question here at StackOverflow, one issue could be that your column called user_description_clean is a pandas Series of lists, where each row is a tokenized string stored as a Python list of words. Or perhaps each row is just one string?

In any case, you could consider writing a function that searches for the words via regular expressions. You can then use .apply() with a lambda to replace the words in each row of your data frame.

It would look like this:

#import the packages
import pandas as pd
import re

#example mock-up data
written_news=pd.DataFrame({'user_description_clean': [["voorbeeld", "volkskrant", "achtuurjournaal", "telegraaf", "dagblad", "media"],
                                                      ["courant","krantje", "dagblad", "nrc", "media"],
                                                      ["nrc", "volkskrant", "algemeen", "dagblad", "NRC"],
                                                      ["python", "pandas", "numpy", "big", "data"],
                                                      ["python", "bs4", "spacy", "tensorflow"]]})

wordlist_news = ['nrc','volkskrant', 'telegraaf', 'dagblad','courant']

#create your function
def placeholder_maker(sentence, wordlist, placeholder):
    sentence=" ".join(sentence) #only if your data are formatted as list of tokens. If your data is just a sentence, comment this line out.
    for word in wordlist:
        #re.escape treats the word literally; \b restricts the match to whole words, so 'ad' would not hit 'dagblad'
        sentence=re.sub(r"\b" + re.escape(word) + r"\b", placeholder, sentence)
    return sentence.split() #Or return sentence if you don't want a tokenized sentence again

#run the function with .apply() and lambda 
written_news['placeholder'] = written_news['user_description_clean'].apply(lambda row: placeholder_maker(sentence=row, wordlist=wordlist_news, placeholder="NEWSPAPERS"))

#print the result
print(written_news['placeholder'])

The output would look like this:

>>> print(written_news['placeholder'])
0    [voorbeeld, NEWSPAPERS, achtuurjournaal, NEWSP...
1    [NEWSPAPERS, krantje, NEWSPAPERS, NEWSPAPERS, ...
2    [NEWSPAPERS, NEWSPAPERS, algemeen, NEWSPAPERS,...
3                   [python, pandas, numpy, big, data]
4                     [python, bs4, spacy, tensorflow]
Name: placeholder, dtype: object

If you have another list, you just change the input for your arguments in the following way:

#second wordlist
wordlist_python =['python', 'pandas','spacy','tensorflow']

#update the placeholder column
written_news['placeholder'] = written_news['placeholder'].apply(lambda row: placeholder_maker(sentence=row, wordlist=wordlist_python, placeholder="MACHINELEARNING")) 

#print the result   
print(written_news['placeholder'])

which results in:

>>> print(written_news['placeholder'])
0    [voorbeeld, NEWSPAPERS, achtuurjournaal, NEWSP...
1    [NEWSPAPERS, krantje, NEWSPAPERS, NEWSPAPERS, ...
2    [NEWSPAPERS, NEWSPAPERS, algemeen, NEWSPAPERS,...
3    [MACHINELEARNING, MACHINELEARNING, numpy, big,...
4    [MACHINELEARNING, bs4, MACHINELEARNING, MACHIN...
Name: placeholder, dtype: object
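
With several wordlists, calling the function once per list gets repetitive. As a rough sketch of an alternative (not the regex approach above, but plain exact-token matching), all lists can be collected in one dictionary and applied in a single pass. The names placeholder_map, token_to_placeholder and replace_tokens are made up for this illustration, and the sketch assumes every row is a list of single-word tokens, so multi-word entries such as 'Utrecht University' would not match.

```python
import pandas as pd

# Hypothetical mapping from placeholder to wordlist (names are illustrative only).
placeholder_map = {
    "NEWSPAPERS": ["nrc", "volkskrant", "telegraaf", "dagblad", "courant"],
    "MACHINELEARNING": ["python", "pandas", "spacy", "tensorflow"],
}

# Invert the mapping into a token -> placeholder lookup.
# Keys are lower-cased so that 'NRC' and 'nrc' are treated alike.
token_to_placeholder = {
    word.lower(): placeholder
    for placeholder, words in placeholder_map.items()
    for word in words
}

def replace_tokens(tokens):
    """Swap every token found in the lookup for its placeholder; keep the rest."""
    return [token_to_placeholder.get(token.lower(), token) for token in tokens]

# Small mock-up, assuming each row is a list of tokens.
demo = pd.DataFrame({"user_description_clean": [
    ["nrc", "volkskrant", "algemeen", "dagblad", "NRC"],
    ["python", "bs4", "spacy", "tensorflow"],
]})
demo["placeholder"] = demo["user_description_clean"].apply(replace_tokens)
print(demo["placeholder"].tolist())
```

Because the lookup matches whole tokens only, short entries such as 'ad' or 'co' cannot accidentally replace part of a longer word.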

But again, a minimal reproducible example would be helpful, as it shows how your data are formatted in the first place.
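
If the column turns out to hold plain strings rather than token lists, a sketch along the following lines might be enough, with no custom function needed. It uses Series.str.replace with a single regular expression built from the wordlist. The mock-up frame demo_text and its sentences are invented for illustration, and the case-insensitive flag is my assumption about what you want.

```python
import re
import pandas as pd

# Mock-up data, assuming each row is one plain string (not a list of tokens).
demo_text = pd.DataFrame({"user_description_clean": [
    "journalist bij de volkskrant en het nrc",
    "python en pandas fan",
]})
wordlist_news = ["nrc", "volkskrant", "telegraaf", "dagblad", "courant"]

# One pattern for the whole list: \b(?:nrc|volkskrant|...)\b
# re.escape keeps each word literal; \b limits matches to whole words.
pattern = r"\b(?:" + "|".join(re.escape(word) for word in wordlist_news) + r")\b"

demo_text["placeholder"] = demo_text["user_description_clean"].str.replace(
    pattern, "NEWSPAPERS", regex=True, flags=re.IGNORECASE
)
print(demo_text["placeholder"].tolist())
```

The same call can be repeated on the placeholder column for each further wordlist.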

Upvotes: 0

Tim Maier

Reputation: 51

The inplace keyword can help here. A simple example:

import pandas as pd
df = pd.DataFrame({"A":[1,2,3,4], "B": ["foo1","foo2","foo3", "bar"]})
foos = ["foo1","foo2","foo3"]
df["B"].replace(foos, "foo", inplace=True)

Printing df will return:

>>> print(df)
   A    B
0  1  foo
1  2  foo
2  3  foo
3  4  bar
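
Two caveats, sketched below under my own assumptions. First, replace() without regex=True only swaps cells whose entire value equals one of the listed words, so a word inside a longer description stays untouched, which may be why the code in the question appears to do nothing. Second, as far as I know, recent pandas versions can warn about calling an inplace method on a selected column, and with Copy-on-Write enabled the original frame may not be updated at all, so assigning the result back is the safer form. The extra row "a foo1 bar" and the column names B_whole and B_sub are invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["foo1", "foo2", "foo3", "a foo1 bar"]})
foos = ["foo1", "foo2", "foo3"]

# Whole-value replacement, assigned back instead of using inplace=True.
# The last row is left alone because the cell is not exactly equal to "foo1".
df["B_whole"] = df["B"].replace(foos, "foo")

# With regex=True each list entry is treated as a pattern,
# so matches inside a longer string are replaced as well.
df["B_sub"] = df["B"].replace(foos, "foo", regex=True)

print(df[["B_whole", "B_sub"]])
```

For real wordlists the regex variant would still need word boundaries, otherwise short entries can match inside longer words.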

Upvotes: 0
