Reputation: 11
I am trying to replace certain words in my dataset with placeholders. However, my method doesn't seem to do anything: I do not get an error, but the words are not replaced either. What am I doing wrong here?
CODE:
wordlist_urls =['co','https','http', 'www']
wordlist_news = ['nrc','volkskrant','ad', 'telegraaf', 'dagblad','courant']
wordlist_socials = ['twitter','instagram','linkedin', 'blog', 'twitteraccount']
wordlist_links = ['GroenLinks','sp','bij1', 'pvda', 'pvdd', 'DENK']
wordlist_rechts = ['FvD','VVD','PvdA', 'CDA', 'ja21', 'CU', 'SGP', 'Volt', 'bvnl']
wordlist_uni = ['uva','vu','rug', 'university', 'universiteit', 'Utrecht University', 'Leiden university', 'UU']
written_news['placeholders'] = written_news['user_description_clean'].replace(wordlist_urls,'URL')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_news,'NEWSPAPERS')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_socials,'SOCIALS')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_links,'POL_L')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_rechts,'POL_R')
written_news.loc['placeholders'] = written_news.loc['placeholders'].replace(wordlist_uni,'UNI')
written_news['placeholders']
I tried using the replace() method and expected the words in each wordlist to show up in the data as the corresponding placeholder. However, the words are still unchanged in the dataset.
Upvotes: 1
Views: 205
Reputation: 61
It is hard to provide a solution if you don't tell us how your data are formatted.
Looking at your other question here on Stack Overflow, one issue could be that your column user_description_clean is a pandas Series of lists, where each row is a tokenized string stored as a Python list of words. Or perhaps each row is just one string?
In any case, you could write a function that searches for the words with regular expressions. You can then use .apply() with a lambda to replace the words in each row of your data frame.
It would look like this:
#import the packages
import pandas as pd
import re
#example mock-up data
written_news=pd.DataFrame({'user_description_clean': [["voorbeeld", "volkskrant", "achtuurjournaal", "telegraaf", "dagblad", "media"],
["courant","krantje", "dagblad", "nrc", "media"],
["nrc", "volkskrant", "algemeen", "dagblad", "NRC"],
["python", "pandas", "numpy", "big", "data"],
["python", "bs4", "spacy", "tensorflow"]]})
wordlist_news = ['nrc','volkskrant', 'telegraaf', 'dagblad','courant']
#create your function
def placeholder_maker(sentence, wordlist, placeholder):
    sentence = " ".join(sentence)  # only if your data are formatted as a list of tokens. If your data is just a sentence, comment this line out.
    for word in wordlist:
        if word in sentence:
            sentence = re.sub(word, placeholder, sentence)
    return sentence.split()  # or return sentence if you don't want a tokenized sentence again
#run the function with .apply() and lambda
written_news['placeholder'] = written_news['user_description_clean'].apply(lambda row: placeholder_maker(sentence=row, wordlist=wordlist_news, placeholder="NEWSPAPERS"))
#print the result
print(written_news['placeholder'])
the output would look like this:
>>> print(written_news['placeholder'])
0 [voorbeeld, NEWSPAPERS, achtuurjournaal, NEWSP...
1 [NEWSPAPERS, krantje, NEWSPAPERS, NEWSPAPERS, ...
2 [NEWSPAPERS, NEWSPAPERS, algemeen, NEWSPAPERS,...
3 [python, pandas, numpy, big, data]
4 [python, bs4, spacy, tensorflow]
Name: placeholder, dtype: object
If you have another list, you just change the input for your arguments in the following way:
#second wordlist
wordlist_python =['python', 'pandas','spacy','tensorflow']
#update the placeholder column
written_news['placeholder'] = written_news['placeholder'].apply(lambda row: placeholder_maker(sentence=row, wordlist=wordlist_python, placeholder="MACHINELEARNING"))
#print the result
print(written_news['placeholder'])
which results in:
>>> print(written_news['placeholder'])
0 [voorbeeld, NEWSPAPERS, achtuurjournaal, NEWSP...
1 [NEWSPAPERS, krantje, NEWSPAPERS, NEWSPAPERS, ...
2 [NEWSPAPERS, NEWSPAPERS, algemeen, NEWSPAPERS,...
3 [MACHINELEARNING, MACHINELEARNING, numpy, big,...
4 [MACHINELEARNING, bs4, MACHINELEARNING, MACHIN...
Name: placeholder, dtype: object
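One caveat with the function above: re.sub() replaces substrings, so short entries from the question's lists such as 'ad' or 'co' would also match inside longer words. If the rows really are lists of tokens, a rough alternative is to compare whole tokens against a lookup dict, which also handles all wordlists in one pass. This is only a sketch: the mock-up data, the shortened wordlists, and the names wordlists, lookup and replace_tokens are made up for illustration, and lowercasing the tokens is an assumption about what you want.

```python
import pandas as pd

# Hypothetical mock-up data: each row is a list of tokens
written_news = pd.DataFrame({"user_description_clean": [
    ["nrc", "volkskrant", "vandaag"],
    ["twitter", "advies", "uva"],
]})

# Placeholder -> words, shortened versions of the question's lists
wordlists = {
    "NEWSPAPERS": ["nrc", "volkskrant", "ad"],
    "SOCIALS": ["twitter", "instagram"],
    "UNI": ["uva", "vu"],
}

# Invert into one token -> placeholder lookup; keys are lowercased
# so that matching ignores case
lookup = {word.lower(): placeholder
          for placeholder, words in wordlists.items()
          for word in words}

def replace_tokens(tokens):
    # Swap a token only when the whole token is in the lookup
    return [lookup.get(token.lower(), token) for token in tokens]

written_news["placeholder"] = written_news["user_description_clean"].apply(replace_tokens)
print(written_news["placeholder"].tolist())
```

Here "advies" stays untouched even though it contains "ad". Note that if the same word appears in two lists (after lowercasing), the list that comes later in the dict wins.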
But again, a minimal reproducible example would be helpful, as it shows how your data are formatted in the first place.
Upvotes: 0
Reputation: 51
The inplace keyword can help here. A simple example:
import pandas as pd
df = pd.DataFrame({"A":[1,2,3,4], "B": ["foo1","foo2","foo3", "bar"]})
foos = ["foo1","foo2","foo3"]
df["B"].replace(foos, "foo", inplace=True)
Printing df will return:
>>> print(df)
   A    B
0  1  foo
1  2  foo
2  3  foo
3  4  bar
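Keep in mind that replace() with a list compares each entry against the entire cell value, as in the foo example above. If user_description_clean holds full sentences rather than single words, that is probably why nothing changes in the question. A minimal sketch of the difference, assuming one plain string per row (the data and the column name text are hypothetical); it uses .str.replace() with a word-boundary pattern for the substring case and assigns the result back to a column instead of using inplace, which recent pandas versions may warn about when it is called on a column selection:

```python
import re
import pandas as pd

# Hypothetical data: one plain string per row
df = pd.DataFrame({"text": ["lees de volkskrant en nrc", "advies via twitter", "nrc"]})
wordlist_news = ["nrc", "volkskrant", "ad"]

# Whole-cell matching: only the row that is exactly "nrc" changes
whole_cell = df["text"].replace(wordlist_news, "NEWSPAPERS")

# Substring matching: one pattern with word boundaries,
# so "ad" does not match inside "advies"
pattern = r"\b(?:" + "|".join(re.escape(w) for w in wordlist_news) + r")\b"
df["placeholders"] = df["text"].str.replace(pattern, "NEWSPAPERS", regex=True)

print(whole_cell.tolist())
print(df["placeholders"].tolist())
```

The same pattern-building step can be repeated per wordlist, each time assigning back to df["placeholders"].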
Upvotes: 0