Reputation:
I need to add more columns to a data frame. A sample of my data:
Date    Sentence
28 Jan  who.c
30 Jan  house.a
02 Feb  eurolet.it
I need to add another column, `Tp`, that assigns a value to each link: if a sentence ends with `a`, then assign `apartment`; if it ends with `b`, then assign `bungalow`, and so on as shown in `original`; if a sentence ends with `UK`, then assign `United Kingdom`; if it ends with `IT`, then assign `Italy`, and so on. These values are from `country`.
I would expect something like this:
Date    Sentence    Tp
28 Jan  who.c       church
30 Jan  house.a     apartment
02 Feb  eurolet.it  Italy
I wrote the following:
conditions = [df['Sentence'].str.endswith(original), df['Sentence'].str.endswith(country)]
choices = [original, country]
# df['Tp'] = df.apply(lambda row: urlparse(row['Sentence']).netloc, axis = 1)
df['Tp'] = np.select(conditions, choices, default ='Unknown')
print(df)
where
original= [('a', 'apartment'), ('b', 'bungalow'), ('c', 'church')]
and
country = [('UK', 'United Kingdom'), ('IT', 'Italy'), ('DE', 'Germany'), ('H', 'Holland'), ..., ('F', 'France'), ('S', 'Spain')]
`country` contains more than 50 elements.
Could you tell me how to fix it? The column should be added to the data frame, which is then written to a CSV file.
Thanks
Update:
Sentences \
0
1 who.c
2 citta.me.it
3 office.of
4 eurolet.eu
.. ...
995 uilpa.ie
996 fog.de
`original` and `country` come from:
list_country=np.array(country).tolist()
list_country_name=np.array(country_name).tolist()
flat_name_country = [item for sublist in list_country for item in sublist]
flat_country_name = [item for sublist in list_country_name for item in sublist]
zip_domains=list(zip(flat_name_country, flat_country_name))
Upvotes: 1
Views: 81
Reputation: 23099
First, let's make some dictionaries from your tuples and combine them:
country = {k.lower() : v for (k,v) in country}
og = {k : v for (k,v) in original}
country.update(og)
print(country)
{'uk': 'United Kingdom',
'it': 'Italy',
'de': 'Germany',
'h': 'Holland',
'f': 'France',
's': 'Spain',
'a': 'apartment',
'b': 'bungalow',
'c': 'church'}
Then let's split on the full stops and keep the element at the highest position, so that any earlier full stops in your text are ignored and only the final element is used. Finally, we use `.map` to associate your values.
df['value'] = df["Sentence"].str.split(".", expand=True).stack().reset_index(1).query(
"level_1 == level_1.max()"
)[0].map(country)
print(df)
Date Sentence value
0 28 Jan who.c church
1 30 Jan house.a apartment
2 02 Feb eurolet.it Italy
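One caveat: the `query` step compares against the highest split position in the whole column, so as far as I can tell rows with fewer full stops than the longest one (for example `who.c` next to `citta.me.it`) come out as NaN. A minimal sketch of a per-row alternative follows. It assumes a merged lower-case dictionary like `country` above; the sample frame and the `Unknown` fallback are only illustrative.

```python
import pandas as pd

# Illustrative data; the dictionary mirrors part of the merged `country` dict above.
df = pd.DataFrame({"Sentence": ["who.c", "house.a", "citta.me.it", "office.of"]})
country = {"uk": "United Kingdom", "it": "Italy", "a": "apartment", "c": "church"}

# Take the text after the last full stop in each row and lower-case it.
suffix = df["Sentence"].str.rsplit(".", n=1).str[-1].str.lower()

# Look the suffix up in the dictionary; unmatched suffixes become "Unknown".
df["value"] = suffix.map(country).fillna("Unknown")
print(df)
```

Because each row is split on its own, the number of full stops per row no longer matters. Something like `df.to_csv("output.csv", index=False)` would then write the result out; the file name is a placeholder.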
Upvotes: 0
Reputation: 4044
Can you convert your `original` and `country` into dicts?
original= [('a', 'apartment'), ('b', 'bungalow'), ('c', 'church')]
original = {x:y for x,y in original}
country = [('UK', 'United Kingdom'), ('IT', 'Italy'), ('DE', 'Germany'), ('H', 'Holland'), ..., ('F', 'France'), ('S', 'Spain')]
country = {x:y for x,y in country}
Now you can perform the same task as:
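# Look up the text after the last dot (not just the last character); country keys are upper-case.
df['Tp'] = df['Sentence'].apply(lambda sen: original.get(sen.rsplit('.', 1)[-1], country.get(sen.rsplit('.', 1)[-1].upper(), 'unknown')))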
In your code, `conditions` needs to have the same number of elements as `choices` (and, by extension, as `original` and `country`).
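To keep the `np.select` approach from the question, one option is to build one condition and one choice per (suffix, label) pair, so both lists have the same length by construction. Note also that `str.endswith` expects a string suffix rather than a list of tuples. A minimal sketch follows; the sample frame is illustrative, and matching on a leading dot plus a lower-cased suffix is an assumption about the data.

```python
import numpy as np
import pandas as pd

# Illustrative data and shortened lookup lists.
df = pd.DataFrame({"Sentence": ["who.c", "house.a", "eurolet.it", "office.of"]})
original = [("a", "apartment"), ("b", "bungalow"), ("c", "church")]
country = [("UK", "United Kingdom"), ("IT", "Italy"), ("DE", "Germany")]

pairs = original + country

# One boolean Series per suffix: does the sentence end with ".<suffix>"?
conditions = [df["Sentence"].str.endswith("." + key.lower()) for key, _ in pairs]

# The label to use when the condition at the same position is True.
choices = [label for _, label in pairs]

# The first matching condition wins; rows with no match get the default.
df["Tp"] = np.select(conditions, choices, default="Unknown")
print(df)
```

With 50 or more suffixes this builds one boolean Series per suffix, so the dictionary lookup is probably the lighter option, but both should give the same column.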
Upvotes: 1