Reputation: 833
I am following this post to create a number of columns that are true/false based on if a substring is present in another column.
Prior to using the code in the above post, I look at a field called LANGUAGES
which has values such as "ENG, SPA, CZE"
or "ENG, SPA"
. Unfortunately, the data is a comma-delimited string instead of a list but no problem, in one line, I can get the list of 25 unique values.
Once I get the list of unique values, I want to make a new column for each value so a df[ENG]
, df[SPA]
, etc. columns. I want these columns to be true/false based on if the header is a substring of the original LANGUAGES column.
Following the post, I use df.apply(lambda x: language in df.LANGUAGES, axis = 1)
. However, when I check the values of the columns (the value counts in the last for loop), all the values come up false.
How can I create a true/false column based on the column's header being a substring of another column?
My Code:
import json
import pandas as pd
import requests
url = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="
response = requests.get(url)
if response.status_code == 200:
res = response.json()
df = pd.DataFrame(res)
df.columns = [str(h).upper() for h in list(df)]
#
# the below line is confusing but it creates a sorted list of all unique languages
#
languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
for language in languages:
print(language)
df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)
for language in languages:
print(df[language].value_counts())
print("\n")
else:
print("\nConnection was unsuccesful: {0}".format(response.status_code))
Edit: There was a request for a raw data input and expected output. Here is what a column looks like:
+-------+-----------------+
| Index | LANGUAGES |
+-------+-----------------+
| 0 | 'ENG, OTH, RUS' |
| 1 | 'ENG' |
| 2 | 'ENG, CZE, SPA' |
+-------+-----------------+
This is the expected output:
+-------+-----------------+------+-------+-------+-------+-------+
| Index | LANGUAGES | ENG | CZE | OTH | RUS | SPA |
+-------+-----------------+------+-------+-------+-------+-------+
| 0 | 'ENG, OTH, RUS' | TRUE | FALSE | TRUE | TRUE | FALSE |
| 1 | 'ENG' | TRUE | FALSE | FALSE | FALSE | FALSE |
| 2 | 'ENG, CZE, SPA' | TRUE | TRUE | FALSE | TRUE | FALSE |
+-------+-----------------+------+-------+-------+-------+-------+
Upvotes: 1
Views: 988
Reputation: 833
Found in this post, I swapped out the following line of code:
df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)
for the following two lines:
criteria = lambda row : language in row["LANGUAGES"]
df[language] = df.apply(criteria, axis =1)
And it works.
import json
import pandas as pd
import requests
url = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="
response = requests.get(url)
if response.status_code == 200:
res = response.json()
df = pd.DataFrame(res)
df.columns = [str(h).upper() for h in list(df)]
#
# the below line is confusing but it creates a sorted list of all unique languages
#
languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
for language in languages:
criteria = lambda row : language in row["LANGUAGES"]
df[language] = df.apply(criteria, axis =1)
for language in languages:
print(df[language].value_counts())
print("\n")
else:
print("\nConnection was unsuccesful: {0}".format(response.status_code))
This line swap could also work:
for language in languages:
df[language] = df.LANGUAGES.apply(lambda x: 'True' if language in x else 'False')
print("{}:{}".format(language, df[df[language] == 'True'].shape[0]))
Upvotes: 0
Reputation: 23099
Two steps,
first, we explode your list and create a pivot table to re-concat to your original df based on the index.
s = df['LANGUAGES'].str.replace("'",'').str.split(',').explode().to_frame()
cols = s['LANGUAGES'].drop_duplicates(keep='first').tolist()
df2 = pd.concat([df, pd.crosstab(s.index, s["LANGUAGES"])[cols]], axis=1).replace(
{1: True, 0: False}
)
print(df2)
LANGUAGES ENG OTH RUS CZE SPA
0 'ENG, OTH, RUS' True True True False False
1 'ENG' True False False False False
2 'ENG, CZE, SPA' True False False True True
Upvotes: 2