adin
adin

Reputation: 833

pandas - create true/false column if column header is substring of another column

I am following this post to create a number of columns that are true/false based on if a substring is present in another column.

Prior to using the code in the above post, I look at a field called LANGUAGES which has values such as "ENG, SPA, CZE" or "ENG, SPA". Unfortunately, the data is a comma-delimited string instead of a list but no problem, in one line, I can get the list of 25 unique values.

Once I get the list of unique values, I want to make a new column for each value so a df[ENG], df[SPA], etc. columns. I want these columns to be true/false based on if the header is a substring of the original LANGUAGES column.

Following the post, I use df.apply(lambda x: language in df.LANGUAGES, axis = 1). However, when I check the values of the columns (the value counts in the last for loop), all the values come up false.

How can I create a true/false column based on the column's header being a substring of another column?

My Code:

import json
import pandas as pd
import requests

url  = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="

response = requests.get(url)

if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        print(language)
        df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccesful: {0}".format(response.status_code))

Edit: There was a request for a raw data input and expected output. Here is what a column looks like:

+-------+-----------------+
| Index |    LANGUAGES    |
+-------+-----------------+
|     0 | 'ENG, OTH, RUS' |
|     1 | 'ENG'           |
|     2 | 'ENG, CZE, SPA' |
+-------+-----------------+

This is the expected output:

+-------+-----------------+------+-------+-------+-------+-------+
| Index |    LANGUAGES    | ENG  |  CZE  |  OTH  |  RUS  |  SPA  |
+-------+-----------------+------+-------+-------+-------+-------+
|     0 | 'ENG, OTH, RUS' | TRUE | FALSE | TRUE  | TRUE  | FALSE |
|     1 | 'ENG'           | TRUE | FALSE | FALSE | FALSE | FALSE |
|     2 | 'ENG, CZE, SPA' | TRUE | TRUE  | FALSE | TRUE  | FALSE |
+-------+-----------------+------+-------+-------+-------+-------+

Upvotes: 1

Views: 988

Answers (2)

adin
adin

Reputation: 833

Found in this post, I swapped out the following line of code:

df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)

for the following two lines:

    criteria = lambda row : language in row["LANGUAGES"]
    df[language] = df.apply(criteria, axis =1)

And it works.

import json
import pandas as pd
import requests

url  = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="

response = requests.get(url)

if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        criteria = lambda row : language in row["LANGUAGES"]
        df[language] = df.apply(criteria, axis =1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccesful: {0}".format(response.status_code))

This line swap could also work:

for language in languages:
    df[language] = df.LANGUAGES.apply(lambda x: 'True' if language in x else 'False')
    print("{}:{}".format(language, df[df[language] == 'True'].shape[0]))

Upvotes: 0

Umar.H
Umar.H

Reputation: 23099

Two steps,

first, we explode your list and create a pivot table to re-concat to your original df based on the index.

s  = df['LANGUAGES'].str.replace("'",'').str.split(',').explode().to_frame()

cols = s['LANGUAGES'].drop_duplicates(keep='first').tolist()

df2 = pd.concat([df, pd.crosstab(s.index, s["LANGUAGES"])[cols]], axis=1).replace(
    {1: True, 0: False}
)
print(df2)
         LANGUAGES   ENG    OTH    RUS    CZE    SPA
0  'ENG, OTH, RUS'  True   True   True  False  False
1            'ENG'  True  False  False  False  False
2  'ENG, CZE, SPA'  True  False  False   True   True

Upvotes: 2

Related Questions