Malte Susen

Reputation: 845

Compare words and return Pandas DataFrame entry

I want to write a simple function that checks whether words from a wordlist appear in a Pandas DataFrame common_words. On a match, I would like to return the corresponding DataFrame entry. The DataFrame has the format life balance 14, long term 9, upper management 9, i.e. each word token followed by its occurrence count.

However, the code below currently prints only the search term from the wordlist (e.g. life balance), not the DataFrame entry that includes the occurrence count. I therefore need a way to return word rather than the wordlist element. Where is my error in reasoning?

The relevant code section is:

    # Check for matches between wordlist and Pandas dataframe
    def wordcheck():
        wordlist = ["work balance", "good management", "work life"]
        for x in wordlist:
            if df[i].str.contains(x).any():
                print('Group 1:', x)
    wordcheck()

The full code segment looks as follows:

import json

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Loading and normalising the input file
with open("glassdoor_A.json", "r") as file:
    data = json.load(file)
df = pd.json_normalize(data)


# Datetime conversion
df['Date'] = pd.to_datetime(df['Date'])
# Adding of 'Quarter' column
df['Quarter'] = df['Date'].dt.to_period('Q')


# Word frequency analysis
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]


# Analysis loops through different qualitative sections
for i in ['Text_Pro','Text_Con','Text_Main']:
    common_words = get_top_n_bigram(df[i], 500)
    for word, freq in common_words:
        print(word, freq)


    # Check for matches between wordlist and Pandas dataframe
    def wordcheck():
        wordlist = ["work balance", "good management", "work life"]
        for x in wordlist:
            if df[i].str.contains(x).any():
                print('Group 1:', x)
    wordcheck()

Upvotes: 0

Views: 115

Answers (1)

tom davison

Reputation: 110

I may be misunderstanding, but is this because you are only printing the searched term? So would something similar to the below work better?

# Check for matches between wordlist and Pandas dataframe
def wordcheck():
    wordlist = ["work balance", "good management", "work life"]
    for x in wordlist:
        # Index with the boolean mask itself; .any() would collapse it to a single bool
        print('Group 1:', df[i][df[i].str.contains(x)])
wordcheck()
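
If the goal is the bigram together with its count, it may help to search common_words rather than df[i]. In the code shown in the question, common_words is the list of (bigram, count) tuples returned by get_top_n_bigram, not a DataFrame. A minimal sketch under that assumption follows; find_matches is a hypothetical helper name, the sample values are the ones quoted in the question, and "life balance" is added to the wordlist only so that the sample produces a match.

```python
# Sample data in the shape returned by get_top_n_bigram: (bigram, count) tuples.
# The values are the ones quoted in the question.
common_words = [("life balance", 14), ("long term", 9), ("upper management", 9)]


def find_matches(wordlist, common_words):
    """Return the (bigram, count) entries whose bigram appears in wordlist."""
    counts = dict(common_words)  # bigram -> count, for direct lookup
    return [(term, counts[term]) for term in wordlist if term in counts]


# "life balance" is an assumption added here so the sample yields a match.
wordlist = ["work balance", "good management", "life balance"]
matches = find_matches(wordlist, common_words)
for word, freq in matches:
    print("Group 1:", word, freq)  # prints: Group 1: life balance 14
```

This matches whole bigrams exactly. If partial matches are wanted, the membership test could be swapped for a substring check over the bigrams.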

Upvotes: 1
