Nicolas Blackburn

Reputation: 1

How to count occurrences of strings from one csv file in another, larger csv file?

I have a very large csv file that contains project descriptions; let's call it CSV A. The project description texts are under column headers named 'L0200_0', 'L0240_0', 'L0242_0', etc. I have a list of keywords stored in another csv file; let's call it CSV B. CSV B looks something like this:

artificial intelligence, natural language processing, research & development, machine learning, ...

I'd like to search through the columns in question in CSV A and get a count for every string I have in CSV B.

I know I can match strings by doing something like this:

import csv
search_for = ['artificial intelligence', 'natural language processing', 'research & development', 'machine learning']
with open('in.csv') as inf, open('out.csv','w') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        if row[0] in search_for:
            print('Found: {}'.format(row))
            writer.writerow(row)

However, I have a lot of keywords, and instead of listing them individually in my code, I'd rather store them in a csv file (B) and just search directly from that csv file (B) in my large csv file (A).
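For what it's worth, since CSV B is a single comma-separated line, loading it with the `csv` module already used above is straightforward. A minimal sketch (the file name `keywords.csv` is just an example; the demo writes a sample file first so it runs end to end):

```python
import csv

# Example keyword file (CSV B): a single comma-separated line
with open('keywords.csv', 'w', newline='') as f:
    f.write('artificial intelligence, natural language processing, '
            'research & development, machine learning\n')

# Load the keywords instead of hardcoding them in the script
with open('keywords.csv', newline='') as f:
    search_for = [term.strip() for term in next(csv.reader(f))]

print(search_for)
# ['artificial intelligence', 'natural language processing', 'research & development', 'machine learning']
```

The `strip()` handles the spaces after each comma; everything else in the original loop can stay the same.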

Upvotes: 0

Views: 79

Answers (1)

coradek

Reputation: 517

This certainly sounds like a job for a pandas dataframe. But first, it sounds as though CSV_A may be set up the following way:

'L02_A', 'L02_B', 'L02_C'
description for L02_A artificial intelligence, description for L02_B natural language processing, description for L02_C research & development machine learning research & development

If so, you will want to turn it the other way around (transpose it) so that the descriptions are in a column, and then give that column a name. If this is not the case, skip the transpose and rename steps.

import pandas as pd
import re

df = pd.read_csv("path/to/my.csv")
df = df.transpose()
df = df.rename({0:"description"}, axis=1)
output:
        description
'L02_A' description for L02_A artificial intelligence
'L02_B' description for L02_B natural language processing
'L02_C' description for L02_C research & development machine learning research & development

You can certainly read your search terms in from a one-line csv, but I prefer to store search terms on separate lines so that I can load them with the following code.

search_terms = [term.strip() for term in open("path/to/search_terms.txt", 'r')]

The simplest way to get the counts is to first find all the keywords, and then find the length of that list.


re_pattern = "|".join([re.escape(term) for term in search_terms])
df["search_terms_found"] = df["description"].str.findall(re_pattern)
df["num_terms_found"] = df["search_terms_found"].str.len() # in pandas str.len works on lists as well as strings
df
output:
        description                                         search_terms_found                    num_terms_found
'L02_A' description for L02_A artificial intelligence       [artificial intelligence]                           1
'L02_B' description for L02_B natural language processing   [natural language processing]                       1
'L02_C' description for L02_C research & development r...   [research & development, research & developmen...   3
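The question asks for "a count for every string" in CSV B, so if you want a total per keyword across all rows (rather than the per-row counts above), one possible follow-up is to flatten the `search_terms_found` lists into a `Counter`. A sketch using the same example data (the dataframe here is rebuilt inline so the snippet stands alone):

```python
import re
from collections import Counter

import pandas as pd

# Example data standing in for the transposed dataframe above
df = pd.DataFrame({"description": [
    "description for L02_A artificial intelligence",
    "description for L02_B natural language processing",
    "description for L02_C research & development machine learning research & development",
]})

search_terms = ["artificial intelligence", "natural language processing",
                "research & development", "machine learning"]

re_pattern = "|".join(re.escape(term) for term in search_terms)
df["search_terms_found"] = df["description"].str.findall(re_pattern)

# Flatten the per-row lists and tally each keyword across the whole file
totals = Counter(term for terms in df["search_terms_found"] for term in terms)
print(totals["research & development"])  # 2
```

`totals` then maps each keyword to its overall occurrence count, including keywords that appear more than once in a single description.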

One note: if you have a very long list of search terms, an Aho-Corasick trie will be faster than regex. I use the noaho package (pip install noaho), which makes it easy to find all non-overlapping keywords.

from noaho import NoAho
trie = NoAho()
for term in search_terms:
    trie.add(term, term)

def noaho_find(text):
    # findall_long yields (start, end, payload) tuples; keep the matched term
    return [payload for _start, _end, payload in trie.findall_long(text)]

df["search_terms_found"] = df.apply(lambda xx: noaho_find(xx["description"]), axis=1)

Upvotes: 0
