Reputation: 1
I have a very large csv file that contains project descriptions; let's call it CSV A. The project description texts are under column headers named 'L0200_0', 'L0240_0', 'L0242_0', etc. I have a list of keywords stored in another csv file, let's call it CSV B. CSV B looks something like this
artificial intelligence, natural language processing, research & development, machine learning, ...
I'd like to search through the columns in question in CSV A and get a count for every string I have in CSV B.
I know I can get a count for strings by doing something similar to this.
import csv
search_for = ['artificial intelligence', 'natural language processing', 'research & development', 'machine learning']
with open('in.csv') as inf, open('out.csv', 'w') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        if row[0] in search_for:
            print('Found: {}'.format(row))
            writer.writerow(row)
However, I have a lot of keywords, and rather than listing them individually in my code, I'd like to store them in a csv file (B) and search my large csv file (A) against that list directly.
Upvotes: 0
Views: 79
Reputation: 517
This certainly sounds like a job for a pandas dataframe. But first, it sounds as though CSV A may be set up the following way:
'L02_A', 'L02_B', 'L02_C'
description for L02_A artificial intelligence, description for L02_B natural language processing, description for L02_C research & development machine learning research & development
If so, you will want to turn it the other way around (transpose it) so that the descriptions are in a column, and then give that column a name. If this is not the case, skip the transpose and rename steps.
import pandas as pd
import re
df = pd.read_csv("path/to/my.csv")
df = df.transpose()
df = df.rename({0:"description"}, axis=1)
output:
description
'L02_A' description for L02_A artificial intelligence
'L02_B' description for L02_B natural language processing
'L02_C' description for L02_C research & development machine learning research & development
You can certainly read your search terms in from a one-line csv, but I prefer to store search terms on separate lines so that I can load them with the following code.
with open("path/to/search_terms.txt") as f:
    search_terms = [term.strip() for term in f]
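If you would rather pull the terms straight from your one-line CSV B, a minimal sketch with the csv module would look like this (the file name keywords.csv is an assumption):
import csv
# Assumed file name; CSV B is one comma-separated line of keywords
with open("path/to/keywords.csv", newline="") as f:
    search_terms = [term.strip() for row in csv.reader(f) for term in row]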
The simplest way to get the counts is to first find all the keywords, and then find the length of that list.
re_pattern = "|".join([re.escape(term) for term in search_terms])
df["search_terms_found"] = df["description"].str.findall(re_pattern)
df["num_terms_found"] = df["search_terms_found"].str.len() # in pandas str.len works on lists as well as strings
df
output:
description search_terms_found num_terms_found
'L02_A' description for L02_A artificial intelligence [artificial intelligence] 1
'L02_B' description for L02_B natural language processing [natural language processing] 1
'L02_C' description for L02_C research & development r... [research & development, research & developmen... 3
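Note that num_terms_found is a total per description. Since the original question asks for a count for every string in CSV B, one way to get per-keyword totals across all descriptions is to explode the lists of matches and count them (a sketch; Series.explode requires pandas 0.25+):
# Per-keyword totals across all descriptions; reindex so keywords
# that never matched still appear with a count of 0
keyword_counts = (
    df["search_terms_found"]
    .explode()
    .value_counts()
    .reindex(search_terms, fill_value=0)
)
print(keyword_counts)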
One note: if you have a very long list of search terms, an Aho-Corasick trie will be faster than regex.
I use the noaho package (pip install noaho), which makes it easy to find all non-overlapping keywords.
from noaho import NoAho

trie = NoAho()
for term in search_terms:
    trie.add(term, term)

def noaho_find(text):
    # findall_long yields (start, end, payload) tuples; keep only the matched term
    return [payload for start, end, payload in trie.findall_long(text)]

df["search_terms_found"] = df.apply(lambda xx: noaho_find(xx["description"]), axis=1)
Upvotes: 0