Reputation: 21
I'm attempting to write a program in python that searches ~27,000 rows for each string in a list. Each string I am searching for is in one column, and has an 'id' value in another column that I would like printed if found. The code I currently have counts the number of times that string appears in the document, but I am still unable to find a way to return specific values for each unique row in which the strings are found.
import csv
fin = open('data.csv')
words = ["happy","sad","good","bad","sunny","rainy"]
found = {}
count = 0
for line in fin:
for word in words:
if word in line:
count = count + 1
found[word] = count
print(found)
The main semantic problem with the code above is that printing the 'found' dictionary only yields one of the results and its count from the 'words' list.
Upvotes: 2
Views: 17722
Reputation: 14179
For something like this, I think using the pandas
library will keep your sanity in check. Assuming a 15,000-row CSV file with two columns, String
and ID
.
In [1]: import pandas as pd
In [2]: words = ['happy','sad','good','bad','sunny','rainy']
In [3]: df = pd.read_csv('data.csv')
In [4]: df.head(5)
Out[4]:
Strings ID
0 happy 1
1 sad 2
2 happy 3
3 sad 4
4 good 5
In [5]: for word in words:
...: print '{} : {}'.format(word, df['Strings'].str.lower().str.contains(word).sum())
...:
happy : 2501
sad : 2500
good : 2500
bad : 2500
sunny : 2499
rainy : 2500
Alternatively, you can just create a pivot table and it will have similar results.
In [30]: df_pt = df.pivot_table(index='Strings',values='ID',aggfunc=len)
In [31]: df_pt
Out[31]:
Strings
bad 2500
good 2500
happy 2501
rainy 2500
sad 2500
sunny 2499
Name: ID, dtype: int64
If you need to get the IDs for each word, you can just simply select/index/filter the data:
In [6]: df_happy = df[df['Strings'] == 'happy']
In [7]: df_happy.head(5)
Out[7]:
Strings ID
0 happy 1
2 happy 3
12 happy 13
14 happy 15
18 happy 19
If you need it as a list, then:
In [8]: list_happy = df_happy['ID'].tolist()
In [9]: list_happy[:5]
Out[9]: [1, 3, 13, 15, 19]
I've truncated some parts, obviously, but the idea remains the same.
Upvotes: 2
Reputation: 148
You said that you would like to print the id of the row when you found a word. Assuming that you have a comma separated csv file with only two colums, this is how I would do it:
fin = open('data.csv')
words = ["happy","sad","good","bad","sunny","rainy"]
found = {}
for line in fin:
str1,id=line.split(',')
for w in words:
if w in str1:
print id
found[w]=found.get(w,0)+1
break
print(found)
If you have multiple columns in the file, you could instead do:
split_line=line.split(',')
str1=split_line[0] # Whatever columns
id=split_line[1] # Whatever columns
Upvotes: 1
Reputation: 12178
for line in fin: # loops over the lines of the file
for word in words: # loops over your word list
if word in line: # checks if current word is in line
count = count + 1 # increments global variable "count". Everytime a word in the list is found in the line. With no reset or save-away inbetween whatsoever. This is the number of times any word appears in any line at the end.
found[word] = count # assigns - after all words are looped over - current "count" value to found[current_word]
so what you are doing is assigning an arbitrary value to a dictionary key which happens to be the last word you check for in every iteration.
Does not seem very useful to me. I guess you intended to do something like:
from collections import defaultdict
found = defaultdict(int)
for line in fin:
for word in words:
if word in line:
found[word] += 1
Upvotes: 0