Reputation: 329
I have a text file with one tweet per line that needs to be converted into a format suitable for machine learning. I'm using Python and basic Unix text tools (regex) for most of my string manipulation, and I'm getting the hang of sed, grep and Python's re module. This next problem, however, has me stumped, and I'm wondering if anyone could help. I have tried a few Google searches, but to be honest, no luck :(
I always start with pseudocode to make it easier for myself, and this is what I want: "Replace -token1- OR -token2- OR -token3- OR -token4- with the integer '1', and replace all other words/tokens with the integer '0'."
Let's say the list of words/tokens that need to become '1' is the following:
and my tweets look like this:
The output of the new program/function would be:
NOTE1: Notice that 'cool' has a '!' after it; it should still be matched, although I can always remove all punctuation from the file first to make this easier.
NOTE2: All tweets will be lowercase; I already have a function that converts every line to lowercase.
Does anyone know how to do this with Unix regex tools (such as sed, grep, awk), or how to do it in Python? By the way, this is NOT homework; I'm working on a sentiment analysis program and am experimenting a bit.
Thanks! :)
Upvotes: 2
Views: 327
Reputation: 15010
If you need this as an all-regex solution, have a look at my answer to "Changing lines of text into binary type pattern".
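For readers who want to stay in Python, here is a minimal sketch of what a regex-only approach could look like. It is not the code from the linked answer; the token list is taken from the example data used elsewhere in this thread, and the names `TOKENS`, `TOKEN_RE` and `to_binary` are made up for illustration. It assumes a token may be followed by any number of `!` characters and that words are separated by whitespace.

```python
import re

# Assumed token list, copied from the example data in this thread.
TOKENS = [':)', 'cool', 'happy', 'fun']

# Alternation of the escaped tokens, optionally followed by trailing "!" marks.
TOKEN_RE = re.compile(r'(?:%s)!*' % '|'.join(re.escape(t) for t in TOKENS))

def to_binary(tweet):
    """Replace each whitespace-separated word with '1' (token) or '0' (anything else)."""
    return re.sub(r'\S+',
                  lambda m: '1' if TOKEN_RE.fullmatch(m.group()) else '0',
                  tweet)

print(to_binary('this has been a fun day :)'))
print(to_binary('i find python cool! it makes me happy'))
```

Because only the non-whitespace runs are substituted, the original spacing between words is preserved in the output.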
Upvotes: 0
Reputation: 77145
In awk:
awk '
NR==FNR {
    a[$1];
    next
}
{
    gsub(/!/, "", $0) # This will ignore `!`. Other rules can be added.
    for (i=1;i<=NF;i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets
(More rules can be added to the gsub line to handle special cases.)
[jaypal:~/Temp] cat lookup
:)
cool
happy
fun
[jaypal:~/Temp] cat tweets
this has been a fun day :)
i find python cool! it makes me happy
[jaypal:~/Temp] awk '
NR==FNR {
    a[$1];
    next
}
{
    gsub(/!/, "", $0)
    for (i=1;i<=NF;i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets
}' lookup tweets
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1
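For comparison, the same two-pass idea (load the lookup list first, then encode each tweet) can be sketched in Python. This is only a rough equivalent under a few assumptions: `load_tokens` and `encode` are hypothetical helper names, the data is inlined as lists instead of being read from the `lookup` and `tweets` files, and the output has no trailing space, unlike the awk version.

```python
def load_tokens(lines):
    """Collect the first whitespace-separated field of each non-empty line (like a[$1])."""
    return {line.split()[0] for line in lines if line.strip()}

def encode(line, tokens):
    """Drop every '!' (as the gsub call does), then map each field to '1' or '0'."""
    words = line.replace('!', '').split()
    return ' '.join('1' if word in tokens else '0' for word in words)

# Inlined stand-ins for the contents of the lookup and tweets files.
lookup = [':)', 'cool', 'happy', 'fun']
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']

tokens = load_tokens(lookup)
for tweet in tweets:
    print(encode(tweet, tokens))
```

Both helpers accept any iterable of lines or a single line of text, so passing lines read from open file objects should work the same way as the in-memory lists.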
Upvotes: 1
Reputation: 21635
from string import punctuation as pnc
tokens = {':)', 'cool', 'happy', 'fun'}
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']
for tweet in tweets:
    s = [(word in tokens or word.strip(pnc) in tokens) for word in tweet.split()]
    print(' '.join('1' if t else '0' for t in s))
Output:
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1
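The `or` in the list comprehension is there to handle :), as suggested by @EOL: stripping punctuation from :) leaves an empty string, so the word has to be checked verbatim first.
There are still cases that will not be handled correctly, such as "cool :), I like it" (the word ":)," is not a token, and stripping its punctuation leaves nothing). The problem is inherent to the requirements.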
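One possible way to narrow that gap, offered only as a sketch: peel trailing punctuation off a word one character at a time and test after each step, so that ":)," is reduced to ":)" before it is reduced to nothing. The helper names `is_token` and `encode` are hypothetical, and this variant only looks at trailing punctuation, so leading punctuation (which `strip` would remove) and emoticons glued to a word, such as "fun:)", are still not handled.

```python
from string import punctuation

TOKENS = {':)', 'cool', 'happy', 'fun'}  # example tokens from this answer

def is_token(word, tokens=TOKENS):
    """True if word is a token after dropping zero or more trailing punctuation characters."""
    while word:
        if word in tokens:
            return True
        if word[-1] not in punctuation:
            return False
        word = word[:-1]  # peel one trailing punctuation character and retry
    return False

def encode(tweet, tokens=TOKENS):
    """Map each whitespace-separated word to '1' or '0'."""
    return ' '.join('1' if is_token(w, tokens) else '0' for w in tweet.split())

print(encode('cool :), i like it'))
print(encode('i find python cool! it makes me happy'))
```

Checking the word before each peel is what keeps ":)" intact: the emoticon consists entirely of punctuation, so stripping everything first would erase it.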
Upvotes: 8