Reputation: 329

Best way to change words into numbers using specific word list

I have a text file that contains one tweet per line, and the tweets need to be converted into a format suitable for machine learning. I'm using Python and basic Unix text manipulation (regex) for most of my string manipulation, and I'm getting the hang of sed, grep and Python's re module. This next problem, however, has me stumped, and I'm wondering if anyone could help me with it. I have tried a few Google searches, but tbh no luck :(

I always start with pseudocode to make it easier on myself, and this is what I want: "Replace -token1- OR -token2- OR -token3- OR -token4- with integer '1', replace all other words/tokens with integer '0'"

Let's say my list of words/tokens that need to become '1' is the following:

:)
cool
happy
fun

and my tweets look like this:

this has been a fun day :)
i find python cool! it makes me happy

The output of the new program/function would be:

0 0 0 0 1 0 1
0 0 0 1 0 0 0 1

NOTE1: Notice how 'cool' has a '!' behind it; it should still count as a match, although I can always remove all punctuation from the file first to make it easier.

NOTE2: All tweets will be lowercase; I already have a function that changes all the lines into lowercase.

Does anyone know how to do this using Unix tools (such as sed, grep, awk), or how to do it in Python? BTW, this is NOT homework; I'm working on a sentiment analysis program and am experimenting a bit.

Thanks! :)

Upvotes: 2

Answers (3)

Ro Yo Mi

Reputation: 15010

If you need this as an all-regex solution, have a look at my answer to "Changing lines of text into binary type pattern".
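The linked answer is not reproduced here. As a rough illustration only, a regex-driven version in Python could look like the sketch below. The names `TOKENS`, `_HIT` and `encode` are made up for this example, and the rule that a token may be wrapped in punctuation (so `cool!` counts) is an assumption based on NOTE1 in the question.

```python
import re

# Token list from the question (hypothetical variable name).
TOKENS = [":)", "cool", "happy", "fun"]

# Escape each token so that ":)" is treated literally; longest tokens first
# so a longer token is tried before any shorter one it starts with.
_alternation = "|".join(
    re.escape(t) for t in sorted(TOKENS, key=len, reverse=True)
)

# A word is a hit when it is a listed token, optionally wrapped in
# non-word characters such as "!" or ",".
_HIT = re.compile(r"^\W*(?:%s)\W*$" % _alternation)


def encode(line):
    """Replace every whitespace-delimited word with '1' or '0'."""
    return re.sub(
        r"\S+",
        lambda m: "1" if _HIT.match(m.group()) else "0",
        line,
    )


print(encode("this has been a fun day :)"))             # 0 0 0 0 1 0 1
print(encode("i find python cool! it makes me happy"))  # 0 0 0 1 0 0 0 1
```

Because the pattern is anchored with `^` and `$`, a word such as `uncool` is not counted as a hit.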

Upvotes: 0

jaypal singh

Reputation: 77145

In awk:

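awk '
# First file (lookup): store each token as an array key.
NR==FNR {
    a[$1]
    next
}

# Second file (tweets): print 1 for fields found in the lookup, 0 otherwise.
{
    gsub(/!/, "", $0)  # This will ignore `!`. Other rules can be added.
    for (i=1; i<=NF; i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets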

Test: (You'll probably need to alter the `gsub` line to handle special cases.)

[jaypal:~/Temp] cat lookup
:)
cool
happy
fun

[jaypal:~/Temp] cat tweets
this has been a fun day :)
i find python cool! it makes me happy

[jaypal:~/Temp] awk '
NR==FNR {
    a[$1]
    next
}

{
    gsub(/!/, "", $0)
    for (i=1; i<=NF; i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1

Upvotes: 1

Elazar

Reputation: 21635

from string import punctuation as pnc
tokens = {':)', 'cool', 'happy', 'fun'}
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']
for tweet in tweets:
    s = [(word in tokens or word.strip(pnc) in tokens) for word in tweet.split()]
    print(' '.join('1' if t else '0' for t in s))

Output:

0 0 0 0 1 0 1
0 0 0 1 0 0 0 1

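The `or` in the list comprehension is there to handle `:)`, as suggested by @EOL: stripping punctuation from `:)` leaves an empty string, so each word is also checked as-is.

There are still cases that will not be handled correctly, such as `cool :), I like it`, where the word `:),` matches neither as-is nor after stripping punctuation. The problem is inherent to the requirements.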
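One possible way to cover that particular case is to peel trailing punctuation off one character at a time and test each intermediate form. This is only a sketch: `is_hit` and `encode_tweet` are hypothetical helper names, and it deliberately handles trailing punctuation only, so something like `(:)` would still be missed.

```python
from string import punctuation

# Token list from the question.
TOKENS = {":)", "cool", "happy", "fun"}


def is_hit(word, tokens=TOKENS):
    """Return True if word is a token, allowing extra trailing punctuation."""
    candidate = word
    while candidate:
        # Same test as above: the word as-is, or with punctuation stripped.
        if candidate in tokens or candidate.strip(punctuation) in tokens:
            return True
        # Stop once the last character is no longer punctuation.
        if candidate[-1] not in punctuation:
            break
        # Drop one trailing punctuation character and try again,
        # so ":)," is retried as ":)".
        candidate = candidate[:-1]
    return False


def encode_tweet(tweet):
    """Map every whitespace-separated word of a tweet to '1' or '0'."""
    return " ".join("1" if is_hit(w) else "0" for w in tweet.split())


print(encode_tweet("cool :), i like it"))  # 1 1 0 0 0
```

The two example tweets from the question give the same output as before with this variant.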

Upvotes: 8
