Reputation: 141
Python beginner here. I'm stumped on part of this code for a bot I'm writing.
I am making a Reddit bot using PRAW to comb through posts and remove a specific set of characters (Steam CD keys).
I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/
This should have all the formats of keys.
Currently, my bot is able to find the post using a regular expression. I have these variables:
steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')
I am finding the text using this:
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        if re.search(steamKey15, submission.selftext, re.IGNORECASE):
            searchLogic()
            saveSteamKey()
So this is just to show that the things I should be using in a filter function are a combination of steamKey15/25/17 and submission.selftext.
So here is the part where I am confused. I can't find a function that works or does what I want. My goal is to remove all the text from submission.selftext (the body of the post) except the keys, which will eventually be saved in a .txt file.
Any advice on a good way to go about this? I've looked into re.sub and .translate but I don't understand how the parts fit together.
I am using Python 3.7 if it helps.
Upvotes: 0
Views: 308
Reputation: 223052
Can't you just get the regexp results?
m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
    print(m.group(0))
Also note that a dot (.) means any character in a regexp. If you want to match only literal dots, you should use an escaped dot (\.). You can probably write your regexp like this instead:
r'\w{5}[-.]\w{5}[-.]\w{5}'
This will match the key when its groups are separated by . or by -
Note that this will also match anything that begins or ends with a key, or has a key in the middle. That can cause you problems, as your 15-char key regexp is contained in the 25-char one! To fix that, use negative lookahead/negative lookbehind:
r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
That will only find the keys if there are no extraneous word characters, dots, or hyphens directly before or after them.
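To see what the lookbehind and lookahead change, here is a minimal sketch that runs both patterns over the same text. The sample string and its keys are made up for illustration; only the two patterns come from the answer above.

```python
import re

# Hypothetical sample text: one 15-character key and one 25-character key.
text = "Short: ABCDE-12345-FGHIJ and long: AAAAA-BBBBB-CCCCC-DDDDD-EEEEE"

plain = r'\w{5}[-.]\w{5}[-.]\w{5}'
guarded = r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'

# Without lookarounds, the first 15 characters of the long key also match.
plain_matches = re.findall(plain, text)
# With lookarounds, a match is rejected if a word character, dot, or hyphen
# sits directly before or after it, so the long key is skipped.
guarded_matches = re.findall(guarded, text)

print(plain_matches)    # ['ABCDE-12345-FGHIJ', 'AAAAA-BBBBB-CCCCC']
print(guarded_matches)  # ['ABCDE-12345-FGHIJ']
```

One side effect worth knowing: because . is in the excluded set, a key that ends a sentence (followed directly by a full stop) would also be rejected by the guarded pattern.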
Another hint is to use re.findall instead of re.search, since some posts contain more than one Steam key. findall will return all matches, while search only returns the first one.
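Putting these hints together, a rough sketch of pulling every key out of a post body might look like the following. The combined pattern, the extract_keys and save_keys helpers, and the sample body are all assumptions made for illustration; the three key formats are taken from the question, and in the bot the text would come from submission.selftext.

```python
import re

# Assumed combined pattern covering the three formats from the question:
# 25-character keys, 15-character keys, and the 15+2 format split by whitespace.
KEY_PATTERN = re.compile(
    r'(?<![\w.-])'
    r'(?:\w{5}(?:[-.]\w{5}){4}'   # XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
    r'|\w{5}(?:[-.]\w{5}){2}'     # XXXXX-XXXXX-XXXXX
    r'|\w{15}\s\w{2})'            # XXXXXXXXXXXXXXX XX
    r'(?![\w.-])'
)


def extract_keys(text):
    """Return every key-shaped substring found in text, in order of appearance."""
    return KEY_PATTERN.findall(text)


def save_keys(keys, path):
    """Append the keys to a text file, one per line (hypothetical helper)."""
    with open(path, 'a', encoding='utf-8') as handle:
        for key in keys:
            handle.write(key + '\n')


# Made-up post body standing in for submission.selftext.
body = (
    "Free keys:\n"
    "ABCDE-12345-FGHIJ\n"
    "AAAAA-BBBBB-CCCCC-DDDDD-EEEEE\n"
    "ABCDEFGHIJKLMNO PQ\n"
    "Enjoy!"
)
keys = extract_keys(body)
print(keys)
```

Since the pattern has no capturing groups, findall returns the full matched strings, so there is no need to strip the surrounding text with re.sub or .translate.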
Upvotes: 2
Reputation: 99
A couple of things first: . means any character in regex. I think you know that, but just to be sure. Also, \w\w\w\w\w can be replaced with \w{5}, which specifies 5 word characters (letters, digits, or underscore). I would use re.findall.
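import re

steamKey15 = r'(?:\w{5}.){2}\w{5}'
# Four groups each followed by a separator, then a final group,
# so the key still matches when nothing follows it
steamKey25 = r'(?:\w{5}.){4}\w{5}'
steamKey17 = r'\w{15}\s\w\w'

# reddit and steamKeyPostID are defined as in the question
subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        # findall returns every match in the post body as a list of strings
        finds_15 = re.findall(steamKey15, submission.selftext)
        finds_25 = re.findall(steamKey25, submission.selftext)
        finds_17 = re.findall(steamKey17, submission.selftext)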
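One detail to watch with the repeated-group form: a pattern that ends in a separator, like the steamKey25 in the question, needs one more character after the last group. The sketch below checks this on a made-up key; the variable names are only for illustration.

```python
import re

# Made-up 25-character key with nothing after it.
key = "AAAAA-BBBBB-CCCCC-DDDDD-EEEEE"

trailing = r'(?:\w{5}.){5}'        # five groups, each followed by one more character
exact = r'(?:\w{5}[-.]){4}\w{5}'   # four separated groups, then a final group

# The trailing form cannot match the bare key: no character follows EEEEE.
bare_result = re.fullmatch(trailing, key)
# The exact form matches the key on its own.
exact_result = re.fullmatch(exact, key)
# The trailing form does match once something follows, and it swallows that character.
padded_result = re.search(trailing, key + " ")

print(bare_result)              # None
print(exact_result.group(0))    # AAAAA-BBBBB-CCCCC-DDDDD-EEEEE
print(repr(padded_result.group(0)))
```

In practice this means a key at the very end of a post would be missed by the trailing form, and any match it does return carries an extra character that would need stripping.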
Upvotes: 1