Reputation: 681
I have a JSON file filled with user comments (from web scraping), which I've pulled into Python with pandas:
import pandas as pd
data = pd.DataFrame(pd.read_json(filename, orient=columnName,encoding="utf-8"),columns=columnName)
data['full_text'] = data['full_text'].replace('^@ABC(\\u2019s)*[ ,\n]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2019)', "'", regex=True)
data.to_json('new_abc_short.json',orient='records')
The messages don't completely match the respective messages online: emojis are shown as \u0234 or something, apostrophes as \u2019, and forward slashes in links and quote marks have a backslash in front of them.
I want to clean them up, so I learnt some regex (https://docs.python.org/3/howto/regex.html) so that I can pull the messages into Python, clean them up and then resave them to JSON under a different name (for now).
The second line removes the Twitter handle (only if it is at the beginning), then removes the 's if it was used (e.g. @ABC's). If the handle is not at the beginning (maybe used in the middle of the message), it is kept. It then removes any spaces and commas left behind (again only at the beginning of the string), e.g. "@ABC, hi there" becomes "hi there", "hi there @ABC" stays the same, and "@ABC's twitter is big" becomes "twitter is big".
The third line replaces every apostrophe that could not be shown (e.g. don\u2019t changes back to don't).
I have thousands of records (not all of them have issues with apostrophes, quotes, links etc.), and based on the very small examples I've looked at, these two seem to work.
But my third replacement doesn't work:
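As a quick check of the two working replacements, here is a minimal sketch that applies the same patterns with re.sub instead of the pandas replace. The helper names strip_handle and fix_apostrophe are made up for illustration, and the sample strings are the ones described above.

```python
import re

# Same patterns as in the pandas calls, written as raw strings.
# \u2019 is the curly apostrophe; re understands this escape itself.
HANDLE = re.compile(r'^@ABC(\u2019s)*[ ,\n]*')
CURLY_APOSTROPHE = re.compile(r'\u2019')


def strip_handle(text):
    """Remove a leading @ABC (optionally followed by 's) plus spaces/commas."""
    return HANDLE.sub('', text)


def fix_apostrophe(text):
    """Replace the curly apostrophe with a plain one."""
    return CURLY_APOSTROPHE.sub("'", text)


print(strip_handle("@ABC, hi there"))
print(strip_handle("hi there @ABC"))
print(strip_handle("@ABC\u2019s twitter is big"))
print(fix_apostrophe("don\u2019t"))
```

Note that the handle pattern only strips the curly form of 's; a plain ASCII apostrophe after @ABC would be left in place.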
data['full_text'] = data['full_text'].replace('\\"', '"', regex=True)
Example message in the json: "full_text":"@ABC How can you \"accidentally close\" my account"
I want to remove the \ next to the double quotes so it looks like the real message (I assume it is an escape character, which the user obviously didn't type),
but no matter what I do, I can't remove it.
From my regex learning, " isn't a metacharacter, so the backslash shouldn't even be there. But anyway, I've tried:
\\" (which I think should be the obvious one: I have \", there is no special quirk in " but there is in \, so I need another backslash to escape that)
\\\\" (some forum posts online mention needing 4 slashes)
\\\" (I think someone mentioned in the forum posts that they got it working with 3)
\\\(\") (i know that brackets provide groupings so i tried different combinations)
(\\\\")
I encased all of the above expressions in single quotes, and they didn't work. I thought maybe the double quote was the problem since I only had one, so I replaced the single quotes with triple single quotes:
'''\\"'''
But none of the above worked with triple single quotes either.
I keep rechecking the newly saved JSON and I keep seeing:
"full_text":"How can you \"accidentally close\" my account" (i.e. removing @ABC with the space worked, but not the backslash bit)
Originally, I tried looking into fixing these Unicode issues directly (i.e. using encoding="utf-8"), but my experience with this is limited and it kept failing, so regex is my best option.
Upvotes: 0
Views: 186
Reputation: 2892
Ow, I missed the pandas hint, so pandas replace does use regexes. But, to be clear, str.replace doesn't work with regexes. re.sub does.
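Now, "\\" is a Python string containing one backslash, while "\\\\" and r'\\' both contain two backslashes, which is what a regex needs in order to match one literal backslash.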
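To make the difference between these three literals concrete, a small sketch (the variable names and the sample string are only for illustration):

```python
import re

one = "\\"      # Python string holding ONE backslash character
two = "\\\\"    # Python string holding TWO backslash characters
raw = r'\\'     # raw string: also two backslash characters

# As a regex, two backslashes mean "match one literal backslash".
sample = 'say \\"hi\\"'             # the text: say \"hi\"
stripped = re.sub(two, '', sample)  # removes every backslash in sample

print(len(one), len(two), two == raw)
print(stripped)
```

A single backslash on its own (the first literal) is not a valid regex, which is why the doubled form is needed inside the pattern.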
If your string really contains a \ preceding a ", a regex that would do is:
\\(?=\")
which does a lookahead for your " (look at regex101).
You would have to use something like:
re.sub(r'\\(?=\")',"",s,0)
or a pandas equivalent using that regex.
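A minimal sketch of both cases, using the example message from the question. It rests on an assumption worth checking: the \" seen in the saved file may simply be how JSON writes a double quote inside a string, in which case the decoded text holds no backslash at all and there is nothing for the regex to remove. The variable names are hypothetical.

```python
import json
import re

# The example record as it appears in the JSON file (raw string keeps the \").
record = r'{"full_text":"@ABC How can you \"accidentally close\" my account"}'

# After decoding, the JSON escape is gone: the text holds plain quotes only.
decoded = json.loads(record)["full_text"]
has_backslash = "\\" in decoded
print(decoded, has_backslash)

# If a string really does contain a backslash before a quote,
# the lookahead regex removes just that backslash.
s = 'How can you \\"accidentally close\\" my account'
cleaned = re.sub(r'\\(?=")', '', s)
print(cleaned)
```

With pandas, the same pattern should be usable as data['full_text'].replace(r'\\(?=")', '', regex=True); writing the result back with to_json will still show \" in the file, because that escaping is part of the JSON format.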
Upvotes: 1