I'm pulling data from a website through an API and sorting it into pandas dataframes. One of my dataframes has a column whose values sometimes contain non-standard characters, which won't pass through to MySQL using .to_sql. I'd like to drop any rows that contain non-standard characters, or find a way to eliminate the troublesome characters while keeping the rest of the value.
A sample of my dataframe looks like this....
╔══════╦══════╦══════════════════════════════════════════════════════════════╦═══════╦══════╗
║ ColA ║ ColB ║ ColC ║ ColD ║ ColE ║
╠══════╬══════╬══════════════════════════════════════════════════════════════╬═══════╬══════╣
║ 1111 ║ MHA ║ / ║ 43920 ║ 589 ║
╠══════╬══════╬══════════════════════════════════════════════════════════════╬═══════╬══════╣
║ 1111 ║ MHA ║ /nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/ ║ 42 ║ 3 ║
╠══════╬══════╬══════════════════════════════════════════════════════════════╬═══════╬══════╣
║ 1111 ║ MHA ║ /ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1¯/_(ツ)_/¯/ ║ 8 ║ 1 ║
╚══════╩══════╩══════════════════════════════════════════════════════════════╩═══════╩══════╝
You can see the line with /ud-i/ has odd characters at the end of it (¯/_(ツ)_/¯/), which I can't seem to eliminate. I've tried this, which has worked for me in the past...
df = df[df.ColC !='/ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1¯/_(ツ)_/¯/']
But I can't seem to get it to work now.
As a note, /ud-i/ and /nw-i/ type rows come up a lot in the dataframe, 99.9% of which are fine, so I'm really just looking to exclude anything that contains non-standard characters, rather than anything that starts with /ud-i/ or something along those lines. I also don't want to filter for these specific characters (if I don't have to), since other odd ones may show up during different pulls.
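To make the goal concrete, here's the kind of generic filter I have in mind, sketched against a hand-typed copy of the table above (the regex just matches any character outside the ASCII range; I'm not sure it's the right approach):

```python
import pandas as pd

# Hand-typed toy frame reproducing the problem row from the table above.
df = pd.DataFrame({
    "ColC": [
        "/",
        "/nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/",
        "/ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1\u00af/_(\u30c4)_/\u00af/",
    ]
})

# Keep only rows whose ColC contains no character outside the ASCII range,
# whatever the specific odd characters happen to be in a given pull.
clean = df[~df["ColC"].str.contains(r"[^\x00-\x7f]")]
print(clean)
```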
Any help on this would be much appreciated! Thank you!
EDIT: Additional Code
gaData = []
gaDataOne = []

def getGAD(view_ID, startDate, endDate, nextPageToken):
    l = view_id_start + str(int(view_ID))
    response = analytics.reports().batchGet(
        body={
            'reportRequests': [
                {
                    'viewId': l,
                    'dateRanges': [{'startDate': startDate, 'endDate': endDate}],
                    'metrics': [
                        {'expression': 'ga:ColD'}
                    ],
                    'dimensions': [
                        {'name': 'ga:ColC'}
                    ],
                    'pageToken': nextPageToken,
                    'pageSize': '10000'
                }]}).execute()
    report_list = response.get('reports')
    for report in report_list:
        data_rows = report.get('data', {}).get('rows', [])
        for row in data_rows:
            dimensions_in_row = row.get('dimensions')
            metrics_rows = row.get('metrics')
            for metrics in metrics_rows:
                metrics_values = metrics.get('values')
                full_row_data = dimensions_in_row + metrics_values
                gaData.append(full_row_data)
    gaDataNew = []
    for i in gaData:
        new_tuple = []
        new_tuple.append(i[0])
        new_tuple.append(int(i[1]))
        gaDataNew.append(tuple(new_tuple))
    col = ['ColC', 'ColD']
    dfX = pd.DataFrame(gaDataNew, columns=col)
    print(dfX)
    return dfX
def getGADOne(view_ID, startDate, endDate):
    l = view_id_start + str(int(view_ID))
    response = analytics.reports().batchGet(
        body={
            'reportRequests': [
                {
                    'viewId': l,
                    'dateRanges': [{'startDate': startDate, 'endDate': endDate}],
                    'metrics': [
                        {'expression': 'ga:ColD'}
                    ],
                    'dimensions': [
                        {'name': 'ga:ColC'},
                        {'name': 'ga:segment'}
                    ],
                    'segments': [
                        {
                            'segmentId': 'gaid::Xjfjfn30323j4'
                        }]
                }]}).execute()
    report_list = response.get('reports')
    for report in report_list:
        data_rows = report.get('data', {}).get('rows', [])
        for row in data_rows:
            dimensions_in_row = row.get('dimensions')
            metrics_rows = row.get('metrics')
            for metrics in metrics_rows:
                metrics_values = metrics.get('values')
                full_row_data = dimensions_in_row + metrics_values
                gaDataOne.append(full_row_data)
    gaDataOneNew = []
    for i in gaDataOne:
        new_tuple = []
        new_tuple.append(i[0])
        new_tuple.append(int(i[2]))
        gaDataOneNew.append(tuple(new_tuple))
    colOne = ['ColC', 'ColE']
    dfY = pd.DataFrame(gaDataOneNew, columns=colOne)
    print(dfY)
    return dfY
for view_ID in lZ:
    dfXX = getGAD(view_ID, start_date, end_date, None)
    dfYY = getGADOne(view_ID, start_date, end_date)
    df = pd.merge(dfXX, dfYY, how='outer', on=['ColC']).fillna(0)
    df = df[['ColC', 'ColD', 'ColE']]
    df = df[df['ColC'].apply(lambda x: x.isascii())]
    ColA = 1111
    ColB = 'MHA'
    df.insert(loc=0, column='ColA', value=ColA)
    df.insert(loc=1, column='ColB', value=ColB)
Then, to look at the final results before sending them to MySQL, I've been pushing the frame to a CSV...
df.to_csv('df.csv', index=False)
There's some more code to this, like the package imports and API connection info, that I didn't add; I mention it just in case you see something that looks like it should be there but isn't.
Answer:
If you are using Python 3.7+, there is a string method called str.isascii():
- https://docs.python.org/3/library/stdtypes.html#str.isascii
In [26]: import pandas as pd
In [27]: df['ColC'].tolist()
Out[27]:
['/',
 '/nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/',
 '/ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1¯/_(ツ)_/¯/']
In [28]: df[df['ColC'].apply(lambda x: x.isascii())]
Out[28]:
                                           ColC
0                                             /
1   /nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/
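If you'd rather keep the row and just strip the odd characters (the question's second option), one approach not shown in the session above is to round-trip the column through ASCII with errors='ignore', which silently drops anything that can't be encoded. A sketch on hand-typed copies of the question's values:

```python
import pandas as pd

# Toy frame with the question's values typed in by hand.
df = pd.DataFrame({"ColC": [
    "/",
    "/nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/",
    "/ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1\u00af/_(\u30c4)_/\u00af/",
]})

# Encode to ASCII bytes, discarding unencodable characters, then decode back.
df["ColC"] = df["ColC"].str.encode("ascii", errors="ignore").str.decode("ascii")
print(df["ColC"].tolist())
```

This keeps the rest of each value intact, at the cost of silently mangling any path that legitimately contained non-ASCII characters, so dropping rows via isascii() is safer when the odd rows are disposable.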