How do I exclude all rows with certain characters from my dataframe?

Question

I'm pulling data from a website through an API and sorting it into pandas dataframes. One of my dataframes has a column containing a value that has non-standard characters, which won't pass through to MySQL using .to_sql. I'd like to drop any rows that contain non-standard characters or find a way to eliminate the troublesome characters while keeping the rest of the value.

A sample of my dataframe looks like this....

╔══════╦══════╦══════════════════════════════════════════════════════════════╦═══════╦══════╗
║ ColA ║ ColB ║ ColC                                                         ║ ColD  ║ ColE ║
╠══════╬══════╬══════════════════════════════════════════════════════════════╬═══════╬══════╣
║ 1111 ║ MHA  ║ /                                                            ║ 43920 ║ 589  ║
╠══════╬══════╬══════════════════════════════════════════════════════════════╬═══════╬══════╣
║ 1111 ║ MHA  ║ /nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/                  ║ 42    ║ 3    ║
╠══════╬══════╬══════════════════════════════════════════════════════════════╬═══════╬══════╣
║ 1111 ║ MHA  ║ /ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1¬Ø/_(„ÉÑ)_/¬Ø/      ║ 8     ║ 1    ║
╚══════╩══════╩══════════════════════════════════════════════════════════════╩═══════╩══════╝

You can see the line with /ud-i/ has odd characters at the end of it (¬Ø/(„ÉÑ)/¬Ø), which I can't seem to eliminate. I've tried this, which has worked for me in the past...

df = df[df.ColC !='/ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1¬Ø/_(„ÉÑ)_/¬Ø/']

But I can't seem to get it to work now.

As a note /ud-i/ and /nw-i/ type rows come up a lot in the dataframe 99.9% of which are fine, so I'm really just looking to exclude anything that contains none standard characters vs anything that starts with /ud-i/ or something along those lines. I also don't want to just filter for these specific characters (if I don't have to) since other odd ones may populate during different pulls.

Any help on this would be much appreciated! Thank you!

EDIT: Additional Code

gaData = []
gaDataOne = []

def getGAD(view_ID, startDate, endDate, nextPageToken):
    l = view_id_start + str(int(view_ID))
    response = analytics.reports().batchGet(
            body={
            'reportRequests': [
            {
                'viewId': l,
                'dateRanges': [{'startDate': startDate, 'endDate': endDate}],
                'metrics': [
                    {'expression': 'ga:ColD'}
                    ],
                'dimensions': [
                    {'name': 'ga:ColC'}
                    ],
                'pageToken': nextPageToken,
                'pageSize': '10000'
            }]}).execute()
    report_list = response.get('reports')
    for report in report_list:
        data_rows = report.get('data', {}).get('rows', [])
        for row in data_rows:
            dimensions_in_row = row.get('dimensions')
            metrics_rows = row.get('metrics')
            for metrics in metrics_rows:
                metrics_values = metrics.get('values')
                full_row_data = dimensions_in_row + metrics_values
                gaData.append(full_row_data)

    gaData = []
    for i in gaData:
        new_tuple = []
        new_tuple.append(i[0])
        new_tuple.append(int(i[1]))
        new_tuple.append(float(i[2]))
        gaDataNew.append(tuple(new_tuple))

    col = [
    'ColC', 'ColD']
    dfX = pd.DataFrame(gaDataNew, columns = col)
    print(dfX)
    return dfX

def getGADOne(view_ID, startDate, endDate):
    l = view_id_start + str(int(view_ID))
    response = analytics.reports().batchGet(
            body={
            'reportRequests': [
            {
                'viewId': l,
                'dateRanges': [{'startDate': startDate, 'endDate': endDate}],
                'metrics': [
                    {'expression': 'ga:ColD'}
                    ],
                'dimensions': [
                    {'name': 'ga:ColC'},
                    {'name': 'ga:segment'}
                    ],
                'segments': [
                    {
                        'segmentId': 'gaid::Xjfjfn30323j4'
                    }]
            }]}).execute()

    report_list = response.get('reports')
    for report in report_list:
        data_rows = report.get('data', {}).get('rows', [])
        for row in data_rows:
            dimensions_in_row = row.get('dimensions')
            metrics_rows = row.get('metrics')
            for metrics in metrics_rows:
                metrics_values = metrics.get('values')
                full_row_data = dimensions_in_row + metrics_values
                gaDataOne.append(full_row_data)

    gaDataOneNew = []
    for i in gaDataOne:
        new_tuple = []
        new_tuple.append(i[0])
        new_tuple.append(int(i[2]))
        gaDataOneNew.append(tuple(new_tuple))

    colOne = [
    'ColC', 'ColE']
    dfY = pd.DataFrame(gaDataOneNew, columns = colOne)
    print(dfY)
    return dfY

for view_ID in lZ:
    id_str = str(int(lid))
    dfXX = getGAData(view_ID, start_date, end_date, None)
    dfYY = getGADataOne(view_ID, start_date, end_date)

df = pd.merge(dfXX, dfYY, how = 'outer', on = ['ColC']).fillna(0)
df = df[['ColC', 'ColD', 'ColE']]
df['ColC'].tolist()
df[df['ColC'].apply(lambda x:x.isascii())]

ColA = 1111
ColB = 'MHA'

df.insert(loc=0, column='ColA', value=ColA)
df.insert(loc=1, column='ColB', value=ColB)

Then to look at the final results before sending to MySQL I've been pushing it to a csv...

df.to_csv('df.csv', index=False)

There's some more code to this like the package imports and API connection info that I didn't add just in case you see something that doesn't look like it should be there.

bigbounty · Accepted Answer

If you are using 3.7+, there is a function called isascii() - https://docs.python.org/3/library/stdtypes.html#str.isascii

n [26]: import pandas as pd

In [27]: df['Colc'].tolist()
Out[27]:
['/',
 '/nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/',
 '/ud-i/vhcl/2016-Ho-Ci-Sdn-Reg-I-1F559KOZ1¬Ø/_(„ÉÑ)_/¬Ø/']

In [28]: df[df["Colc"].apply(lambda x:x.isascii())]
Out[28]:
                                          Colc
0                                            /
1  /nw-i/vhcl/2020-Ho-Ar-Sdn-Sprt-1-1GFKI0943/

How do I exclude all rows with certain characters from my dataframe?

Answers (1)

Related Questions