Reputation: 91
I have a list like this. How can I eliminate \xe2\x80\x99, \xe2\x80\x9c, etc. from my list using Python? Is there any way to remove this kind of data from the list? Is there a common pattern available?
['guest', 'demo', ':', 'eric', 'iverson', '\xe2\x80\x99s', 'itty', 'bitty', 'search', 'february', '16', 'th', ',', '2010', 'by', 'daniel', 'tunkelang', 'respond', 'i', '\xe2\x80\x99m', 'back', 'from', 'vacation', ',', 'and', 'still', 'digging', 'my', 'way', 'out', 'of', 'everything', 'that', '\xe2\x80\x99s', 'piled', 'up', 'while', 'i', '\xe2\x80\x99ve', 'been', 'offline', 'while', 'i', 'catch', 'up', ',', 'i', 'thought', 'i', '\xe2\x80\x99d', 'share', 'with', 'you', 'a', 'demo', 'that', 'eric', 'iverson', 'was', 'gracious', 'enough', 'to', 'share', 'with', 'me', 'it', 'uses', 'yahoo', '!', 'boss', 'to', 'support', 'an', 'exploratory', 'search', 'experience', 'on', 'top', 'of', 'a', 'general', 'web', 'search', 'engine', 'when', 'you', 'perform', 'a', 'query', ',', 'the', 'application', 'retrieves', 'a', 'set', 'of', 'related', 'term', 'candidates', 'using', 'yahoo', '\xe2\x80\x99s', 'key', 'terms', 'api', 'it', 'then', 'scores', 'each', 'term', 'by', 'dividing', 'it', 'is', 'occurrence', 'count', 'within', 'the', 'result', 'set', 'by', 'it', 'is', 'global', 'occurrence', 'count', '\xe2\x80\x93a', 'relevance', 'measure', 'similar', 'to', 'one', 'my', 'former', 'colleagues', 'and', 'i', 'used', 'at', 'endeca', 'in', 'enterprise', 'contexts', 'you', 'can', 'try', 'out', 'the', 'demo', 'yourself', 'at', 'http', '://www', 'ittybittysearch', 'com', '/', 'while', 'it', 'has', 'rough', 'edges', ',', 'it', 'produces', 'nice', 'results', '\xe2\x80\x93especially', 'considering', 'the', 'simplicity', 'of', 'the', 'approach', 'here', '\xe2\x80\x99s', 'an', 'example', 'of', 'how', 'i', 'used', 'the', 'application', 'to', 'explore', 'and', 'learn', 'something', 'new', 'i', 'started', 'with', '["', 'information', 'retrieval', '"]', 'i', 'noticed', '\xe2\x80\x9c', 'interactive', 'information', 'retrieval', '\xe2\x80\x9d', 'as', 'a', 'top', 'term', ',', 'so', 'i', 'used', 'it', 'to', 'refine', 'most', 'of', 'the', 'refinement', 'suggestions', 'looked', 'familiar', 'to', 'me', 
'\xe2\x80\x93but', 'an', 'unfamiliar', 'name', 'caught', 'my', 'attention', ':', '\xe2\x80\x9c', 'anton', 'leuski', '\xe2\x80\x9d', 'following', 'my', 'curiosity', ',', 'i', 'refined', 'again', 'looking', 'at', 'the', 'results', ',', 'i', 'immediately', 'saw', 'that', 'leuski', 'had', 'done', 'work', 'on', 'evaluating', 'document', 'clustering', 'for', 'interactive', 'information', 'retrieval', 'further', 'exploration', 'made', 'it', 'clear', 'this', 'is', 'someone', 'whose', 'work', 'i', 'should', 'get', 'to', 'know', '\xe2\x80\x93check', 'out', 'his', 'home', 'page', '!', 'i', 'can', '\xe2\x80\x99t', 'promise', 'that', 'you', '\xe2\x80\x99ll', 'have', 'as', 'productive', 'an', 'experience', 'as', 'i', 'did', ',', 'but', 'i', 'encourage', 'you', 'to', 'try', 'eric', '\xe2\x80\x99s', 'demo', 'it', '\xe2\x80\x99s', 'simple', 'examples', 'like', 'these', 'that', 'remind', 'me', 'of', 'the', 'value', 'of', 'pursuing', 'hcir', 'for', 'the', 'open', 'web', 'speaking', 'of', 'which', ',', 'hcir', '2010', 'is', 'in', 'the', 'works', 'we', '\xe2\x80\x99ll', 'flesh', 'out', 'the', 'details', 'over', 'the', 'next', 'weeks', ',', 'and', 'of', 'course', 'i', '\xe2\x80\x99ll', 'share', 'them', 'here']
Upvotes: 1
Views: 1924
Reputation: 362716
If I may hazard a guess that the input is UTF-8 encoded, you could do something like this:
>>> from unidecode import unidecode
>>> my_list = ['guest', 'demo', ':', 'eric', 'iverson', '\xe2\x80\x99s', 'itty', 'bitty', 'search', 'february', '16', 'th', ',', '2010', 'by', 'daniel', 'tunkelang', 'respond', 'i', '\xe2\x80\x99m', 'back', 'from', 'vacation', ',', 'and', 'still', 'digging', 'my', 'way', 'out', 'of', 'everything', 'that', '\xe2\x80\x99s', 'piled', 'up', 'while', 'i', '\xe2\x80\x99ve', 'been', 'offline', 'while', 'i', 'catch', 'up', ',', 'i', 'thought', 'i', '\xe2\x80\x99d', 'share', 'with', 'you', 'a', 'demo', 'that', 'eric', 'iverson', 'was', 'gracious', 'enough', 'to', 'share', 'with', 'me', 'it', 'uses', 'yahoo', '!', 'boss', 'to', 'support', 'an', 'exploratory', 'search', 'experience', 'on', 'top', 'of', 'a', 'general', 'web', 'search', 'engine', 'when', 'you', 'perform', 'a', 'query', ',', 'the', 'application', 'retrieves', 'a', 'set', 'of', 'related', 'term', 'candidates', 'using', 'yahoo', '\xe2\x80\x99s', 'key', 'terms', 'api', 'it', 'then', 'scores', 'each', 'term', 'by', 'dividing', 'it', 'is', 'occurrence', 'count', 'within', 'the', 'result', 'set', 'by', 'it', 'is', 'global', 'occurrence', 'count', '\xe2\x80\x93a', 'relevance', 'measure', 'similar', 'to', 'one', 'my', 'former', 'colleagues', 'and', 'i', 'used', 'at', 'endeca', 'in', 'enterprise', 'contexts', 'you', 'can', 'try', 'out', 'the', 'demo', 'yourself', 'at', 'http', '://www', 'ittybittysearch', 'com', '/', 'while', 'it', 'has', 'rough', 'edges', ',', 'it', 'produces', 'nice', 'results', '\xe2\x80\x93especially', 'considering', 'the', 'simplicity', 'of', 'the', 'approach', 'here', '\xe2\x80\x99s', 'an', 'example', 'of', 'how', 'i', 'used', 'the', 'application', 'to', 'explore', 'and', 'learn', 'something', 'new', 'i', 'started', 'with', '["', 'information', 'retrieval', '"]', 'i', 'noticed', '\xe2\x80\x9c', 'interactive', 'information', 'retrieval', '\xe2\x80\x9d', 'as', 'a', 'top', 'term', ',', 'so', 'i', 'used', 'it', 'to', 'refine', 'most', 'of', 'the', 'refinement', 'suggestions', 'looked', 'familiar', 
'to', 'me', '\xe2\x80\x93but', 'an', 'unfamiliar', 'name', 'caught', 'my', 'attention', ':', '\xe2\x80\x9c', 'anton', 'leuski', '\xe2\x80\x9d', 'following', 'my', 'curiosity', ',', 'i', 'refined', 'again', 'looking', 'at', 'the', 'results', ',', 'i', 'immediately', 'saw', 'that', 'leuski', 'had', 'done', 'work', 'on', 'evaluating', 'document', 'clustering', 'for', 'interactive', 'information', 'retrieval', 'further', 'exploration', 'made', 'it', 'clear', 'this', 'is', 'someone', 'whose', 'work', 'i', 'should', 'get', 'to', 'know', '\xe2\x80\x93check', 'out', 'his', 'home', 'page', '!', 'i', 'can', '\xe2\x80\x99t', 'promise', 'that', 'you', '\xe2\x80\x99ll', 'have', 'as', 'productive', 'an', 'experience', 'as', 'i', 'did', ',', 'but', 'i', 'encourage', 'you', 'to', 'try', 'eric', '\xe2\x80\x99s', 'demo', 'it', '\xe2\x80\x99s', 'simple', 'examples', 'like', 'these', 'that', 'remind', 'me', 'of', 'the', 'value', 'of', 'pursuing', 'hcir', 'for', 'the', 'open', 'web', 'speaking', 'of', 'which', ',', 'hcir', '2010', 'is', 'in', 'the', 'works', 'we', '\xe2\x80\x99ll', 'flesh', 'out', 'the', 'details', 'over', 'the', 'next', 'weeks', ',', 'and', 'of', 'course', 'i', '\xe2\x80\x99ll', 'share', 'them', 'here']
>>> my_clean_list = [unidecode(x.decode('utf8')) for x in my_list]
>>> my_clean_list
['guest', 'demo', ':', 'eric', 'iverson', "'s", 'itty', 'bitty', 'search', 'february', '16', 'th', ',', '2010', 'by', 'daniel', 'tunkelang', 'respond', 'i', "'m", 'back', 'from', 'vacation', ',', 'and', 'still', 'digging', 'my', 'way', 'out', 'of', 'everything', 'that', "'s", 'piled', 'up', 'while', 'i', "'ve", 'been', 'offline', 'while', 'i', 'catch', 'up', ',', 'i', 'thought', 'i', "'d", 'share', 'with', 'you', 'a', 'demo', 'that', 'eric', 'iverson', 'was', 'gracious', 'enough', 'to', 'share', 'with', 'me', 'it', 'uses', 'yahoo', '!', 'boss', 'to', 'support', 'an', 'exploratory', 'search', 'experience', 'on', 'top', 'of', 'a', 'general', 'web', 'search', 'engine', 'when', 'you', 'perform', 'a', 'query', ',', 'the', 'application', 'retrieves', 'a', 'set', 'of', 'related', 'term', 'candidates', 'using', 'yahoo', "'s", 'key', 'terms', 'api', 'it', 'then', 'scores', 'each', 'term', 'by', 'dividing', 'it', 'is', 'occurrence', 'count', 'within', 'the', 'result', 'set', 'by', 'it', 'is', 'global', 'occurrence', 'count', '-a', 'relevance', 'measure', 'similar', 'to', 'one', 'my', 'former', 'colleagues', 'and', 'i', 'used', 'at', 'endeca', 'in', 'enterprise', 'contexts', 'you', 'can', 'try', 'out', 'the', 'demo', 'yourself', 'at', 'http', '://www', 'ittybittysearch', 'com', '/', 'while', 'it', 'has', 'rough', 'edges', ',', 'it', 'produces', 'nice', 'results', '-especially', 'considering', 'the', 'simplicity', 'of', 'the', 'approach', 'here', "'s", 'an', 'example', 'of', 'how', 'i', 'used', 'the', 'application', 'to', 'explore', 'and', 'learn', 'something', 'new', 'i', 'started', 'with', '["', 'information', 'retrieval', '"]', 'i', 'noticed', '"', 'interactive', 'information', 'retrieval', '"', 'as', 'a', 'top', 'term', ',', 'so', 'i', 'used', 'it', 'to', 'refine', 'most', 'of', 'the', 'refinement', 'suggestions', 'looked', 'familiar', 'to', 'me', '-but', 'an', 'unfamiliar', 'name', 'caught', 'my', 'attention', ':', '"', 'anton', 'leuski', '"', 'following', 'my', 
'curiosity', ',', 'i', 'refined', 'again', 'looking', 'at', 'the', 'results', ',', 'i', 'immediately', 'saw', 'that', 'leuski', 'had', 'done', 'work', 'on', 'evaluating', 'document', 'clustering', 'for', 'interactive', 'information', 'retrieval', 'further', 'exploration', 'made', 'it', 'clear', 'this', 'is', 'someone', 'whose', 'work', 'i', 'should', 'get', 'to', 'know', '-check', 'out', 'his', 'home', 'page', '!', 'i', 'can', "'t", 'promise', 'that', 'you', "'ll", 'have', 'as', 'productive', 'an', 'experience', 'as', 'i', 'did', ',', 'but', 'i', 'encourage', 'you', 'to', 'try', 'eric', "'s", 'demo', 'it', "'s", 'simple', 'examples', 'like', 'these', 'that', 'remind', 'me', 'of', 'the', 'value', 'of', 'pursuing', 'hcir', 'for', 'the', 'open', 'web', 'speaking', 'of', 'which', ',', 'hcir', '2010', 'is', 'in', 'the', 'works', 'we', "'ll", 'flesh', 'out', 'the', 'details', 'over', 'the', 'next', 'weeks', ',', 'and', 'of', 'course', 'i', "'ll", 'share', 'them', 'here']
Here I am using the unidecode module to transform those "fancy" characters into their nearest ASCII equivalents:
>>> for before, after in zip(my_list, my_clean_list):
...     if before != after:
...         print before, ' --> ', after
...
’s --> 's
’m --> 'm
’s --> 's
’ve --> 've
’d --> 'd
’s --> 's
–a --> -a
–especially --> -especially
’s --> 's
“ --> "
” --> "
–but --> -but
“ --> "
” --> "
–check --> -check
’t --> 't
’ll --> 'll
’s --> 's
’s --> 's
’ll --> 'll
’ll --> 'll
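For readers on Python 3 who would rather not install a third-party module, the same replacements can be approximated with the standard library. This is only a sketch: the names PUNCT_MAP and clean_token are made up for illustration, and the table covers just the four characters that appear in the sample data, whereas unidecode handles far more.

```python
# Hypothetical mapping: only the four characters seen in the sample data.
PUNCT_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark (\xe2\x80\x99 in UTF-8)
    "\u201c": '"',   # left double quotation mark  (\xe2\x80\x9c)
    "\u201d": '"',   # right double quotation mark (\xe2\x80\x9d)
    "\u2013": "-",   # en dash                     (\xe2\x80\x93)
})

def clean_token(token):
    """Decode a UTF-8 byte string if needed, then swap mapped characters for ASCII."""
    if isinstance(token, bytes):
        token = token.decode("utf-8")
    return token.translate(PUNCT_MAP)

raw_tokens = [b"eric", b"iverson", b"\xe2\x80\x99s", b"\xe2\x80\x9c", b"\xe2\x80\x93a"]
clean_tokens = [clean_token(t) for t in raw_tokens]
print(clean_tokens)
```

Any character missing from the table passes through unchanged, so extend the mapping if your data contains other typographic punctuation.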
As you can probably guess, it looks like some English text was supposed to be split at word boundaries and this was done incorrectly. If it is your code that generates this data, I suggest you solve the issue closer to the source of the problem!
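To illustrate what fixing it at the source might look like, here is a Python 3 sketch that decodes the raw bytes to text before tokenizing, so a curly apostrophe stays inside its word instead of starting a new token. The tokenizer that produced the original list is unknown, so TOKEN_RE and tokenize below are hypothetical stand-ins, not a reconstruction of it.

```python
import re

# Hypothetical tokenizer: a word may contain straight or curly apostrophes;
# any other character that is neither a word character nor whitespace
# becomes a token of its own.
TOKEN_RE = re.compile(r"[\w'\u2019]+|[^\w\s]")

def tokenize(raw_bytes):
    """Decode UTF-8 bytes to text first, then lowercase and split into tokens."""
    text = raw_bytes.decode("utf-8").lower()
    return TOKEN_RE.findall(text)

sample = b"I\xe2\x80\x99m back from vacation, and still digging"
tokens = tokenize(sample)
print(tokens)
```

With the text decoded up front, "I’m" comes out as the single token 'i’m' rather than 'i' followed by a stray '\xe2\x80\x99m'.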
Upvotes: 3
Reputation: 74615
You could use a list comprehension, assuming you just want to completely remove the elements of your list that contain non-alphanumeric characters. If your list is in a variable a:
[x for x in a if x.isalnum()]
This returns the list minus the elements containing \xe2\x80\x99, etc. Note that it also drops punctuation-only elements such as ':' and ','.
This is equivalent to the filter solution mentioned by @ssm; they just got there first.
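A quick way to check that equivalence is to run both forms on a small sample. The sketch below assumes Python 3 with str tokens, where filter() returns a lazy iterator and has to be wrapped in list(); the sample list is a shortened, already-decoded stand-in for the one in the question.

```python
# Shortened stand-in for the question's list, with the curly apostrophe decoded.
a = ["guest", "demo", ":", "eric", "\u2019s", "16", ",", "2010"]

kept_by_comprehension = [x for x in a if x.isalnum()]
kept_by_filter = list(filter(str.isalnum, a))  # filter() is lazy in Python 3

print(kept_by_comprehension)
print(kept_by_comprehension == kept_by_filter)
```

Both forms drop '’s' entirely, along with ':' and ',', so this approach only suits cases where losing those tokens is acceptable.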
Upvotes: 0
Reputation: 5373
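It looks like you have a bunch of strings containing non-ASCII (UTF-8 encoded) characters that you want to eliminate. Just choose the alphanumeric elements of the list, like so:
>>> filter(lambda m: m.isalnum(), p)
That should eliminate the elements containing non-ASCII bytes, along with any punctuation-only elements.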
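The other option is to join the list, decode it as ASCII while ignoring the bytes that are not ASCII, and split it again:
>>> ' '.join(p).decode('ascii', 'ignore').encode('ascii').split()
This should do a much better job: it keeps the ASCII remainder of each token (for example, '\xe2\x80\x99s' becomes 's') instead of dropping the whole element.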
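In Python 3, str has no decode method, so the order of the two calls flips: encode to ASCII while ignoring what cannot be represented, then decode back. This is a sketch that assumes the tokens are already str objects; p here is a shortened stand-in for the real list.

```python
# Shortened stand-in for the real list, with the UTF-8 bytes already decoded.
p = ["eric", "\u2019s", "demo", "\u201c", "anton", "\u2013check"]

# Drop every non-ASCII character, then re-split on whitespace.
ascii_only = " ".join(p).encode("ascii", "ignore").decode("ascii").split()
print(ascii_only)
```

A token made up entirely of non-ASCII characters, such as the lone '“', vanishes after the split, so the result can be shorter than the input list.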
Upvotes: 1