SpicyClubSauce

Reputation: 4256

How to write a regular expression to retrieve genres from this string?

If I want to extract a list of just ['Horror', 'Adult', 'Cult Movies', etc.] from this pandas.core.index.Index object, what would be the best regex for this? Something that grabs everything following the capital T up to the closing bracket?

But then is that a bad approach, given "Television" starts with a capital T? What should the approach be here? I've never used regex before.

Index([u'variable[T.Horror]', u'variable[T.Adult]', u'variable[T.Cult Movies]', u'variable[T.Mystery & Suspense]', u'variable[T.Science Fiction & Fantasy]', u'variable[T.Western]', u'variable[T.Gay & Lesbian]', u'Q("Tomato-meter")', u'variable[T.Comedy]', u'variable[T.Television]', u'variable[T.Kids & Family]', u'variable[T.Classics]', u'variable[T.Drama]', u'variable[T.Art House & International]', u'variable[T.Romance]', u'variable[T.Special Interest]', u'variable[T.Animation]', u'variable[T.Documentary]', u'variable[T.Musical & Performing Arts]', u'variable[T.Sports & Fitness]', u'variable[T.Faith & Spirituality]', u'variable[T.Anime & Manga]', u'Intercept'], dtype='object')

Upvotes: 0

Views: 72

Answers (2)

Kasravnd

Reputation: 107297

You can use the following regex within a list comprehension (here mylist is assumed to hold the index labels from the question):

>>> import re
>>> regx=re.compile(r'(?<=\[T\.)([^\]]+)(?=\])')
>>> [regx.search(i).group() for i in mylist if '[' in i]
[u'Horror', u'Adult', u'Cult Movies', u'Mystery & Suspense', u'Science Fiction & Fantasy', u'Western', u'Gay & Lesbian', u'Comedy', u'Television', u'Kids & Family', u'Classics', u'Drama', u'Art House & International', u'Romance', u'Special Interest', u'Animation', u'Documentary', u'Musical & Performing Arts', u'Sports & Fitness', u'Faith & Spirituality', u'Anime & Manga']

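This regex uses positive look-arounds: it matches one or more characters other than ] that are preceded by [T. (the look-behind (?<=\[T\.)) and followed by ] (the look-ahead (?=\])). Because the look-behind requires the literal text [T. and not just a capital T, a genre such as "Television" is matched in full.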
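To see the look-arounds in isolation, here is a minimal sketch that runs the same pattern against a few labels taken from the question. The samples list is only an illustrative stand-in for the real index, and the loop prints None for labels that have no [T.…] part.

```python
import re

# Same pattern as above: look-behind for "[T.", look-ahead for "]".
regx = re.compile(r'(?<=\[T\.)([^\]]+)(?=\])')

# Illustrative stand-in for the index labels (an assumption, not the full index).
samples = ['variable[T.Television]', 'variable[T.Cult Movies]',
           'Q("Tomato-meter")', 'Intercept']

for s in samples:
    m = regx.search(s)
    # group() returns only the text between "[T." and "]";
    # the look-arounds themselves are not part of the match.
    print(s, '->', m.group() if m else None)
```

The capital T in "Television" and in "Tomato-meter" does not confuse the pattern, since neither is preceded by [ and followed by a dot.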

Also note that, as a more Pythonic and slightly faster approach, I used re.compile to compile the regex once, outside the list comprehension, so the pattern string does not have to be processed again on each iteration.

Upvotes: 1

karthik manchala

Reputation: 13640

You can use the following regex:

(?<=T\.)([^\]]+)

See DEMO
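As a rough sketch of how this shorter pattern could be applied in Python (the answer itself gives only the regex), the snippet below keeps the labels that match and skips the rest. The labels list is a hypothetical stand-in for the index in the question.

```python
import re

# Pattern from this answer: whatever follows "T." up to the next "]".
pattern = re.compile(r'(?<=T\.)([^\]]+)')

# Hypothetical stand-in for the pandas Index from the question.
labels = ['variable[T.Horror]', 'variable[T.Television]',
          'Q("Tomato-meter")', 'Intercept']

# search() returns None for labels without "T.", so filter those out.
genres = [m.group(1) for m in map(pattern.search, labels) if m]
print(genres)
```

Note that this pattern does not require the opening bracket, so it would also match a "T." appearing elsewhere in a label. For the labels shown in the question that makes no difference.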

Upvotes: 1
