jayeshkv

Reputation: 2208

Select substring from an RDD in pyspark

I am running PySpark via Python 3 and currently have an RDD named cities:

[u'California - LA', u'Memphis, TN', u'', u'London, England', u'', u'', u'Ohio', u'Burlington, Ontario, Canada', u'', u'', u'North Carolina', u'Wisner, LA', u'', u'', u'Beverly Hills, CA', u'Toronto, Ontario, Canada', u'United States', u'', u'', u'', u'Mineola, AR', u'Washington', u'Dubai UAE', u'Morris Plains, NJ', u'Nevada, MO', u'', u'', u'Georgia', u'New York, NY [Spanish Harlem]', u'Newark, NJ', u'Chicago', u'Brandon', u'Queens, NY', u'Beaumont, TX', u'Houston, TX', u'', u'San Antonio, TX', u'', u'', u'', u'California - LA', u'Detroit, MI', u'London, England', u'', u'Chapel Hill, NC', u'Oxford, MS', u'Dallas, TX', u'', u'', u'Berlin, Germany', u'New York, NY', u'Sao Paulo, Brazil', u'South Jamaica, Queens', u'Los Angeles, CA', u'', u'Middlesbrough, England', u'', u'London, England', u'Egremont, Cumbria, England', u'Garnant, Wales', u'California - SF', u'', u'Melbourne, Australia', u'Nashville, Tennessee', u'', u'', u'Berkeley, CA', u'', u'', u'AUSTRALIA', u'', u'', u'Jamaica, West Indies', u'Pasadena, CA', u'Los Angeles, CA', u'Cleveland, OH', u'', u'', u'New York, NY', u'', u'Minnesota', u'Norway', u'FR', u'Delight, AR', u'Humboldt, TN', u'Tampa, FL', u'CA', u'', u'Birmingham, AL', u'Manchester, England', u'', u'', u'', u'B\xe9zu, comme Superdupont, ne connait qu'un pays : la France !', u'Newark, NJ', u'', u'New York, NY', u'', u'Aarhus, Denmark', u'Sicily Island, LA']

I want a new RDD with just the state name instead of City - State. The common pattern is a word, then a comma and a space, then another word. What would my lambda function look like?

Upvotes: 2

Views: 4378

Answers (1)

user1952500

Reputation: 6771

How about:

import re

re.findall("[A-Za-z]+", "Toronto, Ontario, Canada", 0)[1]
'Ontario'

re.findall("[A-Za-z]+", "California - LA", 0)[1]
'LA'

This finds every run of letters in the string and returns the second match in the resulting list.
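To apply this to the whole RDD, the lookup needs to cope with entries that have fewer than two words (such as u'' or u'Ohio'), where indexing with [1] would raise an IndexError. Here is a minimal sketch of one way to do that. The names second_word and sample are made up for illustration, and returning None for short entries is an assumption about what you want, not something the question specifies.

```python
import re

def second_word(place):
    """Return the second run of letters in `place`, or None if there are fewer than two."""
    words = re.findall(r"[A-Za-z]+", place)
    return words[1] if len(words) > 1 else None

# A few entries taken from the list in the question.
sample = ["California - LA", "Memphis, TN", "", "Ohio", "Toronto, Ontario, Canada"]
result = [second_word(s) for s in sample]
print(result)  # ['LA', 'TN', None, None, 'Ontario']

# With PySpark, the same function could be passed to map, assuming `cities`
# is the RDD from the question:
# states = cities.map(second_word).filter(lambda s: s is not None)
```

Keep in mind that this still takes the second word, not the word after the comma, so a multi-word city such as "New York, NY" would give 'York' rather than 'NY'.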

Upvotes: 2
