Reputation: 2208
I am running PySpark via Python 3 and currently have an RDD, cities:
[u'California - LA', u'Memphis, TN', u'', u'London, England', u'', u'', u'Ohio', u'Burlington, Ontario, Canada', u'', u'', u'North Carolina', u'Wisner, LA', u'', u'', u'Beverly Hills, CA', u'Toronto, Ontario, Canada', u'United States', u'', u'', u'', u'Mineola, AR', u'Washington', u'Dubai UAE', u'Morris Plains, NJ', u'Nevada, MO', u'', u'', u'Georgia', u'New York, NY [Spanish Harlem]', u'Newark, NJ', u'Chicago', u'Brandon', u'Queens, NY', u'Beaumont, TX', u'Houston, TX', u'', u'San Antonio, TX', u'', u'', u'', u'California - LA', u'Detroit, MI', u'London, England', u'', u'Chapel Hill, NC', u'Oxford, MS', u'Dallas, TX', u'', u'', u'Berlin, Germany', u'New York, NY', u'Sao Paulo, Brazil', u'South Jamaica, Queens', u'Los Angeles, CA', u'', u'Middlesbrough, England', u'', u'London, England', u'Egremont, Cumbria, England', u'Garnant, Wales', u'California - SF', u'', u'Melbourne, Australia', u'Nashville, Tennessee', u'', u'', u'Berkeley, CA', u'', u'', u'AUSTRALIA', u'', u'', u'Jamaica, West Indies', u'Pasadena, CA', u'Los Angeles, CA', u'Cleveland, OH', u'', u'', u'New York, NY', u'', u'Minnesota', u'Norway', u'FR', u'Delight, AR', u'Humboldt, TN', u'Tampa, FL', u'CA', u'', u'Birmingham, AL', u'Manchester, England', u'', u'', u'', u'B\xe9zu, comme Superdupont, ne connait qu'un pays : la France !', u'Newark, NJ', u'', u'New York, NY', u'', u'Aarhus, Denmark', u'Sicily Island, LA']
I want a new RDD with just the state name instead of City - State. The common pattern would be a word just before ,[space][word]. What would my lambda function look like?
Upvotes: 2
Views: 4378
Reputation: 6771
How about:
import re
re.findall("[A-Za-z]+", "Toronto, Ontario, Canada", 0)[1]
'Ontario'
re.findall("[A-Za-z]+", "California - LA", 0)[1]
'LA'
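This finds every run of ASCII letters in the string and returns the second match in the resulting list. Note that indexing with [1] raises an IndexError for entries with fewer than two words, such as u'' or u'Ohio'.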
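To turn this into something you could pass to map, here is a minimal sketch. The helper names second_word and after_first_comma are made up for illustration. The first wraps the regex above and returns None instead of raising when there is no second word. The second is an alternative that splits on ", " instead, on the assumption that the state is whatever follows the first comma, which may suit multi-word cities like "New York, NY" better.

```python
import re

def second_word(location):
    """Return the second run of ASCII letters in location, or None if there is none."""
    words = re.findall(r"[A-Za-z]+", location)
    return words[1] if len(words) > 1 else None

def after_first_comma(location):
    """Return the text between the first and second ', ' separators, or None."""
    parts = location.split(", ")
    return parts[1] if len(parts) > 1 else None

sample = ["Toronto, Ontario, Canada", "California - LA", "New York, NY", "Ohio", ""]

print([second_word(s) for s in sample])
# ['Ontario', 'LA', 'York', None, None]

print([after_first_comma(s) for s in sample])
# ['Ontario', None, 'NY', None, None]

# With the RDD from the question (needs a running SparkContext), something like:
# states = cities.map(after_first_comma).filter(lambda s: s is not None)
```

Neither helper covers every entry in your data ("California - LA" has no comma, and "New York, NY [Spanish Harlem]" keeps the bracketed suffix), so you would probably need to combine or extend them for the messier rows.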
Upvotes: 2