Reputation: 3
I have a large text document that I am reading in and attempting to split into a multiple list. I'm having a hard time with the logic behind actually splitting up the string.
example of the text:
Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410
This data contains 4 pieces of information in this format:
City[coordinates]Population Distances_to_previous
My aim is to split this data up into a List:
Data = [[City] , [Coordinates] , [Population] , [Distances]]
As far as I know I need to use .split statements but I've gotten lost trying to implement them.
I'd be very grateful for some ideas to get started!
Upvotes: 0
Views: 113
Reputation: 54273
The easiest way would be with a regex.
lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""
import re
pat = re.compile(r"""
(?P<City>.+?) # all characters up to the first [
\[(?P<Coordinates>\d+,\d+)\] # grabs [(digits,here)]
(?P<Population>\d+) # population digits here
\s # a space or a newline?
(?P<Distances>[\d ]+)? # Everything else is distances""", re.M | re.X)
groups = pat.finditer(lines)
results = [[[g.group("City")],
[g.group("Coordinates")],
[g.group("Population")],
g.group("Distances").split() if
g.group("Distances") else [None]]
for g in groups]
DEMO:
In[50]: results
Out[50]:
[[['Youngstown, OH'], ['4110,8065'], ['115436'], [None]],
[['Yankton, SD'], ['4288,9739'], ['12011'], ['966']],
[['Yakima, WA'], ['4660,12051'], ['49826'], ['1513', '2410']]]
Though if I may, it's probably BEST to do this as a list of dictionaries.
groups = pat.finditer(lines)
results = [{key: g.group(key)} for g in groups for key in
["City", "Coordinates", "Population", "Distances"]]
# then modify later
for d in results:
try:
d['Distances'] = d['Distances'].split()
except AttributeError:
# distances is None -- that's okay
pass
Upvotes: 1
Reputation: 623
I would do this in stages.
I'd start with something like:
numCities = 0
Data = []
i = 0
while i < len(lines):
split = lines[i].partition('[')
if (split[1]): # We found something
city = split[0]
split = split[2].partition(']')
if (split[1]):
coords = split[0] #If you want this as a list then rsplit it
population = split[2]
distances = []
if i > 0:
i += 1
distances = lines[i].rsplit(' ')
Data.append([city, coords, population, distances])
numCities += 1
i += 1
for data in Data:
print (data)
This will print
['Youngstown, OH', '4110,8065', '115436', []]
['Yankton, SD', '4288,9739', '12011', ['966']]
['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
Upvotes: 1