Reputation: 99
I am trying to run some basic analyses on a .txt file filled with car data. I have read the file into Python and am trying to split it into the appropriate columns, but what will be the "first column," the car name, sometimes has multiple words. For example, below are two lines with some of the information that my file has:
Therefore, when I split each line by spaces alone, I end up with lists of different sizes--in the example above, the "Chevy" and "Nova" would be separated from one another.
I have figured out a way to identify the portion of each line that represents the car name:
for line in cardata:
    if line == line[0]: #for header line
        continue
    else:
        line = line.rstrip()
        carnamebreakpoint = line.find('7/')
        print carnamebreakpoint
        carname = line[:carnamebreakpoint]
        print carname
What I'd like to do now is tell Python to split by space after the carname (with the end goal of a list that looks like [carname, date, color, number sold]). I've tried playing around with the .split() function to do this, with no luck thus far. I'd love some guidance on how to proceed, as I'm fairly new to programming.
Thanks in advance for any help!
Upvotes: 1
Views: 800
Reputation: 2130
Depending on how confident you are in the format of your data, your solution might not be the best one.
What would happen if a car's date does not start with "7/"? And what about a color like "Light Blue"?
This kind of task is a good fit for a regex.
For instance, a regex like this one lets you easily isolate the four components:
^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$
In python you can use it like this:
import re
s = "Chevy Nova 7/1/2000 Blue 28,000"
m = re.match(r"^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$", s)
m.group(1) # => Chevy Nova
m.group(2) # => 7/1/2000
m.group(3) # => Blue
m.group(4) # => 28,000
And if you have a string with multiple lines you could batch process them like this:
s = """Chevy Nova 7/1/2000 Blue 28,000
Chevy Nova 10/6/2002 Light Blue 28,000
Cadillac 7/1/2001 Silver 30,000"""
re.findall(r"^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$", s, flags=re.MULTILINE)
# => [('Chevy Nova', '7/1/2000', 'Blue', '28,000'),
# => ('Chevy Nova', '10/6/2002', 'Light Blue', '28,000'),
# => ('Cadillac', '7/1/2001', 'Silver', '30,000')]
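If the data is read line by line rather than as one string, the same pattern can be applied per line. The following is only a sketch: the function name `parse_rows` and the sample list are made up for illustration, and it assumes the header looks like "Car Date Color Quantity". Because the header and blank lines contain no date, they fail to match and are skipped.

```python
import re

# Same pattern as above, compiled once so it can be reused for every line.
ROW = re.compile(r"^(.*) (\d{1,2}/\d{1,2}/\d{4}) (.*) ([\d,]+)$")

def parse_rows(lines):
    """Return [name, date, color, quantity] for every line that matches."""
    rows = []
    for line in lines:
        m = ROW.match(line.strip())
        if m:  # header and blank lines do not match, so they are skipped
            rows.append(list(m.groups()))
    return rows

# Hypothetical sample standing in for the lines of the file.
sample = [
    "Car Date Color Quantity",
    "Chevy Nova 7/1/2000 Blue 28,000",
    "Chevy Nova 10/6/2002 Light Blue 28,000",
    "",
]
rows = parse_rows(sample)
print(rows)
```

Lines that silently fail to match are dropped here; if that could hide bad data, it may be safer to collect them in a separate list and inspect them.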
Upvotes: 0
Reputation: 180391
s = "Chevy Nova 7/1/2000 Blue 28,000"
s.rsplit(None,3)
It will only split 3 times from the end of the string:
In [4]: s = "Chevy Nova 7/1/2000 Blue 28,000"
In [5]: s.rsplit(None,3)
Out[5]: ['Chevy Nova', '7/1/2000', 'Blue', '28,000']
In [8]: s ="Car Date Color Quantity "
In [9]: s.rsplit(None,3)
Out[9]: ['Car', 'Date', 'Color', 'Quantity']
This presumes that the last three items will always be single-word strings, as in your example. That should be a safe assumption, because otherwise your indexing approach would fail too.
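To make that assumption concrete, here is a small sketch of what happens when it does not hold. The second line, with the two-word color "Light Blue", is a hypothetical input that is not in your example.

```python
one_word = "Chevy Nova 7/1/2000 Blue 28,000"
two_word = "Chevy Nova 10/6/2002 Light Blue 28,000"  # hypothetical two-word color

good = one_word.rsplit(None, 3)
bad = two_word.rsplit(None, 3)

print(good)  # ['Chevy Nova', '7/1/2000', 'Blue', '28,000']
print(bad)   # ['Chevy Nova 10/6/2002', 'Light', 'Blue', '28,000']
```

In the second case the date ends up glued to the car name, so the result still has four items but they are the wrong ones.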
Also, to ignore the header, you can call next() on the file object:
with open("your_file.txt") as f:
    header = next(f)
    for line in f:
        car_name,date,col,mile = line.rstrip().rsplit(None,3)
        print(car_name,date,col,mile)
('Chevy Nova', '7/1/2000', 'Blue', '28,000')
('Cadillac', '7/1/2001', 'Silver', '30,000')
Upvotes: 2
Reputation: 13016
First slice the string at the breakpoint, then call split() on the result:
date, color, quantity = line[breakpoint:].split()
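Put together, the slice-then-split idea might look like the sketch below. It is a variation rather than the exact code from the question: instead of `line.find('7/')` it locates the breakpoint with a regex search for a date, assuming dates look like 7/1/2000. The names `split_row`, `DATE`, and `cut` are made up for illustration (`cut` is used because `breakpoint` is also the name of a built-in function in Python 3.7+).

```python
import re

# Assumed date format: one or two digits, slash, one or two digits, slash, four digits.
DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

def split_row(line):
    """Return [carname, date, color, quantity], or None if the line has no date."""
    line = line.rstrip()
    m = DATE.search(line)
    if m is None:
        return None  # e.g. the header line
    cut = m.start()                  # index where the date begins
    carname = line[:cut].rstrip()    # everything before the date
    # Raises ValueError if the rest is not exactly three words
    # (for example, a two-word color).
    date, color, quantity = line[cut:].split()
    return [carname, date, color, quantity]

print(split_row("Chevy Nova 7/1/2000 Blue 28,000"))
print(split_row("Car Date Color Quantity"))
```

As with rsplit, this only works when the color and quantity are single words.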
Upvotes: 1