Reputation: 13
I am trying to split the string on each line of a CSV file into three substrings that stay on the same line, with the second and third substrings wrapped in single quotation marks and every substring followed by a comma.
The lines in the CSV are in the following format:
12345678/ABCDE.pdf
12345678/ABCDE.pdf
12345678/ABCDE.pdf
I am new to Python. So far I have tried a split on each line, which returns the two substrings below without the /, but I am not sure how to get from there to the final desired output.
'12345678', 'ABCDE.pdf'
I would like the output to look like this:
12345678,'/ABCDE.pdf','ABCDE',
12345678,'/ABCDE.pdf','ABCDE',
12345678,'/ABCDE.pdf','ABCDE',
with the final string containing the title of the pdf without the file extension.
Any help would be greatly appreciated.
Upvotes: 1
Views: 157
Reputation: 2553
By calling split a second time (on the '.' in the file name), you can construct the desired output string without any need for regex.
In [22]: %%timeit
...: s = '''12345678/ABCDE.pdf
...: 12345678/ABCDE.pdf
...: 12345678/ABCDE.pdf'''
...: for l in s.splitlines():
...:     s_parts = l.split('/')
...:     new_s = '{},\'/{}\',\'{}\','.format(s_parts[0], s_parts[1], s_parts[1].split('.')[0])
...:
100000 loops, best of 3: 3.55 µs per loop
Output:
Out[24]: "12345678,'/ABCDE.pdf','ABCDE',"
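If the file names might contain more than one dot, or the lines come with trailing newlines, a slightly more defensive variant could look like the sketch below. The function name format_line and the second sample file name are made up for illustration; os.path.splitext and str.partition are standard library calls.

```python
import os.path


def format_line(line):
    """Turn '12345678/ABCDE.pdf' into "12345678,'/ABCDE.pdf','ABCDE',"."""
    # partition splits on the first '/' only and never raises
    number, _, filename = line.strip().partition('/')
    # splitext removes only the last extension, so 'my.report.pdf' -> 'my.report'
    title, _ext = os.path.splitext(filename)
    return "{},'/{}','{}',".format(number, filename, title)


# Hypothetical sample input; in practice these would be the lines of the CSV
lines = ["12345678/ABCDE.pdf", "12345678/my.report.pdf"]
for line in lines:
    print(format_line(line))
```

Note that this differs from split('.')[0] only when a file name contains several dots; for the sample data in the question both give the same result.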
For comparison, the regex solution posted in the other answer, which also works fine, has the following runtime. The difference is small in absolute terms, but with a large number of lines to process it could become a factor.
In [25]: %%timeit
...: s = ["12345678/ABCDE.pdf",
...: "12345678/ABCDE.pdf",
...: "12345678/ABCDE.pdf"]
...: new_s = [[re.findall(r"\d+", i)[0], "/"+i.split("/")[-1], re.findall(r"[A-Z]+", i)[0]] for i in s]
...:
100000 loops, best of 3: 11.6 µs per loop
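To repeat this comparison outside IPython, something like the following sketch with the standard timeit module should work. The helper names with_split and with_regex are mine, and both build lists rather than strings so that the two approaches do the same work; the absolute numbers will differ from machine to machine.

```python
import re
import timeit

LINES = ["12345678/ABCDE.pdf"] * 3


def with_split(lines):
    # Plain string methods: split on '/' and then on '.'
    out = []
    for line in lines:
        number, filename = line.split('/')
        out.append([number, '/' + filename, filename.split('.')[0]])
    return out


def with_regex(lines):
    # Same list comprehension as in the regex answer
    return [[re.findall(r"\d+", i)[0], "/" + i.split("/")[-1],
             re.findall(r"[A-Z]+", i)[0]] for i in lines]


# Time 10000 calls of each version
split_time = timeit.timeit(lambda: with_split(LINES), number=10000)
regex_time = timeit.timeit(lambda: with_regex(LINES), number=10000)
print("split: {:.4f}s  regex: {:.4f}s".format(split_time, regex_time))
```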
Upvotes: 1
Reputation: 71451
You can use re.findall() together with str.split():
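import re

s = ["12345678/ABCDE.pdf",
     "12345678/ABCDE.pdf",
     "12345678/ABCDE.pdf"]
# For each line: the digits, the file name with its leading "/", and the first run of capital letters
new_s = [[re.findall(r"\d+", i)[0], "/"+i.split("/")[-1], re.findall(r"[A-Z]+", i)[0]] for i in s]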
Output:
[['12345678', '/ABCDE.pdf', 'ABCDE'], ['12345678', '/ABCDE.pdf', 'ABCDE'], ['12345678', '/ABCDE.pdf', 'ABCDE']]
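To get from these nested lists to the exact line format asked for in the question, each inner list can be passed to str.format. This is only a sketch: the names rows and formatted are mine, and the [A-Z]+ pattern assumes the titles consist of capital letters only, as in the sample data.

```python
import re

s = ["12345678/ABCDE.pdf",
     "12345678/ABCDE.pdf",
     "12345678/ABCDE.pdf"]

# Same extraction as above: digits, "/" plus file name, run of capital letters
rows = [[re.findall(r"\d+", i)[0], "/" + i.split("/")[-1],
         re.findall(r"[A-Z]+", i)[0]] for i in s]

# Quote the second and third fields and end each line with a comma
formatted = ["{},'{}','{}',".format(*row) for row in rows]
print("\n".join(formatted))
```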
Upvotes: 0