Niall Walsh

Reputation: 13

Python split string to multiple substrings with single quotations and a trailing comma

I am trying to split each line of a CSV file into three substrings that stay on the same line, with single quotation marks around the second and third substrings and a comma after each substring.

The lines in the CSV are in the following format:

12345678/ABCDE.pdf
12345678/ABCDE.pdf
12345678/ABCDE.pdf

I am new to Python. So far I have tried a split on each line, which returns the first two substrings without the / (shown below), but I am not sure how to obtain the final desired output.

'12345678', 'ABCDE.pdf'

I would like the output to look like this:

12345678,'/ABCDE.pdf','ABCDE',
12345678,'/ABCDE.pdf','ABCDE',
12345678,'/ABCDE.pdf','ABCDE',

with the final string containing the title of the pdf without the file extension.

Any help would be greatly appreciated.

Upvotes: 1

Views: 157

Answers (2)

tdube

Reputation: 2553

Using split again, you can easily construct the desired output string without the need for regex.

In [22]: %%timeit
    ...: s = '''12345678/ABCDE.pdf
    ...: 12345678/ABCDE.pdf
    ...: 12345678/ABCDE.pdf'''
    ...: for l in s.splitlines():
    ...:     s_parts = l.split('/')
    ...:     new_s = '{},\'/{}\',\'{}\','.format(s_parts[0], s_parts[1], s_parts[1].split('.')[0])
    ...:
100000 loops, best of 3: 3.55 µs per loop

Output:

Out[24]: "12345678,'/ABCDE.pdf','ABCDE',"
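
The same idea can be wrapped in a small function. This is only a sketch, assuming every line has the form `<number>/<name>.<extension>`; the name `format_line` is hypothetical. It swaps the second `split` for `str.partition` and `os.path.splitext`, so a file name containing extra dots keeps everything before the final extension.

```python
import os


def format_line(line):
    """Turn '12345678/ABCDE.pdf' into "12345678,'/ABCDE.pdf','ABCDE',"."""
    # Split once on the first slash: the number, then the file name.
    prefix, _, filename = line.strip().partition('/')
    # splitext drops only the final extension ('ABCDE.pdf' -> 'ABCDE').
    title = os.path.splitext(filename)[0]
    return "{},'/{}','{}',".format(prefix, filename, title)


lines = ['12345678/ABCDE.pdf', '12345678/ABCDE.pdf', '12345678/ABCDE.pdf']
for line in lines:
    print(format_line(line))
```

To process a real file, the list `lines` could be replaced by iterating over an open file object, since `format_line` strips the trailing newline itself.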

For comparison, the regex solution from the other answer, which also works, was timed as follows. At 11.6 µs versus 3.55 µs per loop the difference is small in absolute terms, but with a large number of items to process it could be a factor.

In [25]: %%timeit
    ...: s = ["12345678/ABCDE.pdf",
    ...:       "12345678/ABCDE.pdf",
    ...:       "12345678/ABCDE.pdf"]
    ...: new_s = [[re.findall("\d+", i)[0], "/"+i.split("/")[-1], re.findall("[A-Z]+", i)[0]] for i in s]
    ...:
100000 loops, best of 3: 11.6 µs per loop
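
Outside IPython, a comparison like this can be approximated with the standard `timeit` module. The sketch below is an assumption about how one might set it up, not the exact measurement above; the function names `with_split` and `with_regex` are made up for illustration, and absolute timings will differ from machine to machine.

```python
import re
import timeit

LINES = ["12345678/ABCDE.pdf"] * 3


def with_split(lines):
    # Plain string methods: split on '/', then on '.' for the title.
    out = []
    for line in lines:
        parts = line.split('/')
        title = parts[1].split('.')[0]
        out.append("{},'/{}','{}',".format(parts[0], parts[1], title))
    return out


def with_regex(lines):
    # Regex variant: digits, file name with leading '/', uppercase title.
    return [[re.findall(r"\d+", i)[0],
             "/" + i.split("/")[-1],
             re.findall(r"[A-Z]+", i)[0]] for i in lines]


for fn in (with_split, with_regex):
    # Total seconds for 10000 calls; divide by 10000 for a per-call figure.
    seconds = timeit.timeit(lambda: fn(LINES), number=10000)
    print(fn.__name__, seconds)
```

Note that the two functions do not return the same thing: the first builds the formatted strings, the second builds lists of parts, so the comparison is only approximate.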

Upvotes: 1

Ajax1234

Reputation: 71451

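You can use str.split() and re.findall():

import re  # needed for re.findall

s = ["12345678/ABCDE.pdf",
      "12345678/ABCDE.pdf",
      "12345678/ABCDE.pdf"]
# For each line: the digits, the file name with a leading "/", and the uppercase title
new_s = [[re.findall(r"\d+", i)[0], "/"+i.split("/")[-1], re.findall(r"[A-Z]+", i)[0]] for i in s]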

Output:

[['12345678', '/ABCDE.pdf', 'ABCDE'], ['12345678', '/ABCDE.pdf', 'ABCDE'], ['12345678', '/ABCDE.pdf', 'ABCDE']]
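
The nested lists above are not yet in the format the question asks for. One way to join each row into that format is sketched below, assuming every row is `[number, path, title]` as in the output above; `rows` and `to_line` are illustrative names only.

```python
rows = [['12345678', '/ABCDE.pdf', 'ABCDE'],
        ['12345678', '/ABCDE.pdf', 'ABCDE'],
        ['12345678', '/ABCDE.pdf', 'ABCDE']]


def to_line(row):
    # Unpack the three parts and quote the second and third ones.
    number, path, title = row
    return "{},'{}','{}',".format(number, path, title)


formatted = [to_line(row) for row in rows]
print("\n".join(formatted))
```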

Upvotes: 0
