Reputation: 3
This is my first time posting here.
I'd like to 1) parse the following text: "keyword: some keywords concept :some concepts"
and 2) store the result in a dictionary: ['keyword'] => 'some keywords', ['concept'] => 'some concepts'.
There may be 0 or 1 'space' before each 'colon'. The following is what I've tried so far.
import re
sample_text = "keyword: some keywords concept :some concepts"
p_res = re.compile(r"(\S+\s?):").split(sample_text) # Task 1
d_inc = dict(zip(p_res[::2], p_res[1::2])) # Task 2
However, the resulting list p_res is wrong: it has an empty entry at index 0, which in turn produces the wrong dict. Is there something wrong with my regex?
Upvotes: 0
Views: 2893
Reputation: 4863
Simply replace Task 1 with this line:
p_res = re.compile(r"(\S+\s?):").split(sample_text)[1:] # Task 1
This always drops the first element returned by re.split (normally an empty string).
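One detail worth noting: the pieces returned by re.split keep their surrounding whitespace (the key 'concept ' ends with a space, and ' some keywords ' has spaces on both sides). A minimal sketch of the sliced split with stripping added, assuming that trimmed keys and values are what is wanted:

```python
import re

sample_text = "keyword: some keywords concept :some concepts"

# Same pattern as above; the capturing group keeps each key in the result list.
parts = re.split(r"(\S+\s?):", sample_text)[1:]  # drop the leading (empty) element

# parts alternates key, value, key, value, ...; strip stray spaces from both.
d_inc = {k.strip(): v.strip() for k, v in zip(parts[::2], parts[1::2])}
print(d_inc)  # {'keyword': 'some keywords', 'concept': 'some concepts'} on Python 3.7+
```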
Background: Why does re.split return the empty first result?
What should the program do with this input:
sample_text = "Hello! keyword: some keywords concept :some concepts"
The text Hello! at the beginning of the input doesn't fit the definition of your problem (which assumes that the input starts with a key).
Do you want to ignore it? Do you want to raise an exception if it appears? Do you want to add it to your dictionary under a special key?
re.split doesn't decide this for you: it returns whatever text appears and leaves the decision to you. In our solution, we simply ignore whatever appears before the first key.
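The three options above could be sketched as follows. The helper name parse_pairs, its on_leading parameter, and the "_leading" key are hypothetical choices made for illustration, not part of the original question:

```python
import re

def parse_pairs(text, on_leading="ignore"):
    """Hypothetical helper: on_leading is 'ignore', 'raise', or 'store'."""
    # The first element is whatever precedes the first key (often empty).
    head, *parts = re.split(r"(\S+\s?):", text)
    result = {k.strip(): v.strip() for k, v in zip(parts[::2], parts[1::2])}

    head = head.strip()
    if head:
        if on_leading == "raise":
            raise ValueError(f"unexpected text before first key: {head!r}")
        if on_leading == "store":
            result["_leading"] = head  # "_leading" is an assumed special key
    return result

text = "Hello! keyword: some keywords concept :some concepts"
print(parse_pairs(text))                     # leading text ignored
print(parse_pairs(text, on_leading="store")) # leading text kept under "_leading"
```

Which policy is right depends on how much you trust the input; raising is the safest default when malformed input should not pass silently.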
Upvotes: 0
Reputation: 174874
Use re.findall to capture a list of groups for each match, then apply dict to convert the resulting list of tuples to a dict.
>>> import re
>>> s = 'keyword: some keywords concept :some concepts'
>>> dict(re.findall(r'(\S+)\s*:\s*(.*?)\s*(?=\S+\s*:|$)', s))
{'concept': 'some concepts', 'keyword': 'some keywords'}
The regex above captures each key and its corresponding value in two separate groups.
I assume that the input string contains only key-value pairs and that keys don't contain any whitespace characters.
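To make the pattern easier to read, it can be written with re.VERBOSE and wrapped in a function. This is only a sketch of the same regex; the names PAIR and parse are made up for illustration, and the same assumptions apply (keys without whitespace, input consisting only of key-value pairs):

```python
import re

# Same regex as above, spread out with comments via re.VERBOSE.
PAIR = re.compile(r"""
    (\S+)            # group 1: key, a run of non-space characters
    \s*:\s*          # colon with optional whitespace on either side
    (.*?)            # group 2: value, matched as lazily as possible
    \s*
    (?=\S+\s*:|$)    # stop before the next key-and-colon or at end of string
""", re.VERBOSE)

def parse(text):
    # findall returns a list of (key, value) tuples, which dict accepts directly.
    return dict(PAIR.findall(text))

print(parse("keyword: some keywords concept :some concepts"))
```

Because the lookahead treats any non-space run followed by a colon as the start of the next key, a value that itself contains such a run would be split there.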
Upvotes: 3