Reputation: 572
I want to extract several links from a text (comments) stored in a pandas dataframe. My goal is to add the extracted URLs to a new column of the original dataset. Using the following method applied to my text, I am able to extract the URLs, store them in the variable URL and transform that into another pandas dataframe. I am not sure whether this is an efficient way to extract the necessary information.
URL = (ALL.textOriginal.str.extractall("(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True))
URL_df = pd.DataFrame(data=URL)
URL_df.drop([1],axis=1)
gives me:
596 https://www.tag24.de/nachrichten
596 http://www.tt.com/panorama
596 http://www.wz.de/lokales
666 https://www.svz.de/regionales
666 https://www.watson.ch/Leben
... ...
The dataframe contains only the indices and the hyperlinks. The problem with this method is that some of the indices are duplicated, because one comment can contain several URLs, all of which are extracted. I tried different ways to solve this problem, such as:
pd.concat([ALL, URL_df.drop([1], axis=1)], axis=1).sort_index()
I also tried to store the URLs directly in the original dataframe by applying:
ALL['URL'] = ALL.textOriginal.str.extractall("(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True)
but I only received this error message:
"incompatible index of the inserted column with frame index"
As I said before, my goal is to store the extracted URLs in a new column, like this:
text URL
"blablabla link1, link2, link3" [https://www.tag24.de/nachrichten, http://www.tt.com/panorama, http://www.wz.de/lokales]
"blablabla link1, link2" [https://www.svz.de/regionales, https://www.watson.ch/Leben]
... ...
Upvotes: 2
Views: 801
Reputation: 862511
I think you need findall:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.findall(pat)
print (ALL)
textOriginal \
0 https://www.tag24.de/nachrichten http://www.tt...
1 https://www.svz.de/regionales https://www.wats...
URL
0 [https://www.tag24.de/nachrichten, http://www....
1 [https://www.svz.de/regionales, https://www.wa...
Another solution with extractall, which returns a MultiIndex, so it is necessary to groupby by the first level and create lists:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.extractall(pat).groupby(level=0)[0].apply(list)
print (ALL)
textOriginal \
0 https://www.tag24.de/nachrichten http://www.tt...
1 https://www.svz.de/regionales https://www.wats...
URL
0 [https://www.tag24.de/nachrichten, http://www....
1 [https://www.svz.de/regionales, https://www.wa...
Setup:
ALL = pd.DataFrame({'textOriginal': ['https://www.tag24.de/nachrichten http://www.tt.com/panorama http://www.wz.de/lokales', 'https://www.svz.de/regionales https://www.watson.ch/Leben']})
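As a quick sanity check (a sketch reusing the Setup above), both solutions should produce the same lists:
import pandas as pd

ALL = pd.DataFrame({'textOriginal': [
    'https://www.tag24.de/nachrichten http://www.tt.com/panorama http://www.wz.de/lokales',
    'https://www.svz.de/regionales https://www.watson.ch/Leben']})

pat = r"(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"

via_findall = ALL.textOriginal.str.findall(pat)
via_extractall = ALL.textOriginal.str.extractall(pat).groupby(level=0)[0].apply(list)

print(via_findall.tolist() == via_extractall.tolist())   # True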
Upvotes: 3
Reputation: 176
Let's say you have a dataframe with two columns, 'Indice' and 'Link', where Indice is not unique. You can aggregate all the links with the same Indice in the following way:
myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
In this way, you obtain a new dataframe indexed by 'Indice', whose 'Link' column holds the list of the previous links (call reset_index() if you want 'Indice' back as a regular column).
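A quick, self-contained example of that aggregation on a toy frame (the column names and values here are just placeholders mirroring the output above):
import pandas as pd

# Toy frame with a non-unique 'Indice', mirroring the extractall output above.
myDF = pd.DataFrame({'Indice': [596, 596, 666],
                     'Link': ['https://www.tag24.de/nachrichten',
                              'http://www.tt.com/panorama',
                              'https://www.svz.de/regionales']})

myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
print(myAggregateDF)
# 'Indice' becomes the index; each 'Link' entry is now a list of the grouped links.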
Pay attention though: this method is not especially efficient. groupby is memory hungry, and that can be a problem with large dataframes.
Upvotes: 1