Word embedding extraction

I am using Python 2.7, and I have a file of pre-trained embeddings for English. I need to look up the embedding of a given word in this file.

The embeddings have 300 dimensions, and the file is formatted this way:

the -0.0279698616277 -0.00822567637943 -0.066859518431 0.0152934683231 -0.0329719520937 0.0530985715151 0.0346279291928 0.000898163363809 -0.0342044668875 -0.0358478199459 0.0330627337979 -0.0291780565785 -0.050316270082 0.0226246942919 -0.0999551118641 -0.0211768282161 -0.0650169654368 -0.13170513108 0.0136621823624 0.00761099698762 -0.0747038745232 -0.0309831087459 -0.0281774157081 -0.0381752846197 0.000854164869137 0.118230081556 -0.0544820178539 -0.0259578123228 -0.0250848970404 0.0432551614539 0.0604299831315 0.0605994794422 -0.0652365866148 0.0741619690129 -0.0122427203782 -0.0486630776978 0.0266766400501 -0.0575422338293 -0.0120115890454 0.067022888369 0.0563923322428 0.116347799963 0.0272241149902 -0.0271056717851 -0.0876134412848 -0.0160824708647 0.0478176382685 -0.0278610721008 -0.043103116023 -0.123507487497 -0.0286480325182 -0.00985009337681 -0.00749645238334 -0.00322952663845 -0.046423238718 0.103032221776 0.0821490881533 -0.121380150997 -0.00599957532621 -0.0843011157914 -0.0667407039306 0.0204320098169 -0.0953102074899 -0.0644943672828 -0.00133722007224 0.00249399062204 -0.0199877549741 -0.0494372284268 0.00730022281006 0.100155611334 0.0158984940368 0.0919811737074 -0.0762293413195 0.110083862374 0.0495974423547 -0.0737607844265 0.0507363907294 0.01065877457 -0.0101547411817 0.0437805443228 0.0801814086384 -0.0739505163318 0.0359545673486 0.122458949531 -0.0289695742598 0.0247212132806 -0.0799729263198 -0.0204555870693 -0.00530952298573 -0.0580316010527 0.0849861556452 -0.0386267797212 0.0264685290268 -0.0680456213105 0.0826555349612 -0.0264161763876 -0.0995871582083 0.0344213033507 0.0533503097378 0.037602190303 -0.061794122114 -0.00452664681682 -0.025897662482 -0.0804463278447 -0.0725472056937 -0.109343313871 0.0121977936453

I tried using .split(" "), but that splits the vector as well. Any idea how to search for a word and extract its vector from the file?

Upvotes: 1

Answers (3)

BoarGules

Reputation: 16942

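How about splitting only once, at the first run of whitespace? That leaves the word in one variable and the rest of the line (still a single string) in the other: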

line = "the -0.0279698616277 -0.00822567637943 -0.0668... etc"
word, vector = line.split(None,1)
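
As a minimal sketch of how that split could be turned into a usable value: after split(None, 1) the vector is still one string, so it can be split again and each piece converted to a float. The helper name parse_line is hypothetical and not part of the answer above, and the sample line is shortened to three values.

```python
def parse_line(line):
    """Split one embedding line into (word, list of floats)."""
    # Split on the first run of whitespace only: the word, then the rest of the line.
    word, rest = line.split(None, 1)
    # The remainder is a whitespace-separated list of numbers.
    return word, [float(x) for x in rest.split()]


word, vector = parse_line("the -0.0279698616277 -0.00822567637943 -0.066859518431")
print(word)         # the
print(len(vector))  # 3
```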

Upvotes: 0

cwl

Reputation: 204

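Most values look to be 15 characters long, or 16 when they start with '-', but the length is not constant (0.01065877457 in the sample line is shorter). So I suggest using re with a pattern that matches any decimal number:

import re
line = "the -0.0279698616277 -0.00822567637943 -0.066859518431"
# optional minus sign, digits, a literal dot, digits, then an optional exponent
res = re.findall(r'-?[0-9]+\.[0-9]+(?:[eE][-+]?[0-9]+)?', line)
print(res)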

Give it a try. I don't have the data, so I could not test this properly, and my suggestion may not be helpful.

Upvotes: 0

MiniQuark

Reputation: 48436

This code will parse the whole file and build a dict with the embedding vectors:

>>> embeddings = {}
>>> with open("pretrained_embeddings.txt", "rb") as f:
...     for line in f.xreadlines():
...         line = line.decode("utf-8")
...         columns = line.strip().split()
...         embeddings[columns[0]] = [float(n) for n in columns[1:]]
... 
>>> embeddings["the"]
[-0.0279698616277, -0.00822567637943, -0.066859518431, 0.0152934683231, -0.0329719520937, 0.0530985715151, 0.0346279291928, 0.000898163363809, -0.0342044668875, -0.0358478199459, 0.0330627337979, -0.0291780565785, -0.050316270082, 0.0226246942919, -0.0999551118641, -0.0211768282161, -0.0650169654368, -0.13170513108, 0.0136621823624, 0.00761099698762, -0.0747038745232, -0.0309831087459, -0.0281774157081, -0.0381752846197, 0.000854164869137, 0.118230081556, -0.0544820178539, -0.0259578123228, -0.0250848970404, 0.0432551614539, 0.0604299831315, 0.0605994794422, -0.0652365866148, 0.0741619690129, -0.0122427203782, -0.0486630776978, 0.0266766400501, -0.0575422338293, -0.0120115890454, 0.067022888369, 0.0563923322428, 0.116347799963, 0.0272241149902, -0.0271056717851, -0.0876134412848, -0.0160824708647, 0.0478176382685, -0.0278610721008, -0.043103116023, -0.123507487497, -0.0286480325182, -0.00985009337681, -0.00749645238334, -0.00322952663845, -0.046423238718, 0.103032221776, 0.0821490881533, -0.121380150997, -0.00599957532621, -0.0843011157914, -0.0667407039306, 0.0204320098169, -0.0953102074899, -0.0644943672828, -0.00133722007224, 0.00249399062204, -0.0199877549741, -0.0494372284268, 0.00730022281006, 0.100155611334, 0.0158984940368, 0.0919811737074, -0.0762293413195, 0.110083862374, 0.0495974423547, -0.0737607844265, 0.0507363907294, 0.01065877457, -0.0101547411817, 0.0437805443228, 0.0801814086384, -0.0739505163318, 0.0359545673486, 0.122458949531, -0.0289695742598, 0.0247212132806, -0.0799729263198, -0.0204555870693, -0.00530952298573, -0.0580316010527, 0.0849861556452, -0.0386267797212, 0.0264685290268, -0.0680456213105, 0.0826555349612, -0.0264161763876, -0.0995871582083, 0.0344213033507, 0.0533503097378, 0.037602190303, -0.061794122114, -0.00452664681682, -0.025897662482, -0.0804463278447, -0.0725472056937, -0.109343313871, 0.0121977936453]

Notes:

The parser is very strict about the format. An empty line, for example, raises an IndexError at columns[0].
The code is written for Python 2. On Python 3, replace f.xreadlines() with f; iterating over f directly also works on Python 2.
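
If only one word is needed, building the whole dict may be more than necessary. The following is a rough sketch, not part of the answer above, that scans the file and stops at the first matching line. It assumes the same whitespace-separated format and UTF-8 encoding, and the function name find_embedding is hypothetical. It uses io.open so that it runs on both Python 2.7 and Python 3, and it skips blank lines instead of failing on them.

```python
import io


def find_embedding(path, target):
    """Return the vector for `target` as a list of floats, or None if absent."""
    # io.open takes an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            columns = line.split()
            # Skip blank lines instead of failing on columns[0].
            if not columns:
                continue
            if columns[0] == target:
                # Stop reading as soon as the word is found.
                return [float(n) for n in columns[1:]]
    return None
```

Calling find_embedding("pretrained_embeddings.txt", "the") would then return the vector for "the", or None if the word is not in the file. For many lookups against the same file, the dict built above is likely the better choice, since this sketch rereads the file on every call.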

Upvotes: 1
