Reputation: 8481
GloVe pre-trained word vectors, which can be downloaded here (https://nlp.stanford.edu/projects/glove/), have the following file format:
government 0.38797 -1.0825 0.45025 -0.23341 0.086307 -0.25721 -0.18281 -0.10037 -0.50099 -0.58361 -0.052635 -0.14224 0.0090217 -0.38308 0.18503 0.42444 0.10611 -0.1487 1.0801 0.065757 0.64552 0.1908 -0.14561 -0.87237 -0.35568 -2.435 0.28428 -0.33436 -0.56139 0.91404 4.0129 0.072234 -1.2478 -0.36592 -0.50236 0.011731 -0.27409 -0.50842 -0.2584 -0.096172 -0.67109 0.40226 0.27912 -0.37317 -0.45049 -0.30662 -1.6426 1.1936 0.65343 -0.76293
It's a space-delimited file where the first token in each row is the word and the N remaining columns are floating-point values for the word vector. N can be 50, 100, 200, or 300 depending on the file being used. The example above is for N=50 (i.e. 50-dimensional word vectors).
If I load the data file as a CSV with sep=' ' and header=False (there is no header in the file), I get the following for a row:
Row(_c0='the', _c1='0.418', _c2='0.24968', _c3='-0.41242', _c4='0.1217', _c5='0.34527', _c6='-0.044457', _c7='-0.49688', _c8='-0.17862', _c9='-0.00066023', _c10='-0.6566', _c11='0.27843', _c12='-0.14767', _c13='-0.55677', _c14='0.14658', _c15='-0.0095095', _c16='0.011658', _c17='0.10204', _c18='-0.12792', _c19='-0.8443', _c20='-0.12181', _c21='-0.016801', _c22='-0.33279', _c23='-0.1552', _c24='-0.23131', _c25='-0.19181', _c26='-1.8823', _c27='-0.76746', _c28='0.099051', _c29='-0.42125', _c30='-0.19526', _c31='4.0071', _c32='-0.18594', _c33='-0.52287', _c34='-0.31681', _c35='0.00059213', _c36='0.0074449', _c37='0.17778', _c38='-0.15897', _c39='0.012041', _c40='-0.054223', _c41='-0.29871', _c42='-0.15749', _c43='-0.34758', _c44='-0.045637', _c45='-0.44251', _c46='0.18785', _c47='0.0027849', _c48='-0.18411', _c49='-0.11514', _c50='-0.78581')
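For context, a minimal sketch of the load call described above (the file path and the spark session name are assumptions for illustration, not from the original post):

df = spark.read.csv("glove.6B.50d.txt", sep=" ", header=False)  # no schema, so every column is read as a string
df.first()  # Row(_c0='the', _c1='0.418', ...)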
My question is whether there is a way to specify a schema such that the first column could be read in as a StringType column and the N remaining columns read as an ArrayType of N floating-point values?
Upvotes: 4
Views: 2583
Reputation: 2146
You may try the following PySpark approach to build the desired schema, which you can then pass to schema() when loading your GloVe data. The idea is to use a loop to define FloatType fields for the N-dimensional embedding. Granted, it's slightly hacky and there may be a more elegant solution.
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType

def make_glove_schema(keyword="word", N=50):
    """Make a GloVe schema of length N + 1, with fields 1 : N+1 as float types.

    Params
    ------
    keyword (str): name of the first, i.e. 0th-index column
    N (int): dimension of the GloVe representation

    Returns
    -------
    S (pyspark.sql.types.StructType): schema to use when loading GloVe data.
    """
    # One string field for the word, followed by N float fields named "0".."N-1".
    word_field = StructType([StructField(keyword, StringType())])
    vector_fields = StructType([StructField(str(x), FloatType()) for x in range(N)])
    # Concatenate both field lists into a single StructType.
    return StructType(word_field.fields + vector_fields.fields)
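As a quick sanity check (an illustration, not part of the original answer), you can inspect the generated schema for a small N:

print(make_glove_schema(N=3).simpleString())
# struct<word:string,0:float,1:float,2:float>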
Then you may reference/load your (space-separated) file as follows, assuming you've specified or have a sqlContext:
glove_schema = make_glove_schema()
f = "path_to_your_glove_data"
df_glove = sqlContext.read.format("csv") \
    .option("delimiter", " ") \
    .schema(glove_schema) \
    .load(f)
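Since the question asks for the N values as a single ArrayType column, one follow-up option (a sketch, not part of the original answer; the "vector" column name is made up here) is to collapse the float columns with pyspark.sql.functions.array after loading:

from pyspark.sql.functions import array

# Collapse the N float columns named "0".."49" into one array<float> column.
float_cols = [str(i) for i in range(50)]
df_vectors = df_glove.select("word", array(*float_cols).alias("vector"))
df_vectors.printSchema()  # word: string, vector: array<float>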
Upvotes: 1