Evan Zamir

Reputation: 8481

Creating Spark schema for GloVe word vector files

GloVe pre-trained word vectors, which can be downloaded here (https://nlp.stanford.edu/projects/glove/), have the following file format:

government 0.38797 -1.0825 0.45025 -0.23341 0.086307 -0.25721 -0.18281 -0.10037 -0.50099 -0.58361 -0.052635 -0.14224 0.0090217 -0.38308 0.18503 0.42444 0.10611 -0.1487 1.0801 0.065757 0.64552 0.1908 -0.14561 -0.87237 -0.35568 -2.435 0.28428 -0.33436 -0.56139 0.91404 4.0129 0.072234 -1.2478 -0.36592 -0.50236 0.011731 -0.27409 -0.50842 -0.2584 -0.096172 -0.67109 0.40226 0.27912 -0.37317 -0.45049 -0.30662 -1.6426 1.1936 0.65343 -0.76293

It's a space-delimited file where the first token in each row is the word and the N remaining columns are floating point values for the word vector. N can be 50, 100, 200, or 300 depending on the file being used. The example above is for N=50 (i.e. 50-dimensional word vectors).

If I load the data file as a csv with sep=' ' and header=False (there is no header in the file), I get the following for a row:

Row(_c0='the', _c1='0.418', _c2='0.24968', _c3='-0.41242', _c4='0.1217', _c5='0.34527', _c6='-0.044457', _c7='-0.49688', _c8='-0.17862', _c9='-0.00066023', _c10='-0.6566', _c11='0.27843', _c12='-0.14767', _c13='-0.55677', _c14='0.14658', _c15='-0.0095095', _c16='0.011658', _c17='0.10204', _c18='-0.12792', _c19='-0.8443', _c20='-0.12181', _c21='-0.016801', _c22='-0.33279', _c23='-0.1552', _c24='-0.23131', _c25='-0.19181', _c26='-1.8823', _c27='-0.76746', _c28='0.099051', _c29='-0.42125', _c30='-0.19526', _c31='4.0071', _c32='-0.18594', _c33='-0.52287', _c34='-0.31681', _c35='0.00059213', _c36='0.0074449', _c37='0.17778', _c38='-0.15897', _c39='0.012041', _c40='-0.054223', _c41='-0.29871', _c42='-0.15749', _c43='-0.34758', _c44='-0.045637', _c45='-0.44251', _c46='0.18785', _c47='0.0027849', _c48='-0.18411', _c49='-0.11514', _c50='-0.78581')
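For reference, here is a minimal sketch of that load call (assuming a SparkSession named spark; the file path is hypothetical, and all columns come back as strings because no schema is supplied):

# Load the GloVe text file as a space-delimited CSV with no header.
# "glove.6B.50d.txt" is a placeholder path; point it at your downloaded file.
df = spark.read.csv("glove.6B.50d.txt", sep=" ", header=False)
df.first()  # -> Row(_c0='the', _c1='0.418', ..., _c50='-0.78581')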

My question is: is there a way to specify a schema such that the first column is read in as a StringType column and the N remaining columns are read as a single ArrayType column of N floating point values?

Upvotes: 4

Views: 2583

Answers (1)

Quetzalcoatl

Reputation: 2146

You may try the following PySpark method to get the desired schema, which you can then pass to schema() when loading your GloVe data. The idea is to use a loop to define float types for the N-dimensional embedding. Granted, it's slightly hacky and there may be a more elegant solution.

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType

def make_glove_schema(keyword="word", N=50):
    """Make a GloVe schema of length N + 1, with columns 1 : N+1 as float types.

    Params
    ------
    keyword (str): name of the first, i.e. 0th index column
    N (int): dimension of the GloVe representation

    Returns
    -------
    S (pyspark.sql.types.StructType): schema to use when loading GloVe data.

    """
    word_field = StructType([StructField(keyword, StringType())])
    vector_fields = StructType([StructField(str(x), FloatType()) for x in range(N)])
    # Concatenating the two field lists yields a single flat StructType schema.
    return StructType(word_field.fields + vector_fields.fields)

Then you may reference/load your (space-delimited) file as follows, assuming you've specified or have a sqlContext:

glove_schema = make_glove_schema()
f = "path_to_your_glove_data"
df_glove = sqlContext.read.format("csv").\
                option("delimiter","\t").\
                schema(glove_schema).\
                load(f)
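
If you additionally want the N float columns collapsed into a single ArrayType column, as the question asks, one option is to do that after loading. A minimal sketch, assuming the schema above (columns named "0" through str(N-1)) and pyspark.sql.functions.array:

from pyspark.sql.functions import array, col

N = 50
# Collapse the N float columns into one array column alongside the word column.
df_glove = df_glove.select(
    col("word"),
    array(*[col(str(i)) for i in range(N)]).alias("vector")
)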

Upvotes: 1
