Reputation: 385
I have a dataset consisting of hotel reviews, ratings, and other features such as traveler type and word count of the review. I want to perform topic modeling (LDA) and use the topics derived from the reviews, together with the other features, to identify the features that most affect the ratings (with ratings as the dependent variable).
If I want to use linear regression to do this, does this mean I would have to label each review with the topics derived? Is there a way to do this in R or will I have to manually label each review? (I am new to text mining and data science in general.)
Upvotes: 1
Views: 1498
Reputation: 14425
The short answer: you don't have to label each review with the topics yourself. You rely on the topic model you train to determine the topics of each review, and its output is then used to construct features for your regression model.
There is a good explanation of topic modeling with code samples (in R) at www.tidytextmining.com/topicmodeling.html. Sections 6.2.1 and 6.2.2 should help you quickly get started.
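As a rough sketch of that workflow (not a copy of the book's code), the snippet below fits an LDA model on a tiny made-up set of reviews using the `tidytext` and `topicmodels` packages. The `reviews` data, the column names, and `k = 4` are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy data: one row per review
reviews <- tibble(
  review_id = 1:6,
  text = c(
    "great location close to the train station",
    "staff were friendly and helpful at check in",
    "room was clean and the bed was comfortable",
    "the gym and pool were excellent amenities",
    "walking distance to the station, very convenient location",
    "rude staff and a dirty room"
  )
)

# Tokenize, drop stop words, and count words per review
word_counts <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(review_id, word)

# Cast to a document-term matrix, the input format LDA() expects
dtm <- cast_dtm(word_counts, review_id, word, n)

# k = 4 is an assumption for this toy example; the seed makes the fit repeatable
lda_fit <- LDA(dtm, k = 4, control = list(seed = 1234))

# beta: per-topic word probabilities; gamma: per-review topic probabilities
topic_terms   <- tidy(lda_fit, matrix = "beta")
review_topics <- tidy(lda_fit, matrix = "gamma")
```

`review_topics` has one row per review and topic, so no manual labelling is involved. With real data you would pick `k` by inspecting the top terms in `topic_terms`.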
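Keep in mind the two principles behind LDA: every document is a mixture of topics, and every topic is a mixture of words. Once a topic model has been trained on the reviews, for every review you get a probability for each topic (the document-topic probabilities).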
A simplified example: there might be 4 topics the reviews broadly fall under.
The document-topic probabilities combined with the top terms of each topic can be used as features similar to:
topic_1_location_probability
topic_2_hotel_staff_probability
topic_3_hotel_room_probability
topic_4_hotel_amenities_probability
is_convenient_location
is_train_station_nearby
is_walk_distance
is_clean
is_late_checkout
is_fitness_centre
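To show how features like these could feed a regression, here is a minimal sketch. It assumes the document-topic probabilities are already in long format (as `tidy(lda_fit, matrix = "gamma")` would return them). All data below is randomly generated, and the names `reviews`, `rating`, `word_count`, and `traveler_type` are hypothetical.

```r
library(dplyr)
library(tidyr)

set.seed(1)
n_reviews <- 40

# Hypothetical stand-in for LDA output: each row of `gamma` sums to 1
raw   <- matrix(rexp(n_reviews * 4), ncol = 4)
gamma <- raw / rowSums(raw)

# Long format, one row per (review, topic); values are made up
review_topics <- tibble(
  document = rep(as.character(1:n_reviews), times = 4),
  topic    = rep(1:4, each = n_reviews),
  gamma    = as.vector(gamma)
)

# Hypothetical review-level data: rating plus non-text features
reviews <- tibble(
  document      = as.character(1:n_reviews),
  rating        = sample(1:5, n_reviews, replace = TRUE),
  word_count    = sample(20:300, n_reviews, replace = TRUE),
  traveler_type = factor(sample(c("business", "couple", "family"),
                                n_reviews, replace = TRUE))
)

# One column per topic: topic_1 ... topic_4
topic_features <- review_topics %>%
  pivot_wider(names_from = topic, values_from = gamma, names_prefix = "topic_")

model_data <- inner_join(reviews, topic_features, by = "document")

# Topic probabilities sum to 1 per review, so one topic (here topic_4) is
# left out as the baseline to avoid perfect collinearity with the intercept.
fit <- lm(rating ~ topic_1 + topic_2 + topic_3 + word_count + traveler_type,
          data = model_data)
summary(fit)
```

Binary term features such as `is_clean` could be added as extra columns in `model_data` in the same way. Since the data here is random, the coefficients mean nothing; the point is only the shape of the feature table.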
For newer reviews:
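One possible approach, sketched here under some assumptions, is to score new reviews with the already trained model using `posterior()` from `topicmodels` instead of retraining. The training and new reviews below are made up, and `to_counts()` is a hypothetical helper defined only for this example.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical helper: tokenize, drop stop words, count words per review
to_counts <- function(df) {
  df %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    count(review_id, word)
}

# Hypothetical training reviews
train <- tibble(
  review_id = paste0("train_", 1:6),
  text = c(
    "great location close to the train station",
    "staff were friendly and helpful at check in",
    "room was clean and the bed was comfortable",
    "the gym and pool were excellent amenities",
    "walking distance to the station, very convenient location",
    "rude staff and a dirty room"
  )
)

train_dtm <- cast_dtm(to_counts(train), review_id, word, n)
lda_fit   <- LDA(train_dtm, k = 4, control = list(seed = 1234))

# Hypothetical new reviews that arrived after training
new_reviews <- tibble(
  review_id = c("new_1", "new_2"),
  text = c("friendly staff and a clean room",
           "great location near the station")
)

# Keep only words the model saw during training; a review left with no
# known words would need separate handling (not covered here)
new_counts <- to_counts(new_reviews) %>%
  filter(word %in% colnames(train_dtm))
new_dtm <- cast_dtm(new_counts, review_id, word, n)

# Topic probabilities for the new reviews: one row per review, one column per topic
new_topics <- posterior(lda_fit, newdata = new_dtm)$topics
```

The rows of `new_topics` can then be turned into the same `topic_*_probability` columns used for the regression and passed to `predict()`.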
I hope this helps you.
Upvotes: 3