I'm working with the Yahoo! Webscope dataset ydata-frontpage-todaymodule-clicks-v1_0 (specifically, the click logs for the first ten days in May 2009). The dataset description states that each user and article has 6 features, numbered 1 through 6. Feature #1 is a constant (always 1), and features #2-6 are constructed via conjoint analysis. The format is feature_id:feature_value pairs.
However, I've found instances where an article has a feature with feature_id = 7. Here's an example line where this occurs:
1241196300 109522 0 |user 2:0.008078 3:0.005109 4:0.000172 5:0.007422 6:0.979220 1:1.000000 |109523 2:0.316894 3:0.000023 4:0.210890 5:0.198013 6:0.274180 1:1.000000 |109498 2:0.306008 3:0.000450 4:0.077048 5:0.230439 6:0.386055 1:1.000000 |109509 2:0.306008 3:0.000450 4:0.077048 5:0.230439 6:0.386055 1:1.000000 |109508 2:0.264355 3:0.000012 4:0.037393 5:0.420649 6:0.277591 1:1.000000 |109473 2:0.295442 3:0.000014 4:0.135191 5:0.292304 6:0.277050 1:1.000000 |109524 2:0.274868 3:0.000032 4:0.046639 5:0.362209 6:0.316252 1:1.000000 |109527 2:0.375829 3:0.000025 4:0.033041 5:0.349637 6:0.241468 1:1.000000 |109520 2:0.016328 3:0.953419 4:0.000538 5:0.008263 6:0.021452 1:1.000000 |109503 2:0.306008 3:0.000450 4:0.077048 5:0.230439 6:0.386055 1:1.000000 |109510 2:0.287909 3:0.000025 4:0.008983 5:0.511333 6:0.191751 1:1.000000 |109526 2:0.432433 3:0.000002 4:0.069055 5:0.351774 6:0.146736 1:1.000000 |109495 2:0.313277 3:0.000125 4:0.018413 5:0.410555 6:0.257630 1:1.000000 |109506 2:0.264355 3:0.000012 4:0.037393 5:0.420649 6:0.277591 1:1.000000 |109512 2:0.297322 3:0.000025 4:0.034951 5:0.413566 6:0.254137 1:1.000000 |109511 2:0.381149 3:0.000129 4:0.060038 5:0.269129 6:0.289554 1:1.000000 |109514 2:0.297750 3:0.000013 4:0.011603 5:0.512182 6:0.178452 1:1.000000 |109528 7:1.000000 |109522 2:0.214605 3:0.000037 4:0.410493 5:0.097704 6:0.277162 1:1.000000 |109515 2:0.281649 3:0.000173 4:0.195994 5:0.151003 6:0.371182 1:1.000000 |109525 2:0.306008 3:0.000450 4:0.077048 5:0.230439 6:0.386055 1:1.000000 |109513 2:0.211406 3:0.000036 4:0.002773 5:0.569886 6:0.215900 1:1.000000
Specifically, the article with article_id=109528 has only a single feature, 7:1.000000, and none of the documented features 1 through 6. Has anyone else encountered this in this dataset? Any insight into why the discrepancy exists and how to handle it when parsing the data? Does it indicate a broader problem with the dataset?
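For the parsing side, here is a minimal sketch of how such lines could be detected, assuming the layout shown in the example above (a header of timestamp, displayed article id, and click flag, followed by `|`-separated sections of feature_id:feature_value pairs). The names `parse_line`, `unexpected_articles`, and `EXPECTED_IDS` are my own and hypothetical, not part of the dataset tooling. The sketch only flags articles with feature ids outside 1-6; whether to drop them, skip the whole event, or treat them as a special case is still the open question.

```python
from typing import Dict, List, Tuple

# Feature ids 1..6, as stated in the dataset description.
EXPECTED_IDS = frozenset(range(1, 7))

Features = Dict[int, float]


def parse_line(line: str) -> Tuple[int, str, int, Features, Dict[str, Features]]:
    """Parse one log line into (timestamp, displayed_id, click, user, articles)."""
    header, *sections = line.strip().split("|")
    timestamp, displayed_id, click = header.split()

    user: Features = {}
    articles: Dict[str, Features] = {}
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue  # tolerate an empty section
        name, pairs = tokens[0], tokens[1:]
        features: Features = {}
        for pair in pairs:
            feature_id, value = pair.split(":")
            features[int(feature_id)] = float(value)
        if name == "user":
            user = features
        else:
            articles[name] = features
    return int(timestamp), displayed_id, int(click), user, articles


def unexpected_articles(articles: Dict[str, Features]) -> Dict[str, List[int]]:
    """Map article id -> sorted feature ids that fall outside EXPECTED_IDS."""
    flagged: Dict[str, List[int]] = {}
    for article_id, features in articles.items():
        extra = sorted(set(features) - EXPECTED_IDS)
        if extra:
            flagged[article_id] = extra
    return flagged
```

Running `unexpected_articles` over the articles from every parsed line and counting the flagged ids would show how widespread the feature_id = 7 entries are before deciding how to treat them.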