tinman

Reputation: 3

Quanteda: How can I select and examine a specific feature within an FCM?

I have a feature co-occurrence matrix of 8,347 by 8,347 with tri = FALSE. I would like to be able to select a feature individually so that I can see what terms frequently co-occur with it. Seemingly this would entail selecting the column for the feature and sorting the associated rows in descending order.

fcm_select doesn't work, because it isolates the term in both the column and the row:

SELECT_FROM_FCM <- fcm_select(
    MY_FCM,
    pattern = "FEATURE",
    selection = "keep",
    valuetype = "glob",
    case_insensitive = TRUE
)

View(SELECT_FROM_FCM)

+---------+---------+
|         | FEATURE |
+---------+---------+
| FEATURE | 667     |
+---------+---------+

dfm_subset also doesn't seem to work. Am I going about this the wrong way?

Upvotes: 0

Views: 409

Answers (1)

Ken Benoit

Reputation: 14902

You can form the fcm and then select from it using normal matrix indexing operations. In this example, I form a document-context feature co-occurrence matrix from the last 10 inaugural addresses, and then select the columns for "war" and "terror" to see which features co-occur with them.

library("quanteda")
## Package version: 2.0.1

fcmat <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE) %>%
  fcm()

# select a specific feature
fcmat[, c("war", "terror")]
## Feature co-occurrence matrix of: 3,467 by 2 features.
##            features
## features    war terror
##   Senator    10      2
##   Hatfield    1      1
##   Mr         18      3
##   Chief       7      1
##   Justice     7      1
##   President  32      8
##   Vice        9      2
##   Bush        4      2
##   Mondale     1      1
##   Baker       1      1
## [ reached max_feat ... 3,457 more features ]
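
To rank the co-occurring terms, as the question asks, one option is to pull the selected column out as a named vector and sort it. The following is a minimal sketch on a made-up three-document toy corpus; the texts and the object names toks, fcmat_toy, cooc and top are illustrative only. It passes tri = FALSE because fcm() defaults to tri = TRUE, which stores only the upper triangle, so a single column would otherwise miss some pairs.

```r
library("quanteda")

# Toy tokens (hypothetical texts, for illustration only)
toks <- tokens(c(d1 = "war terror peace",
                 d2 = "war terror freedom",
                 d3 = "war peace"))

# tri = FALSE keeps the full symmetric matrix
fcmat_toy <- fcm(toks, context = "document", tri = FALSE)

# Pull out one column as a named numeric vector
cooc <- as.matrix(fcmat_toy[, "war"])[, 1]

# Drop the target feature itself, then sort in descending order
cooc <- cooc[names(cooc) != "war"]
top <- sort(cooc, decreasing = TRUE)
head(top, 10)
```

Subsetting before calling as.matrix() means only the one column is made dense, which should matter for a matrix as large as 8,347 by 8,347.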

In the forthcoming 2.1.0 release (available on GitHub only as of 5 June 2020), you can use char_select() to get pattern matching on the features, e.g.:

# only in forthcoming 2.1.0 (currently on GitHub)
fcmat[, char_select(featnames(fcmat), "terror*")]
## Feature co-occurrence matrix of: 3,467 by 2 features.
##            features
## features    terror terrorism
##   Senator        2         2
##   Hatfield       1         1
##   Mr             3         3
##   Chief          1         2
##   Justice        1         2
##   President      8        10
##   Vice           2         2
##   Bush           2         2
##   Mondale        1         1
##   Baker          1         1
## [ reached max_feat ... 3,457 more features ]
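
If char_select() is not available in the installed version, one possible workaround is base R pattern matching on the column names. This is a sketch on made-up toy tokens, not the inaugural corpus; the regular expression "^terror" stands in for the glob "terror*", and the object names are illustrative.

```r
library("quanteda")

# Toy tokens (hypothetical texts, for illustration only)
toks <- tokens(c(d1 = "war terror terrorism",
                 d2 = "terror peace"))
fcmat_toy <- fcm(toks, context = "document", tri = FALSE)

# Names of the features that start with "terror"
sel <- grep("^terror", colnames(fcmat_toy), value = TRUE)

# Select those columns by name, as with ordinary matrix indexing
fcmat_toy[, sel]
```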

Finally, these fcm results are easily converted into a data.frame or regular matrix for output and use in other systems, if that is what you ultimately need.
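
As a rough sketch of that conversion, assuming as.matrix() followed by data.frame() is acceptable for your purposes (the toy texts and the names sub, mat and cooc_df are illustrative only):

```r
library("quanteda")

# Toy tokens (hypothetical texts, for illustration only)
toks <- tokens(c(d1 = "war terror peace",
                 d2 = "war terror"))
fcmat_toy <- fcm(toks, context = "document", tri = FALSE)

# Subset first so that only the columns of interest are made dense
sub <- fcmat_toy[, c("war", "terror")]
mat <- as.matrix(sub)

# One row per feature, one column per selected target feature
cooc_df <- data.frame(feature = rownames(mat), mat, row.names = NULL)
cooc_df
```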

Upvotes: 0
