joep1

Reputation: 323

Coding Matrix with overlap counts in R

I am proficient in Python but a complete novice in R. I can't find an answer to this question elsewhere online, and whilst it's going to be a bit lengthy, I am hoping it will be useful to other users of the R library RQDA.

Essentially, RQDA is a qualitative research tool that is primarily used for assigning codes (themes) to text files. It's a bit like a highlighter pen that keeps count of where it has highlighted.

If you load a lot of files, you can code the text in different places with themes (e.g. a project about interviewing people working in cloth manufacturing might use "equipment", "sewing", "linen", "silk", "lighting", "lunch breaks", etc.). This enables you to count how many times different codes were used, and RQDA gives a table output as follows:

    rowid  cid  fid  codename  filename    index1  index2  CodingLength
1   1      12   1    silk      2010-01-28  409     939     530
2   2      21   1    cotton    2010-01-28  1008    1172    164
3   3      12   1    silk      2010-01-28  1173    1924    751
4   4      39   1    sewing    2010-01-28  1008    1250    751
5   5      38   1    weaving   2010-01-28  1173    1924    751
6   6      78   1    costs     2010-01-28  727     939     212
7   7      23   1    lunch     2010-01-28  1553    1788    235
8   9      7    2    lunch     2010-01-29  1001    1230    371
9   10     4    2    weaving   2010-01-29  1547    1724    135
10  11     6    2    social    2010-01-29  1001    1290    350
11  12     7    2    silk      2010-01-29  1926    2276    350
12  14     17   2    supply    2010-01-29  1926    2276    350
13  15     78   2    costs     2010-01-29  1926    2276    350
14  17     78   2    weaving   2010-01-29  1890    2106    212

codename = code the text was given (theme)

filename = filename of text (in this case, date of diary entry)

index1 = character position in file where code starts (highlighted text)

index2 = character position in file where code ends (highlighted text)

CodingLength = overall length of coded/highlighted text

What I'd like to do is to iterate over the entire table (around 1,500 rows) with the total list of codes (codename in the table above, around 100 unique codes) in order to output a two-way matrix of overlap counts between codes, for example (indicative only, with six codes):

              silk  cotton  sewing  weaving  lunch breaks  socialising
silk          *     0       0       3        2             0
cotton        0     *       5       0        0             0
sewing        0     5       *       0        0             0
weaving       3     0       0       *        0             0
lunch breaks  2     0       0       0        *             5
socialising   0     0       0       0        5             *

Therefore, in R I need a bit of code that will iterate over the code list and count the number of instances where A) filename is the same and B) there is overlap in the range between index1 and index2 (CodingLength probably not important).

Apart from the following vague hunches I am lost as to exactly how to make this work:

  1. I probably need to assign the table to a variable, e.g.:

    coding_table <- getCodingTable()

  2. I probably need to make a list of the unique codes, e.g.:

    x = c("silk","cotton","weaving","sewing","lunch" ... etc. )

  3. I need a function that does the checks

  4. I need a for-loop for the rows

  5. I need a boolean test where the range and file name are checked, e.g. any(409:939 %in% 727:939) && filename == filename
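
For illustration, the five steps above could be sketched in base R roughly as follows. This is only a sketch under some assumptions: the data frame is typed in by hand from the sample table (in a real project it would come from getCodingTable()), the names segments_overlap and overlap are made up for this example, and two segments are treated as overlapping when they share at least one character position.

```r
# Step 1: sample rows copied by hand from the table in the question
# (only the columns needed for the overlap test).
coding_table <- data.frame(
  codename = c("silk", "cotton", "silk", "sewing", "weaving", "costs", "lunch",
               "lunch", "weaving", "social", "silk", "supply", "costs", "weaving"),
  filename = rep(c("2010-01-28", "2010-01-29"), each = 7),
  index1 = c(409, 1008, 1173, 1008, 1173, 727, 1553,
             1001, 1547, 1001, 1926, 1926, 1926, 1890),
  index2 = c(939, 1172, 1924, 1250, 1924, 939, 1788,
             1230, 1724, 1290, 2276, 2276, 2276, 2106),
  stringsAsFactors = FALSE
)

# Step 2: unique code names, taken from the data rather than typed by hand.
codes <- sort(unique(coding_table$codename))

# Step 3 (hypothetical helper): TRUE when two coded segments are in the same
# file and their ranges intersect. Closed ranges [s1, e1] and [s2, e2]
# intersect exactly when s1 <= e2 and s2 <= e1.
segments_overlap <- function(a, b) {
  a$filename == b$filename && a$index1 <= b$index2 && b$index1 <= a$index2
}

# Steps 4-5: visit every pair of rows once and count hits per pair of codes.
overlap <- matrix(0L, length(codes), length(codes),
                  dimnames = list(codes, codes))
n <- nrow(coding_table)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    a <- coding_table[i, ]
    b <- coding_table[j, ]
    if (a$codename != b$codename && segments_overlap(a, b)) {
      # record the hit in both directions so the matrix stays symmetric
      overlap[a$codename, b$codename] <- overlap[a$codename, b$codename] + 1L
      overlap[b$codename, a$codename] <- overlap[b$codename, a$codename] + 1L
    }
  }
}
diag(overlap) <- NA  # a code paired with itself is not counted in this sketch
print(overlap)
```

The endpoint comparison replaces the any(409:939 %in% 727:939) idea and avoids building two integer vectors for every pair. The double loop is quadratic in the number of rows, which should still be manageable for around 1,500 rows, although a vectorised or package-based approach will be faster.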

Based on this, can anyone see a way to produce a very short solution to this? I feel like the equivalent in Python would be 10 lines maximum, but given the extra bits required in R I am completely lost as to how to do this.

Upvotes: 1

Views: 940

Answers (2)

alexis_laz

Reputation: 13122

Another approach that seems valid, as I understand your description.

Find overlaps using the "IRanges" package ("dat" is the coding table from the question):

library(IRanges)
fo = findOverlaps(IRanges(dat$index1, dat$index2))

Check whether the overlapped ranges belong to the same "filename":

i = dat$filename[queryHits(fo)] == dat$filename[subjectHits(fo)]

And, tabulate the "codename" for the overlapped "index1" and "index2" belonging to the same "filename":

table(dat$codename[queryHits(fo)[i]], dat$codename[subjectHits(fo)[i]])
#       
#          costs cotton lunch sewing silk social supply weaving
#  costs       2      0     0      0    2      0      1       1
#  cotton      0      1     0      1    0      0      0       0
#  lunch       0      0     2      0    1      1      0       1
#  sewing      0      1     0      1    1      0      0       1
#  silk        2      0     1      1    3      0      1       2
#  social      0      0     1      0    0      1      0       0
#  supply      1      0     0      0    1      0      1       1
#  weaving     1      0     1      1    2      0      1       3
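
If installing IRanges (a Bioconductor package) is not an option, the same three steps can be approximated in base R with outer(). This is a sketch rather than a drop-in replacement: "dat" is typed in by hand from the question's sample rows, the helper names (starts_before_end, hits, tab) are made up here, and unlike the code above it leaves out each row's overlap with itself.

```r
# Sample rows copied by hand from the question (assumed to match "dat").
dat <- data.frame(
  codename = c("silk", "cotton", "silk", "sewing", "weaving", "costs", "lunch",
               "lunch", "weaving", "social", "silk", "supply", "costs", "weaving"),
  filename = rep(c("2010-01-28", "2010-01-29"), each = 7),
  index1 = c(409, 1008, 1173, 1008, 1173, 727, 1553,
             1001, 1547, 1001, 1926, 1926, 1926, 1890),
  index2 = c(939, 1172, 1924, 1250, 1924, 939, 1788,
             1230, 1724, 1290, 2276, 2276, 2276, 2106),
  stringsAsFactors = FALSE
)

# Entry [i, j] is TRUE when row i starts no later than row j ends.
starts_before_end <- outer(dat$index1, dat$index2, "<=")
# Two ranges overlap when each one starts before the other ends.
ranges_overlap <- starts_before_end & t(starts_before_end)
# Only overlaps within the same file count.
same_file <- outer(dat$filename, dat$filename, "==")
# Leave out each row paired with itself.
not_self <- row(same_file) != col(same_file)

# Row/column positions of all qualifying pairs (both directions).
hits <- which(ranges_overlap & same_file & not_self, arr.ind = TRUE)

# Tabulate the code names of the overlapping pairs.
codes <- sort(unique(dat$codename))
tab <- table(factor(dat$codename[hits[, "row"]], levels = codes),
             factor(dat$codename[hits[, "col"]], levels = codes))
print(tab)
```

On the sample rows this should give the same off-diagonal counts as the table above, with zeros on the diagonal because self-matches are dropped. It builds full n-by-n matrices, so it suits tables of a few thousand rows rather than very large ones.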

Upvotes: 2

paqmo

Reputation: 3729

You can use the foverlaps function in the data.table package to create an edgelist and then turn this into a weighted adjacency matrix.

Using a combination of data.table, dplyr, and igraph, I think this gets you what you want (can't verify without data, though).

First, you convert your data frame to a data table and set the key to filename, index1, and index2. Then, foverlaps identifies entries in the same file where the ranges from index1 to index2 have any overlap. After eliminating self-overlaps, replace the ids generated by foverlaps with the corresponding codenames from the data set. This creates an edgelist. Pass this edgelist to igraph to create an igraph object and return it as an adjacency matrix.

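library(igraph); library(data.table); library(dplyr)

# Key on filename first so that overlaps are only reported within the same file
el <- setkey(setDT(coding_table), filename, index1, index2) %>%
  foverlaps(., ., type = "any", which = TRUE) %>%
  # drop pairs that carry the same code, including each row matched with itself
  .[coding_table$codename[xid] != coding_table$codename[yid]] %>%
  # swap the row ids for the corresponding code names to get an edgelist
  .[, `:=`(xid = coding_table$codename[xid], yid = coding_table$codename[yid])]

# One edge per overlap, so the adjacency matrix holds the overlap counts
m <- as.matrix(get.adjacency(graph.data.frame(el)))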

Of course, dplyr is totally optional; the piping just makes it a bit neater and avoids creating more objects in the environment.

Upvotes: 2
