Reputation: 29

Sorting and counting unknown strings in R

I have a very extensive dataset organized in rows, each carrying hundreds of characters. The original Excel file was transformed to .csv with commas used as delimiters, then imported to R. Every row is recognized as a vector, I can select specific rows, browse, etc. Here's the example of what I have:

115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09";
116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10";
117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";
118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";

As you can see, in every vector(row) there are 6 fields divided by commas. I am interested in extracting some specific unknown values from the field 5 and counting the number of occurencies of every value starting with §... sign. In fact, this data are personal records of felons found guilty by the Juvenile Court. Field 5 includes, among other unnecessary information, paragraphs from the Criminal Law. I would like to count the paragraphs and to summarize the info into the table like this:

number           Person         Paragraphs     Occurencies          
 116             MC3RIQeDk6k    §36              2
                                §250ad           1
                                §177b            1
 117             TT3Z1ralU7F    §29f             2
                                §2               3
                                §13              1
                                §671c            1

The problem is I don't know which paragraphs are going to appear in every row so I cannot search for a known criteria. I would be very thankful if someone could find any solution to this problem.

Upvotes: 1

Answers (3)

Chris Ruehlemann

Reputation: 21400

Not a trivial task but here's first go at it:

Sample data:

df <- data.frame(
  v1 = c('115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09";',
         '116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10";',
         '117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";',
         '118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";')
)

To extract the relevant values, you need to use str_extract and, if there's more than one result per string, str_extract_allas well as lookaround:

df$numb <- str_extract(df$v1, '^\\d+(?=,)')
df$pers <- str_extract(df$v1, '(?<=,")\\w+(?=")')
df$para <- lapply(lapply(str_extract_all(df$v1, '§\\w+\\b'), unique), paste0, collapse = ",")

library(dplyr)
df2 <- df %>% separate_rows(para, sep = ",")
df2$occur <- str_count(df2$v1, paste(df2$para, "\\b", sep = ""))

Result:

df2
                                                                                       v1 numb        pers   para
1             115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09";  115 RD3RW2rFoVF    §12
2             115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09";  115 RD3RW2rFoVF  §175a
3             115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09";  115 RD3RW2rFoVF §155bc
4        116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10";  116 MC3RIQeDk6k    §36
5        116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10";  116 MC3RIQeDk6k §250ad
6        116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10";  116 MC3RIQeDk6k  §177b
7  117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";  117 TT3Z1ralU7F   §29f
8  117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";  117 TT3Z1ralU7F     §2
9  117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";  117 TT3Z1ralU7F    §13
10 117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";  117 TT3Z1ralU7F  §671c
11    118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";  118 QW3SlcwgU2d   §16b
12    118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";  118 QW3SlcwgU2d    §8a
13    118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";  118 QW3SlcwgU2d   §16a
14    118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";  118 QW3SlcwgU2d    §22
   occur
1      1
2      1
3      1
4      2
5      1
6      1
7      2
8      3
9      1
10     1
11     2
12     1
13     1
14     1

Upvotes: 1

Martin Gal

Reputation: 16988

Assuming your data looks like Chris Ruehlemann's data. Using dplyr, tidyr and stringr

df %>%
  separate(col=v1, into=c("number", "Person","V3", "V4", "Information","V6"), "\"?,\"") %>%
  mutate(Paragraphs=str_split(Information, " |,")) %>%
  unnest(Paragraphs) %>%
  filter(str_detect(Paragraphs, "§")) %>%
  group_by(number, Person, Paragraphs) %>%
  summarise(Occurencies=n(), .groups="drop")

returns

# A tibble: 14 x 4
   number Person      Paragraphs Occurencies
   <chr>  <chr>       <chr>            <int>
 1 115    RD3RW2rFoVF §12                  1
 2 115    RD3RW2rFoVF §155bc               1
 3 115    RD3RW2rFoVF §175a                1
 4 116    MC3RIQeDk6k §177b                1
 5 116    MC3RIQeDk6k §250ad               1
 6 116    MC3RIQeDk6k §36                  2
 7 117    TT3Z1ralU7F §13                  1
 8 117    TT3Z1ralU7F §2                   3
 9 117    TT3Z1ralU7F §29f                 2
10 117    TT3Z1ralU7F §671c                1
11 118    QW3SlcwgU2d §16a                 1
12 118    QW3SlcwgU2d §16b                 2
13 118    QW3SlcwgU2d §22                  1
14 118    QW3SlcwgU2d §8a-c                1

Upvotes: 1

r.user.05apr

Reputation: 5456

If this is your data:

d <- data.frame(rbind(c("RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09"),
                      c("MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10")),
                stringsAsFactors = FALSE)
colnames(d)[4] <- "Paragraphs"
d
# X1 X2         X3                     Paragraphs       X5
# 1 RD3RW2rFoVF  0 1952869481     §12,§175a,@juvenile §155bc 14:18:09
# 2 MC3RIQeDk6k 82 7935904859 §36 §250ad §36 @criminal §177b 14:16:10

You can unnest the paragraphs like so:

library(tidyverse)
library(stringr)

d %>%
  mutate(Paragraphs = str_extract_all(Paragraphs, "§\\d+[a-zA-Z]*")) %>% # search for §-sign, at least one digit and zero or more letters
  unnest(Paragraphs) %>%
  group_by(X1, Paragraphs) %>% # check groups
  add_tally(name = "occurancies") %>% 
  ungroup()

# # A tibble: 7 x 6
# X1          X2    X3         Paragraphs X5       occurancies
# <chr>       <chr> <chr>      <chr>      <chr>          <int>
#   1 RD3RW2rFoVF 0     1952869481 §12        14:18:09           1
# 2 RD3RW2rFoVF 0     1952869481 §175a      14:18:09           1
# 3 RD3RW2rFoVF 0     1952869481 §155bc     14:18:09           1
# 4 MC3RIQeDk6k 82    7935904859 §36        14:16:10           2
# 5 MC3RIQeDk6k 82    7935904859 §250ad     14:16:10           1
# 6 MC3RIQeDk6k 82    7935904859 §36        14:16:10           2
# 7 MC3RIQeDk6k 82    7935904859 §177b      14:16:10           1

Upvotes: 0

Sorting and counting unknown strings in R

Answers (3)

Related Questions