Reputation: 29
I have a very extensive dataset organized in rows, each carrying hundreds of characters. The original Excel file was transformed to .csv with commas used as delimiters, then imported to R. Every row is recognized as a vector, I can select specific rows, browse, etc. Here's the example of what I have:
115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09";
116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10";
117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";
118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";
As you can see, in every vector(row) there are 6 fields divided by commas. I am interested in extracting some specific unknown values from the field 5 and counting the number of occurencies of every value starting with §... sign. In fact, this data are personal records of felons found guilty by the Juvenile Court. Field 5 includes, among other unnecessary information, paragraphs from the Criminal Law. I would like to count the paragraphs and to summarize the info into the table like this:
number Person Paragraphs Occurencies
116 MC3RIQeDk6k §36 2
§250ad 1
§177b 1
117 TT3Z1ralU7F §29f 2
§2 3
§13 1
§671c 1
The problem is I don't know which paragraphs are going to appear in every row so I cannot search for a known criteria. I would be very thankful if someone could find any solution to this problem.
Upvotes: 1
Views: 95
Reputation: 21400
Not a trivial task but here's first go at it:
Sample data:
df <- data.frame(
v1 = c('115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09";',
'116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10";',
'117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38";',
'118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25";')
)
To extract the relevant values, you need to use str_extract
and, if there's more than one result per string, str_extract_all
as well as lookaround:
df$numb <- str_extract(df$v1, '^\\d+(?=,)')
df$pers <- str_extract(df$v1, '(?<=,")\\w+(?=")')
df$para <- lapply(lapply(str_extract_all(df$v1, '§\\w+\\b'), unique), paste0, collapse = ",")
library(dplyr)
df2 <- df %>% separate_rows(para, sep = ",")
df2$occur <- str_count(df2$v1, paste(df2$para, "\\b", sep = ""))
Result:
df2
v1 numb pers para
1 115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09"; 115 RD3RW2rFoVF §12
2 115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09"; 115 RD3RW2rFoVF §175a
3 115,"RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09"; 115 RD3RW2rFoVF §155bc
4 116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10"; 116 MC3RIQeDk6k §36
5 116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10"; 116 MC3RIQeDk6k §250ad
6 116,"MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10"; 116 MC3RIQeDk6k §177b
7 117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38"; 117 TT3Z1ralU7F §29f
8 117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38"; 117 TT3Z1ralU7F §2
9 117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38"; 117 TT3Z1ralU7F §13
10 117,"TT3Z1ralU7F","1","9965718496","§29f §2 §13 @assault §2 §2 §671c §29f","14:13:38"; 117 TT3Z1ralU7F §671c
11 118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25"; 118 QW3SlcwgU2d §16b
12 118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25"; 118 QW3SlcwgU2d §8a
13 118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25"; 118 QW3SlcwgU2d §16a
14 118,"QW3SlcwgU2d","2","5972432856","@inc §16b §8a-c §16a §16b @road §22,"14:11:25"; 118 QW3SlcwgU2d §22
occur
1 1
2 1
3 1
4 2
5 1
6 1
7 2
8 3
9 1
10 1
11 2
12 1
13 1
14 1
Upvotes: 1
Reputation: 16988
Assuming your data looks like Chris Ruehlemann's data. Using dplyr
, tidyr
and stringr
df %>%
separate(col=v1, into=c("number", "Person","V3", "V4", "Information","V6"), "\"?,\"") %>%
mutate(Paragraphs=str_split(Information, " |,")) %>%
unnest(Paragraphs) %>%
filter(str_detect(Paragraphs, "§")) %>%
group_by(number, Person, Paragraphs) %>%
summarise(Occurencies=n(), .groups="drop")
returns
# A tibble: 14 x 4
number Person Paragraphs Occurencies
<chr> <chr> <chr> <int>
1 115 RD3RW2rFoVF §12 1
2 115 RD3RW2rFoVF §155bc 1
3 115 RD3RW2rFoVF §175a 1
4 116 MC3RIQeDk6k §177b 1
5 116 MC3RIQeDk6k §250ad 1
6 116 MC3RIQeDk6k §36 2
7 117 TT3Z1ralU7F §13 1
8 117 TT3Z1ralU7F §2 3
9 117 TT3Z1ralU7F §29f 2
10 117 TT3Z1ralU7F §671c 1
11 118 QW3SlcwgU2d §16a 1
12 118 QW3SlcwgU2d §16b 2
13 118 QW3SlcwgU2d §22 1
14 118 QW3SlcwgU2d §8a-c 1
Upvotes: 1
Reputation: 5456
If this is your data:
d <- data.frame(rbind(c("RD3RW2rFoVF","0","1952869481","§12,§175a,@juvenile §155bc","14:18:09"),
c("MC3RIQeDk6k","82","7935904859","§36 §250ad §36 @criminal §177b","14:16:10")),
stringsAsFactors = FALSE)
colnames(d)[4] <- "Paragraphs"
d
# X1 X2 X3 Paragraphs X5
# 1 RD3RW2rFoVF 0 1952869481 §12,§175a,@juvenile §155bc 14:18:09
# 2 MC3RIQeDk6k 82 7935904859 §36 §250ad §36 @criminal §177b 14:16:10
You can unnest the paragraphs like so:
library(tidyverse)
library(stringr)
d %>%
mutate(Paragraphs = str_extract_all(Paragraphs, "§\\d+[a-zA-Z]*")) %>% # search for §-sign, at least one digit and zero or more letters
unnest(Paragraphs) %>%
group_by(X1, Paragraphs) %>% # check groups
add_tally(name = "occurancies") %>%
ungroup()
# # A tibble: 7 x 6
# X1 X2 X3 Paragraphs X5 occurancies
# <chr> <chr> <chr> <chr> <chr> <int>
# 1 RD3RW2rFoVF 0 1952869481 §12 14:18:09 1
# 2 RD3RW2rFoVF 0 1952869481 §175a 14:18:09 1
# 3 RD3RW2rFoVF 0 1952869481 §155bc 14:18:09 1
# 4 MC3RIQeDk6k 82 7935904859 §36 14:16:10 2
# 5 MC3RIQeDk6k 82 7935904859 §250ad 14:16:10 1
# 6 MC3RIQeDk6k 82 7935904859 §36 14:16:10 2
# 7 MC3RIQeDk6k 82 7935904859 §177b 14:16:10 1
Upvotes: 0