Reputation: 542
I have a large text file (50,000 rows) from which I'm trying to remove duplicates and extract the unique words.
The rows/strings in the CSV vary such that three lines could look like the following:
I like cars
Ford
Cars go fast
I would like to first separate each row/string and then combine them so I would get the following list from above:
I
like
cars
Ford
Cars
go
fast
Once that list is complete it should be easy to change the cases of each word and then remove duplicates leaving a unique list of all words in the document.
Some rows are whole paragraphs, so Excel just can't handle the job. I'm guessing paste
and paste(unique())
may be useful, but I'm having trouble using read.csv
to get the words from the document in the desired format.
These paragraphs may include punctuation, numbers, and random characters like @, so I may need to transform the strings first.
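Putting those steps together, here is a minimal base-R sketch of the whole pipeline. The sample vector `lines` stands in for the rows read from the CSV, and the cleaning regex (keep only letters, digits, and spaces) is one assumption about what "random characters" means:

```r
# Sample rows standing in for the file's contents (for illustration only)
lines <- c("I like cars!", "Ford", "Cars go fast, @speed 100")

# Strip everything except letters, digits, and spaces, then lower-case
clean <- tolower(gsub("[^[:alnum:] ]", "", lines))

# Split on runs of whitespace, flatten, and deduplicate
words <- unique(unlist(strsplit(clean, "\\s+")))
words <- words[words != ""]  # drop empty strings left by stray spaces
words
# "i" "like" "cars" "ford" "go" "fast" "speed" "100"
```

For the real file, `lines <- readLines("yourfile.csv")` (filename assumed) would replace the sample vector.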
EDIT:
Three methods work, but they give different results; here is a link to the csv. Any insight into why the results differ would be appreciated.
Upvotes: 1
Views: 580
Reputation: 11
In order to split the string, have a look here:
"How to Split Strings in R" By Andrie de Vries and Joris Meys from R For Dummies http://www.dummies.com/how-to/content/how-to-split-strings-in-r.html
To split this text at the word boundaries (spaces), you can use strsplit() as follows:
strsplit(yourtext, " ") # Split using spaces as boundaries
To find the unique elements, flatten the list of word vectors with unlist() before calling unique() (applied directly to the list, unique() would compare whole rows rather than individual words):
unique(unlist(strsplit(yourtext, " ")))
That way there won't be any duplicates left in the result.
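Applied to the three sample rows from the question (case left untouched), this gives:

```r
yourtext <- c("I like cars", "Ford", "Cars go fast")

# strsplit() returns a list of word vectors; unlist() flattens it first
unique(unlist(strsplit(yourtext, " ")))
# "I" "like" "cars" "Ford" "Cars" "go" "fast"
```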
Upvotes: 1
Reputation: 887231
We can use scan
df1 <- data.frame(words= unique(scan(text=as.character(df$s), what="", sep=" ")))
df1
# words
#1 I
#2 like
#3 cars
#4 Ford
#5 Cars
#6 go
#7 fast
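Since the question also wants case-insensitive results, one possible variation (an assumption, not part of the original answer) is to lower-case the column before scanning:

```r
df <- data.frame(s = c("I like cars", "Ford", "Cars go fast"),
                 stringsAsFactors = FALSE)

# Lower-case first so "cars" and "Cars" collapse to one entry
w <- scan(text = tolower(df$s), what = "", sep = " ", quiet = TRUE)
unique(w)
# "i" "like" "cars" "ford" "go" "fast"
```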
Or a faster approach would be
library(stringi)
data.frame(words = unique(unlist(stri_extract_all(df$s, regex="\\S+"))))
Upvotes: 2
Reputation: 1538
I would put all the words into a character vector, using the stringr package for convenience, like this:
tempdata <- read.csv("temp.csv", sep = ",", stringsAsFactors = FALSE, header = FALSE)
library(stringr)
listrows <- str_split(tempdata$V1,pattern=" ")
allwords <- unlist(listrows)
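From there, the finishing step the question describes (lower-case, then deduplicate) could look like this; `allwords` is rebuilt from the sample rows so the snippet is self-contained:

```r
allwords <- c("I", "like", "cars", "Ford", "Cars", "go", "fast")  # as produced above

# Normalize case before deduplicating so "cars"/"Cars" count once
uniquewords <- unique(tolower(allwords))
uniquewords
# "i" "like" "cars" "ford" "go" "fast"
```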
Upvotes: 1
Reputation: 24198
Or using cSplit() from splitstackshape:
library(splitstackshape)
cSplit(df, 1, sep = " ", direction = "long")
# V1
#1: I
#2: like
#3: cars
#4: Ford
#5: Cars
#6: go
#7: fast
Upvotes: 1
Reputation: 10483
Assuming you read the original data into a data frame that looks like this:
df <- data.frame(s = c('I like cars', 'Ford', 'Cars go fast'), stringsAsFactors = FALSE)
df
s
1 I like cars
2 Ford
3 Cars go fast
You can create your new result data frame as follows:
newdf <- data.frame(words = unlist(strsplit(df$s, ' ')))
newdf
words
1 I
2 like
3 cars
4 Ford
5 Cars
6 go
7 fast
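If a case-insensitive unique list is the end goal, the same approach can be extended; this follow-up is a sketch, not part of the original answer:

```r
df <- data.frame(s = c("I like cars", "Ford", "Cars go fast"),
                 stringsAsFactors = FALSE)

# Split, flatten, lower-case, then deduplicate in one expression
unique(tolower(unlist(strsplit(df$s, " "))))
# "i" "like" "cars" "ford" "go" "fast"
```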
Upvotes: 2