Reputation: 542
I have a large text file (50,000 rows) from which I'm trying to remove duplicates and extract the unique words.
The rows/strings in the CSV vary such that three lines could look like the following:
I like cars
Ford
Cars go fast
I would like to first separate each row/string and then combine them so I would get the following list from above:
I
like
cars
Ford
Cars
go
fast
Once that list is complete it should be easy to change the cases of each word and then remove duplicates leaving a unique list of all words in the document.
Some rows are whole paragraphs, so Excel just can't handle the job. I'm guessing paste
and paste(unique())
may be useful, but I'm having trouble using read.csv
to get the words from the document in the desired format.
These paragraphs may include punctuation, numbers, and random characters like @, so I may need to transform the strings first.
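Putting those steps together, here is a minimal base-R sketch of the whole pipeline. The sample vector `lines` stands in for the rows read from the CSV, and the cleaning regex (keep only letters, digits, and spaces) is one assumption about what "random characters" means:

```r
# Sample rows standing in for the file's contents (for illustration only)
lines <- c("I like cars!", "Ford", "Cars go fast, @speed 100")

# Strip everything except letters, digits, and spaces, then lower-case
clean <- tolower(gsub("[^[:alnum:] ]", "", lines))

# Split on runs of whitespace, flatten, and deduplicate
words <- unique(unlist(strsplit(clean, "\\s+")))
words <- words[words != ""]  # drop empty strings left by stray spaces
words
# "i" "like" "cars" "ford" "go" "fast" "speed" "100"
```

For the real file, `lines <- readLines("yourfile.csv")` (filename assumed) would replace the sample vector.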
EDIT:
Three methods work, but they give different results; here is a link to the csv. Any insight into why the results differ would be appreciated.
Upvotes: 1
Views: 580
Reputation: 11
In order to split the string, have a look here:
"How to Split Strings in R" By Andrie de Vries and Joris Meys from R For Dummies http://www.dummies.com/how-to/content/how-to-split-strings-in-r.html
To split this text at the word boundaries (spaces), you can use strsplit() as follows:
strsplit(yourtext, " ") # Split using spaces as boundaries
To find the unique elements, flatten the list of word vectors with unlist() before calling unique() (applied directly to the list, unique() would compare whole rows rather than individual words):
unique(unlist(strsplit(yourtext, " ")))
That way there won't be any duplicates left in the result.
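Applied to the three sample rows from the question (case left untouched), this gives:

```r
yourtext <- c("I like cars", "Ford", "Cars go fast")

# strsplit() returns a list of word vectors; unlist() flattens it first
unique(unlist(strsplit(yourtext, " ")))
# "I" "like" "cars" "Ford" "Cars" "go" "fast"
```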
Upvotes: 1
Reputation: 887231
We can use scan
df1 <- data.frame(words= unique(scan(text=as.character(df$s), what="", sep=" ")))
df1
# words
#1 I
#2 like
#3 cars
#4 Ford
#5 Cars
#6 go
#7 fast
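Since the question also wants case-insensitive results, one possible variation (an assumption, not part of the original answer) is to lower-case the column before scanning:

```r
df <- data.frame(s = c("I like cars", "Ford", "Cars go fast"),
                 stringsAsFactors = FALSE)

# Lower-case first so "cars" and "Cars" collapse to one entry
w <- scan(text = tolower(df$s), what = "", sep = " ", quiet = TRUE)
unique(w)
# "i" "like" "cars" "ford" "go" "fast"
```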
Or a faster approach would be
library(stringi)
data.frame(words = unique(unlist(stri_extract_all(df$s, regex="\\S+"))))
Upvotes: 2
Reputation: 1538
I would put all the words into a character vector, using the stringr package for convenience, like this:
tempdata <- read.csv("temp.csv", sep = ",", stringsAsFactors = FALSE, header = FALSE)
library(stringr)
listrows <- str_split(tempdata$V1,pattern=" ")
allwords <- unlist(listrows)
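From there, the finishing step the question describes (lower-case, then deduplicate) could look like this; `allwords` is rebuilt from the sample rows so the snippet is self-contained:

```r
allwords <- c("I", "like", "cars", "Ford", "Cars", "go", "fast")  # as produced above

# Normalize case before deduplicating so "cars"/"Cars" count once
uniquewords <- unique(tolower(allwords))
uniquewords
# "i" "like" "cars" "ford" "go" "fast"
```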
Upvotes: 1
Reputation: 24198
Or using cSplit() from splitstackshape:
library(splitstackshape)
cSplit(df, 1, sep = " ", direction = "long")
# V1
#1: I
#2: like
#3: cars
#4: Ford
#5: Cars
#6: go
#7: fast
Upvotes: 1
Reputation: 10483
Assuming you read the original data into a data frame that looks like this:
df <- data.frame(s = c('I like cars', 'Ford', 'Cars go fast'), stringsAsFactors = FALSE)
df
s
1 I like cars
2 Ford
3 Cars go fast
You can create your new result data frame as follows:
newdf <- data.frame(words = unlist(strsplit(df$s, ' ')))
newdf
words
1 I
2 like
3 cars
4 Ford
5 Cars
6 go
7 fast
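If a case-insensitive unique list is the end goal, the same approach can be extended; this follow-up is a sketch, not part of the original answer:

```r
df <- data.frame(s = c("I like cars", "Ford", "Cars go fast"),
                 stringsAsFactors = FALSE)

# Split, flatten, lower-case, then deduplicate in one expression
unique(tolower(unlist(strsplit(df$s, " "))))
# "i" "like" "cars" "ford" "go" "fast"
```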
Upvotes: 2