academic.user

Reputation: 679

Read CSV with Unicode in R

I have the same problem as explained here. The only difference is that my CSV file contains non-English strings, and I couldn't find a solution for that case. When I read the CSV file without specifying an encoding, I get no error, but the data is changed to:

network=read.csv("graph1.csv",header=TRUE)

  اشپیل(60*4)

If I run read.csv with fileEncoding, I get these warnings and the result is empty:

 network=read.csv("graph1.csv",fileEncoding="UTF-8",header=TRUE)
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'graph1.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'graph1.csv'

 network[1]
[1] X.
<0 rows> (or 0-length row.names)

System info:

Windows Server 2008
R 3.1.2

Sample file:

node1,node2,weight
ورق800*750*6,ورق 1350*1230*6mm,0.600000024
ورق900*1200*6,ورق 1350*1230*6mm,0.600000024
ورق76*173,ورق 1350*1230*6mm,0.600000024
ورق76*345,ورق 1350*1230*6mm,0.600000024
ورق800*200*4,ورق 1350*1230*6mm,0.600000024

Upvotes: 3

Views: 7480

Answers (2)

Konrad Rudolph

Reputation: 546093

The following should work – mind you, I can’t test it since I don’t have Windows (and Windows, Unicode and R simply do not mix):

x = read.csv('graph1.csv', fileEncoding = '', stringsAsFactors = FALSE)

At this point, x is gibberish, since it was read as-is, without parsing the byte data into an encoding. We should be able to verify this:

Encoding(x[1, 1])
# [1] "unknown"

Now we tell R to treat it as UTF-8:

x = as.data.frame(lapply(x, iconv, from = 'UTF-8', to = 'UTF-8'),
                  stringsAsFactors = FALSE)
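
To see what this step does in isolation, here is a minimal sketch on a single string rather than the whole data frame. The byte values are the UTF-8 encoding of the first three letters in the sample file, and the variable names are made up for illustration.

```r
# UTF-8 bytes for the three letters U+0648 U+0631 U+0642
raw_bytes <- as.raw(c(0xd9, 0x88, 0xd8, 0xb1, 0xd9, 0x82))

# rawToChar() yields a string with no declared encoding
unmarked <- rawToChar(raw_bytes)
Encoding(unmarked)

# Converting from UTF-8 to UTF-8 keeps the bytes but marks the result as UTF-8
marked <- iconv(unmarked, from = "UTF-8", to = "UTF-8")
Encoding(marked)

# The underlying bytes are the same before and after
identical(charToRaw(unmarked), charToRaw(marked))
```

The point of the sketch is that the conversion changes only the encoding mark, not the stored bytes.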

These two steps can be compressed into one by using encoding instead of fileEncoding as the argument to read.csv:

x = read.csv('graph1.csv', encoding = 'UTF-8', stringsAsFactors = FALSE)

In either case, roughly the same process takes place.
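
As a rough, self-contained illustration of the one-step variant, the sketch below writes a one-row UTF-8 CSV to a temporary file and reads it back. The temporary file is a hypothetical stand-in for graph1.csv, and the row is shortened from the sample data.

```r
# Build a small CSV in memory; \u escapes avoid depending on the script's own encoding
csv_lines <- c(
  "node1,node2,weight",
  "\u0648\u0631\u0642800*750*6,\u0648\u0631\u0642 1350*1230*6mm,0.6"
)

# Write the lines as raw UTF-8 bytes, with no re-encoding on output
tmp <- tempfile(fileext = ".csv")
writeLines(enc2utf8(csv_lines), tmp, useBytes = TRUE)

# encoding = marks the strings that were read as UTF-8; the file is not re-encoded
y <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(y[1, 1])

unlink(tmp)
```

By contrast, fileEncoding asks R to convert the file's contents to the native encoding while reading, which is presumably where the warnings in the question come from when the native code page cannot represent the characters.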

At this point, x still appears as gibberish, since your terminal on Windows presumably does not support a Unicode code page which R understands. In fact, when running the code with a non-UTF-8 code page on Mac, I get the following output now:

x[1, 1]
# [1] "<U+0648><U+0631><U+0642>800*750*6"

However, at least the encoding is now correctly set, and the bytes are parsed:

Encoding(x[1, 1])
# [1] "UTF-8"

And if you pass the data to a device or program which speaks UTF-8, it should appear correctly. For instance, using the data as labels in a plot command should work.

plot.new()
text(0.5, seq(0, 1, along.with = x[, 1]), x[, 1])


Upvotes: 2

Colonel Beauvel

Reputation: 31181

I tried this with your input:

> read.csv("graph1.csv", encoding="UTF-8")
                      X.U.FEFF.node1                                  node2 weight
1  <U+0648><U+0631><U+0642>800*750*6 <U+0648><U+0631><U+0642> 1350*1230*6mm    0.6
2 <U+0648><U+0631><U+0642>900*1200*6 <U+0648><U+0631><U+0642> 1350*1230*6mm    0.6
3     <U+0648><U+0631><U+0642>76*173 <U+0648><U+0631><U+0642> 1350*1230*6mm    0.6
4     <U+0648><U+0631><U+0642>76*345 <U+0648><U+0631><U+0642> 1350*1230*6mm    0.6
5  <U+0648><U+0631><U+0642>800*200*4 <U+0648><U+0631><U+0642> 1350*1230*6mm    0.6
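
The first column name, X.U.FEFF.node1, suggests the file starts with a byte order mark (U+FEFF) that ended up in the header. One possible cleanup, sketched here with a hypothetical helper that is not part of R, is to strip that prefix from the column names after reading:

```r
# Hypothetical helper: drop the mangled byte-order-mark prefix from column names
strip_bom_prefix <- function(column_names) {
  sub("^X\\.U\\.FEFF\\.", "", column_names)
}

# Column names as they appear in the output above
cols <- c("X.U.FEFF.node1", "node2", "weight")
strip_bom_prefix(cols)
```

On a real data frame this would be applied as names(x) <- strip_bom_prefix(names(x)). R also accepts fileEncoding = "UTF-8-BOM" when reading, which drops the mark, but since fileEncoding re-encodes to the native encoding it may run into the same warnings as in the question.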

Upvotes: 2
