Reputation: 783
I have a JSON file containing country names, languages, and so on. I want to strip it down to just the information I need. For example, I would like to turn
[{
"name": {
"common": "Afghanistan",
"official": "Islamic Republic of Afghanistan",
"native": {
"common": "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646",
"official": "\u062f \u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646 \u0627\u0633\u0644\u0627\u0645\u064a \u062c\u0645\u0647\u0648\u0631\u06cc\u062a"
}
},
"tld": [".af"],
"cca2": "AF",
"ccn3": "004",
"cca3": "AFG",
"currency": ["AFN"],
"callingCode": ["93"],
"capital": "Kabul",
"altSpellings": ["AF", "Af\u0121\u0101nist\u0101n"],
"relevance": "0",
"region": "Asia",
"subregion": "Southern Asia",
"nativeLanguage": "pus",
"languages": {
"prs": "Dari",
"pus": "Pashto",
"tuk": "Turkmen"
},
"translations": {
"cym": "Affganistan",
"deu": "Afghanistan",
"fra": "Afghanistan",
"hrv": "Afganistan",
"ita": "Afghanistan",
"jpn": "\u30a2\u30d5\u30ac\u30cb\u30b9\u30bf\u30f3",
"nld": "Afghanistan",
"rus": "\u0410\u0444\u0433\u0430\u043d\u0438\u0441\u0442\u0430\u043d",
"spa": "Afganist\u00e1n"
},
"latlng": [33, 65],
"demonym": "Afghan",
"borders": ["IRN", "PAK", "TKM", "UZB", "TJK", "CHN"],
"area": 652230
}, ...
into
[{
"name": {
"common": "Afghanistan",
"native": {
"common": "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646"
}
},
"cca2": "AF"
}, ...
But when I try, I get
[{
"name": {
"common": "Afghanistan",
"native": {
"common": "?????????" <-- NOT WHAT I WANT
}
},
"cca2": "AF"
},
Here is the relevant code I used to strip out what I don't want:
byte[] encoded = Files.readAllBytes(Paths.get("countries.json"));
String JSONString = new String(encoded, Charset.forName("US-ASCII"));
...
Writer writer = new OutputStreamWriter(new FileOutputStream("countriesBetter.json"), "US-ASCII");
writer.write(javaObject.toString());
writer.close();
I cannot figure out why it turns the text into question marks. I have tried several character sets to no avail. When I use UTF-8, I get ا�غانستان
Please help me. Thank you.
Upvotes: 0
Views: 146
Reputation: 66
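\u0627 is a Unicode escape, not an ASCII character. Once the JSON is parsed, it becomes an Arabic letter, and Arabic characters cannot be represented in ASCII, so the writer substitutes a ? for each one. For the differences between the UTF formats, see "Difference between UTF-8 and UTF-16?".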
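If the output file has to stay pure ASCII, as the desired output in the question suggests, one option is to re-escape every non-ASCII character as \uXXXX before writing. The following is only a sketch under that assumption; the class and method names are hypothetical, and many JSON libraries offer their own setting for this.

```java
public class AsciiJsonEscaper {

    // Replace every char outside 7-bit ASCII with its six-character JSON escape
    // (backslash, 'u', four hex digits). Working on UTF-16 chars means characters
    // outside the BMP come out as surrogate-pair escapes, which JSON accepts.
    public static String escapeNonAscii(String json) {
        StringBuilder sb = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below holds three real Arabic letters after compilation.
        String json = "{\"common\": \"\u0627\u0641\u063a\"}";
        System.out.println(escapeNonAscii(json));
    }
}
```

Because non-ASCII characters in valid JSON only occur inside strings, applying this to the whole serialized text (for example the result of javaObject.toString()) should leave the structure intact, and the result can then be written with US-ASCII without losing anything.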
When you write it as UTF-8, you need to read it in the same encoding so that the viewer (for example Notepad) knows how to display the bytes it has. If you read it back into Java using that encoding, it will be unaltered.
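To illustrate the point about matching encodings, here is a small in-memory sketch. It encodes a string through an OutputStreamWriter, as in the question, and decodes the bytes again. The class and method names are made up for the example, and it writes to a byte array rather than to countriesBetter.json.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encode text through an OutputStreamWriter into an in-memory buffer.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Write with one charset, then decode the resulting bytes with another.
    public static String roundTrip(String text, Charset writeAs, Charset readAs) {
        return new String(encode(text, writeAs), readAs);
    }

    public static void main(String[] args) {
        // The native common name from the question (nine Arabic letters).
        String name = "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646";

        // US-ASCII cannot encode these letters, so each becomes '?'.
        System.out.println(roundTrip(name, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII));

        // UTF-8 on both sides returns the original string.
        System.out.println(name.equals(roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
    }
}
```

Decoding the UTF-8 bytes with a different charset (for instance a viewer that assumes a legacy code page) is what typically produces garbled text like the sample in the question.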
Upvotes: 1
Reputation: 394
You will need to change the console encoding to see this.
Go to Run > Run Configurations.
A pop-up will open. Select the Common tab. In the Encoding section, select Other and choose UTF-8 from the dropdown.
Now run the program. I got the result below:
[ {
"name" : {
"common" : "Afghanistan",
"official" : "Islamic Republic of Afghanistan",
"natives" : {
"common" : "افغانستان",
"official" : "د افغانستان اسلامي جمهوریت"
}
},
"tld" : [ ".af" ],
"cca2" : "AF",
"ccn3" : "004",
"cca3" : "AFG",
"currency" : [ "AFN" ],
"callingCode" : [ "93" ],
"capital" : "Kabul",
"altSpellings" : [ "AF", "Afġānistān" ],
"relevance" : "0",
"region" : "Asia",
"subregion" : "Southern Asia",
"nativeLanguage" : "pus",
"languages" : {
"prs" : "Dari",
"pus" : "Pashto",
"tuk" : "Turkmen"
},
"translations" : {
"cym" : "Affganistan",
"deu" : "Afghanistan",
"fra" : "Afghanistan",
"hrv" : "Afganistan",
"ita" : "Afghanistan",
"jpn" : "アフガニスタン",
"nld" : "Afghanistan",
"rus" : "Афганистан",
"spa" : "Afganistán"
},
"latlng" : [ 33, 65 ],
"demonym" : "Afghan",
"borders" : [ "IRN", "PAK", "TKM", "UZB", "TJK", "CHN" ],
"area" : 652230
} ]
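As a possible complement to the IDE setting, the program itself can print through a PrintStream that encodes as UTF-8. This is a sketch only: the helper name is hypothetical, and the console or terminal still has to interpret the bytes as UTF-8 for the text to display correctly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Wrap any OutputStream in an auto-flushing PrintStream that encodes as UTF-8.
    public static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // Prints the native common name from the question as UTF-8 bytes.
        out.println("\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646");
    }
}
```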
Upvotes: 0