LDG

Reputation: 41

UTF-8 encoding problem when merging txt files in PowerShell

I need to merge all txt files in a certain folder on my computer. There are hundreds of them and they all have different names, so any code that requires typing the file names manually does not work for me. The files are UTF-8 encoded and contain emojis and characters from different scripts (such as Cyrillic) as well as accented characters (e.g. é, ü, à). A fellow Stack Overflow user was kind enough to give me the following code to run in PowerShell:

(gc *.txt) | out-file newfile.txt -encoding utf8

It works wonderfully for merging the files. However, it gives me a txt file encoded as "UTF-8 with BOM" instead of plain "UTF-8". Furthermore, all emojis and special characters have been replaced by others, such as "Ã¼" instead of "ü". It is very important for what I am doing that these emojis and special characters are preserved.

Could someone help me with tweaking this code (or suggesting a different one) so it gives me a merged txt-file with "UTF-8"-encoding that still contains all of the special characters? Please keep in mind that I am a layperson.

Thank you so much in advance for your help and kind regards!

Upvotes: 2

Views: 4351

Answers (2)

js2010

Reputation: 27566

In PowerShell 5, gc (Get-Content) can't read UTF-8 input files that have no BOM correctly unless you pass the -Encoding parameter:

(gc -Encoding Utf8 *.txt) | out-file newfile.txt -encoding utf8
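
To illustrate why the parameter matters, here is a minimal sketch of what happens when UTF-8 bytes are decoded with an ANSI code page. It assumes the system's ANSI code page is Windows-1252, which is typical for Western European locales but not guaranteed; the variable names are only illustrative.

```powershell
# Assumption: the ANSI code page is Windows-1252 (varies with the system locale).
$utf8 = [System.Text.Encoding]::UTF8
$ansi = [System.Text.Encoding]::GetEncoding(1252)

# U+00FC is the letter u with diaeresis; UTF-8 stores it as two bytes, 0xC3 0xBC.
$original = [string][char]0x00FC
$bytes    = $utf8.GetBytes($original)

# Decoding the same bytes as Windows-1252 turns each byte into its own character,
# giving U+00C3 followed by U+00BC instead of the single original letter.
$garbled  = $ansi.GetString($bytes)

# Decoding them as UTF-8 restores the original character.
$correct  = $utf8.GetString($bytes)
```

Once the wrong characters are in memory, writing them out with -encoding utf8 only preserves the damage, so the fix has to happen on the reading side.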

Upvotes: 1

Theo

Reputation: 61208

In PowerShell < 6.0, the Out-File cmdlet does not have a Utf8NoBOM encoding.
You can however write Utf8 text files without BOM using .NET:

Common for all methods below

$rootFolder = 'D:\test'  # the path where the textfiles to merge can be found
$outFile    = Join-Path -Path $rootFolder -ChildPath 'newfile.txt'

Method 1

# create a Utf8NoBOM encoding object
$utf8NoBom = New-Object System.Text.UTF8Encoding $false  # $false means NoBOM
Get-Content -Path "$rootFolder\*.txt" -Encoding UTF8 -Raw | ForEach-Object {
    [System.IO.File]::AppendAllText($outFile, $_, $utf8NoBom)
}

Method 2

# create a Utf8NoBOM encoding object
$utf8NoBom = New-Object System.Text.UTF8Encoding $false  # $false means NoBOM
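# skip the output file itself, because it is created in the same folder
Get-ChildItem -Path $rootFolder -Filter '*.txt' -File | Where-Object { $_.FullName -ne $outFile } | ForEach-Object {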
    [System.IO.File]::AppendAllLines($outFile, [string[]]($_ | Get-Content -Encoding UTF8), $utf8NoBom)
}

Method 3

# Create a StreamWriter object which by default writes Utf8 without a BOM.
$sw = New-Object System.IO.StreamWriter $outFile, $true  # $true is for Append
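# skip the output file itself, because the StreamWriter has already created it in the same folder
Get-ChildItem -Path $rootFolder -Filter '*.txt' -File | Where-Object { $_.FullName -ne $outFile } | ForEach-Object {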
    Get-Content -Path $_.FullName -Encoding UTF8 | ForEach-Object {
        $sw.WriteLine($_)
    }
}
$sw.Dispose()
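
To confirm that the merged file really has no BOM, one option is to inspect its first three bytes. The following is a rough sketch; Test-Utf8Bom is a hypothetical helper name made up for this illustration, not a built-in cmdlet.

```powershell
# Hypothetical helper: returns $true when the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string]$Path)

    # Use a full path here: .NET resolves relative paths against the process
    # working directory, which can differ from the current PowerShell location.
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        $buffer = New-Object byte[] 3
        $read   = $stream.Read($buffer, 0, 3)
    }
    finally {
        $stream.Dispose()
    }

    return ($read -eq 3 -and $buffer[0] -eq 0xEF -and $buffer[1] -eq 0xBB -and $buffer[2] -eq 0xBF)
}
```

Calling Test-Utf8Bom -Path $outFile after any of the methods above should return False if the file was written without a BOM.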

Upvotes: 4
