Einar
Einar

Reputation: 15

Delete duplicate rows and keep the newest entry

I download a .csv file daily which I need to clean up a bit before using it in another software.

The file have data about controls conducted at restaurants, where new controls results in a new row in the file, which means each restaurant can have several entries - I only need the newest.

Here's a portion of the .csv file(only included 4 of the columns):

Headers

orgnummer;navn;dato;total_karakter

4 random rows

985129576;Økern Sushi;21092016;1
785423684;Å cafe;09072017;2
458792365;Varangerkroa;01012018;0
985129576;Økern Sushi;05052018;0

orgnummer and navn will be unique for each restaurant.

In my example the script should remove the top entry of Økern Sushi.

My script so far:

$temp = Import-Csv 'C:\Users\Downloads\tilsynPS.csv' -Delimiter ';'

#change date format to desired format

foreach($row in $temp) {
    $year = $row.dato.Substring($row.dato.Length - 4, 4)
    $month = $row.dato.Substring($row.dato.Length - 6, 2)
    $day = $row.dato.Substring(0, $row.dato.Length - 6)
    $date = New-Object System.DateTime $year,$month,$day
    $row.dato = $date
}

#Here's my attempt at sorting and deleting old records:

sort orgnummer, dato -Descending

$temp[0]

for ($i=1; $i -le $temp.length -1; $i++)  {
 if ($temp[$i]."orgnummer" -eq $temp[$i-1]."orgnummer"){
   continue
 }
 else {$temp[$i]}

}

#export to csv

$temp | Export-Csv -Encoding UTF8 -NoTypeInformation -path C:\Users\Downloads\tilsynPS_redigert.csv

The attempt at removing old records is based on this post: http://community.idera.com/powershell/ask_the_experts/f/powershell_for_windows-12/8073/csv-remove-unique-records-based-on-columns-and-last-date

have also tried with this version of the code:

sort -Property @{Expression="dato";Descending=$true},
@{Expression="navn";Descending=$false}

$temp[0]

for ($i=1; $i -le $temp.length -1; $i++)  {
 if ($temp[$i]."navn" -eq $temp[$i-1]."navn"){
   continue
 }
 else {$temp[$i]}

}

This is my first script in PowerShell, any help is much appreciated. Have noticed similar questions, but none that are in PowerShell which is what I'm working with here.

Upvotes: 1

Views: 811

Answers (1)

Theo
Theo

Reputation: 61068

I think you almost had it. This solution accumulates the results in a new array and exports that. Rows are only added if the orgnummer is different or the date they hold is newer.

Please note that for testing I had to change the path to the import and export files.

$temp = Import-Csv "$PSScriptRoot\tilsynPS.csv" -Delimiter ';'

#change date format to desired format
foreach($row in $temp) {
    $year = $row.dato.Substring($row.dato.Length - 4, 4)
    $month = $row.dato.Substring($row.dato.Length - 6, 2)
    $day = $row.dato.Substring(0, $row.dato.Length - 6)
    $row.dato = New-Object System.DateTime $year,$month,$day
}

# sort on orgnummer and date descending:
$temp = $temp | Sort-Object orgnummer, dato -Descending

# create a new array for the output and add the first entry in it already
$newData = @($temp[0])
# for all other rows in the array, check if they should be added or not
for ($i = 1; $i -le $temp.Length -1; $i++)  {
    if ($temp[$i]."orgnummer" -eq $temp[$i-1]."orgnummer" -and $temp[$i]."dato" -le $temp[$i-1]."dato") {
        continue
    }
    else { 
        $newData += $temp[$i] 
    }
}

#export to csv
$newData | Export-Csv -Encoding UTF8 -NoTypeInformation -path "$PSScriptRoot\tilsynPS_redigert.csv"

Upvotes: 1

Related Questions