zeplike
zeplike

Reputation: 21

Data manipulation\deduplication in powershell

Hey I am looking to deduplicate some data and combine columns from a CSV. Can't get my head around how to do it. Here is a sample of the data I am working with:

cmmc,stig,descr
AC.1.001,SV-205663r569188_rule,The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
AC.1.001,SV-205667r569188_rule,Inappropriate granting of user rights can provide system administrative and other high-level capabilities.
AC.1.002,SV-205663r569188_rule,The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
AC.1.002,SV-205665r569188_rule,Enterprise Domain Controllers groups on domain controllers.

I am pretty close to the data I am looking for but struggling to add the |<value of 'descr'> after the item in the second column:

Here's my script:

Import-CSV '.\input.csv' | Group-Object 'cmmc' |
    ForEach-Object {
        [PsCustomObject]@{
            cmmc = $_.name
            stig = $_.group.stig -Join '
'
                    }
    } | Export-Csv '.\output.csv' -NoTypeInformation

The output looks like this (formatted for readability, column names omitted):

AC1.001    SV-205663r569188_rule
           SV-205665r569188_rule
AC1.002    SV-205663r569188_rule
           SV-205665r569188_rule

But I am looking for this:

AC.1.001 SV-205663r569188_rule|The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
         SV-205667r569188_rule|Inappropriate granting of user rights can provide system administrative and other high-level capabilities.
AC.1.002 SV-205663r569188_rule|The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
         SV-205665r569188_rule|Enterprise Domain Controllers groups on domain controllers.

Upvotes: 1

Views: 105

Answers (1)

mklement0
mklement0

Reputation: 438323

Use the following, which makes use of calculated properties in combination with the Select-Object cmdlet applied to the results from your Group-Object call:

Import-Csv .\input.csv | 
  Group-Object cmmc |
    Select-Object @{ Name = 'cmmc'; e = 'Name' },
      @{ Name = 'stig_descr'; e = { 
          [array] $stigs, [array] $descrs, $i = $_.Group.stig, $_.Group.descr, 0
          $sigs.ForEach( { $stigs[$i], $descrs[$i++] -join '|' }) -join "`n" 
        } 
      } | Export-Csv -NoTypeInformation -Encoding utf8 .\output.csv

Note:
• The [array] type constraints for $stigs and $descrs are needed to handle the case where a group comprises only one record, in which case $_.Group.sig and $_.Group.descr, due to the behavior of member-access enumeration, return only a single string rater than a single-element array; without the [array] cast, indexing (e.g. [$i]) would then be performed on [string] instances, which would return the a single character at that position from the string.
• In the Export-Csv call, adjust -Encoding as needed. BOM-less UTF-8 is now the default in PowerShell (Core) 7+, and -NoTypeInformation is no longer required there.

The resulting file has the following content, showing the use of column-internal newlines (which are protected by the value as a whole being enclosed in "..."):

"cmmc","stig_descr"
"AC.1.001","SV-205663r569188_rule|The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
SV-205667r569188_rule|Inappropriate granting of user rights can provide system administrative and other high-level capabilities."
"AC.1.002","SV-205663r569188_rule|The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
SV-205665r569188_rule|Enterprise Domain Controllers groups on domain controllers."

To visualize that this yields the desired data, you can re-import the resulting file and pipe it to Format-Table with the -Wrap switch:

PS> Import-Csv .\output.csv | Format-Table -Wrap

cmmc     stig_descr
----     ---------
AC.1.001 SV-205663r569188_rule|The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
         SV-205667r569188_rule|Inappropriate granting of user rights can provide system administrative and other high-level capabilities.
AC.1.002 SV-205663r569188_rule|The ability to set access permissions and auditing is critical to maintaining the security and proper access controls of a system. To support this volumes must be formatted using a file system that supports NTFS attributes.
         SV-205665r569188_rule|Enterprise Domain Controllers groups on domain controllers.

Note that -Wrap respects the property-internal newlines, but additionally breaks individual lines into multiple ones if they're too wide for the console window.

Upvotes: 1

Related Questions