CountryBoy_71
CountryBoy_71

Reputation: 13

Messy txt file data extraction - Power Query

Good day...I've been handed a very messy dataset to clean...my first thought was PQ. Below is a snippet of the 10k+ lines in the actual .txt file, it simply repeats for each user ('C:'), separated by the ----------. As a start, the list is the information that, at the moment, I need to extract. When first importing into PQ, it is one column; tab separated.

  1. C:
  2. Subscriber Name
  3. Current Charges (all fields)
  4. Other Charges and Credit
  5. Other Fees
  6. Taxes
"Company-Name." Client No:  "5780859"
" " Purchase Order No:
Invoice Date:   30-Sep-23   Unique Invoice No:  "123456789"
"12345 Main Street"
"City AB"
"T4A 1B7"

"Account Number 1234567"
"-------------------------------------------------------------------------------"
"REPORT - INDIVIDUAL DETAILS"

--------------
"C:"    "1234567890"
"Subscriber Name:"  "NAME.NAME SPARE"
"Additional line user name:"    ""
"Sublevel:" " "
"Sublevel:" ""
"Reference 1:"  ""
"Reference 2:"  ""

"Handset Transparency"
"Number/Device Information" ""
"Starting Balance"  $0.00
"Last Month's Balance"  $0.00
"Current Balance"   $0.00
"Monthly Credit"    $0.00
"Monthly Balance Adjust"    $0.00


"CURRENT CHARGES"
"Monthly Service Plan"  $40.00
"Additional Local Airtime"  $0.00
"Long Distance Charges" $22.40
"Roaming Charges"   $0.00
"Total Taxes:"  $7.49
"Total Current Charges:"    $69.89

"MONTHLY SERVICE PLAN"  01-Oct-23 to 31-Oct-23
"Service Plan Name" "Total"
"Business SharePro 5GB Q1 offer (01-Oct-23 to 31-Oct-23)"   $40.00
"Total Monthly Service Plan Charges"    $40.00

"ADDITIONAL LOCAL AIRTIME"
"Service"   "Total Airtime" "Free Airtime"  "Included Airtime"  "Chargeable Airtime"    "Total"
"Phone (minutes)"   28:00   0:00    28:00   0:00    $0.00
"Total Additional Local Airtime Charges"    $0.00

"LONG DISTANCE CHARGES"
"Service"   "Total LD Minutes"  "Free LD Minutes"   "Included LD Minutes"   "Chargeable LD Minutes" "Total"
"Domestic Phone"    28:00   0:00    0:00    28:00   $22.40
"Total Long Distance Charges"   $22.40

"ROAMING"
"Service"   "Roaming Minutes"   "Roaming Charges"   "Roaming LD Minutes"    "Roaming LD Charges"    "Roaming Surcharge" "Total"
"Total Roaming Charges" $0.00

"DO MORE DATA SERVICES"
"Service"   "Total Events"  "Event Type"    "Total"
"Total Do More Data Services Charges"   $0.00

"DO MORE VOICE SERVICES"
"Service"   "Total Events"  "Event Type"    "Total"
"Total Do More Voice Services Charges"  $0.00

"PAGER SERVICES"
"Service"   "Total Messages"    "Included Messages" "Chargeable Messages"   "Total"
"Total Pager Charges"   $0.00

"VALUE-ADDED SERVICES"  01-Oct-23 to 31-Oct-23
"Service"   "Total"
"Can - Can/US LD $0.80/min (01-Oct-23 to 31-Oct-23)"    $0.00
"Easy Roam INTL - $16/day Business (01-Oct-23 to 31-Oct-23)"    $0.00
"Easy Roam US - $14/day Business (01-Oct-23 to 31-Oct-23)"  $0.00
"UL Can - Can LD min (01-Oct-23 to 31-Oct-23)"  $0.00
"UL domestic SMS / MMS (01-Oct-23 to 31-Oct-23)"    $0.00
"Visual Voicemail (01-Oct-23 to 31-Oct-23)" $0.00
"Total Value Added Service Charges" $0.00

"OTHER CHARGES AND CREDIT"
"Charge or Credit"  "Total"
"Total Other Charges and Credits"   $0.00

"OTHER FEES"
"Service"   "Total"
"Other Fees"    $0.00

"TAXES"
""  "Total"
"GST"   $3.12
"PST - BC"  $4.37
"Total Taxes"   $7.49

After tons of playing around, I used the following steps;

  1. Filtered Rows (removed top rows)
  2. Added an index column, then a conditional column to return the value of '-----', which I then 'filled up'.
  3. Filtered again to keep only the lines actually needed
  4. Split the single column by delimiter (tab) as this is how the '.txt' file is presented
  5. Removed the (what will eventually be) headers column as it was messing with me.
  6. Grouped the rows from the conditional column and drilled down to obtain a list
  7. Final step was use a 'Table.Combine' from the list.

So, now I have the rows displaying as they should but here's the next problem. Not every user (C:) has the same number of rows (charges) so the data spills over many columns that are no longer where they should be.

Is there any way to solve for that? Is something like this more suited for Python?

Upvotes: 1

Views: 252

Answers (1)

Sam Nseir
Sam Nseir

Reputation: 12146

See if this helps move you forward...

let
    Source = Csv.Document(File.Contents("C:\5780859.txt"),[Delimiter="#(tab)", Columns=2, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Customer", each if [Column1] = "C:" then [Column2] else null),
    #"Filled Down" = Table.FillDown(#"Added Custom",{"Customer"}),
    #"Filtered Rows" = Table.SelectRows(#"Filled Down", each [Customer] <> null and [Customer] <> ""),
    #"Filtered Rows1" = Table.SelectRows(#"Filtered Rows", each [Column1] <> null and [Column1] <> ""),
    #"Added Custom1" = Table.AddColumn(#"Filtered Rows1", "Section", each if [Column1] = Text.Upper([Column1]) then [Column1] else null),
    #"Filled Down1" = Table.FillDown(#"Added Custom1",{"Section"}),
    #"Filtered Rows2" = Table.SelectRows(#"Filled Down1", each ([Section] = "C:" or [Section] = "CURRENT CHARGES" or [Section] = "OTHER CHARGES AND CREDIT" or [Section] = "OTHER FEES" or [Section] = "TAXES")),
    #"Filtered Rows3" = Table.SelectRows(#"Filtered Rows2", each ([Column2] <> null and [Column2] <> "") and ([Column1] = "C:" or [Column1] = "Subscriber Name:" or [Section] <> "C:"))
in
    #"Filtered Rows3"

enter image description here

Upvotes: 3

Related Questions