Reputation: 6605
How to Extract text with format from PDF or XPS using C#?
I have some PDF/XPS files which are generated by another reporting software. The file mainly includes tables which list some data.
iText can extract the text from the pdf files, but then lost some format, for example, for below table, the extracted text is:
Faults
Count FMI Lookup Code Description Component Status
Active Body Controller Heating Ventilation/Air Conditioning (HVAC) Control
Head Air Inlet DM1. HVAC motor in wrong position or
jammed
SPN 3984 2 126
Active Engine SAE - Catalyst 1 System Monitor - Root cause not known SID 380 11 N/A
Inactive Engine SAE - Crankcase Pressure - Data valid but above normal
operational range - Most severe level
PID 101 0 N/A
Inactive Engine SAE - Crankcase Pressure - Data erratic, intermittent or
incorrect
PID 101 2 N/A
The problem is the text in different columns are on the same line, which makes it almost impossible to know exactly which text is for which column. And unfortunately, I need to save data in different columns to different field in database.
I also tried to convert the PDF to html, but then found that html does not include the actual text, which uses SVG in the html. So I could not get actual text.
Is there a way to do this using C#? Any suggestions? Any library, better free ones?
Thanks
Upvotes: 5
Views: 1831
Reputation: 1450
You can extract formatted text using Docotic.Pdf (disclaimer: I am the co-author). Here is the basic sample code:
using (var pdf = new PdfDocument("your_document.pdf"))
{
string formattedText = pdf.GetTextWithFormatting();
using (var writer = new StreamWriter("formatted.txt"))
writer.Write(formattedText);
}
After that you can detect columns by whitespaces. For example, treat a sequence of 3+ whitespaces as a column separator.
You can find other text extraction techniques in this article. For example, these methods might be useful too:
Upvotes: 3
Reputation: 3235
If you know all possible values for 'Status', 'Component' and 'Lookup Code' prefix you can use such an approach: you can see that every entry is strucured as 'Status-Component-Description-LookupCode- FMI-Count'. Add an entity:
class Fault
{
public string Count { get; set; }
public string FMI { get; set; }
public string LookupCode { get; set; }
public string Description { get; set; }
public string Component { get; set; }
public string Status { get; set; }
public override string ToString() =>
$"Status: {Status}; Component: {Component}; Description: {Description}; LookupCode: {LookupCode}; FMI: {FMI}; Count: {Count}";
}
And map your text input this way:
class Parser
{
private static readonly IReadOnlyList<string> statuses = new[]
{
"Active",
"Inactive"
// etc
};
private static readonly IReadOnlyList<string> components = new[]
{
"Body Controller",
"Engine"
// etc
};
private static readonly IReadOnlyList<string> lookupPrefixes = new[]
{
"SPN",
"SID",
"PID"
// etc
};
public static IEnumerable<Fault> Parse(string str)
{
var lines = str.Split(Environment.NewLine).Skip(2);
foreach(var group in GetGroups(lines))
{
var words = group.SelectMany(line => line.Split()).ToList();
var i = 1;
string status = default;
while (!statuses.Contains(status = string.Join(' ', words.Take(i)))) i++;
words = words.Skip(i).ToList();
i = 1;
string component = default;
while (!components.Contains(component = string.Join(' ', words.Take(i)))) i++;
words = words.Skip(1).Reverse().ToList();
string count = words[0];
string fmi = words[1];
words = words.Skip(2).ToList();
i = words.FindIndex(word => lookupPrefixes.Contains(word)) + 1;
string code = string.Join(' ', words.Take(i).Reverse());
string description = string.Join(' ', words.Skip(i).Reverse());
yield return new Fault
{
Status = status,
Component = component,
Description = description,
LookupCode = code,
FMI = fmi,
Count = count
};
}
}
private static IEnumerable<IEnumerable<string>> GetGroups(IEnumerable<string> lines)
{
var list = new List<string> { lines.First() };
foreach (var line in lines.Skip(1))
{
if(statuses.Any(status => line.StartsWith(status)))
{
yield return list;
list = new List<string>();
}
list.Add(line);
}
yield return list;
}
}
Then you can use it:
class Program
{
private static readonly string input =
@"Faults
Count FMI Lookup Code Description Component Status
Active Body Controller Heating Ventilation/Air Conditioning(HVAC) Control
Head Air Inlet DM1.HVAC motor in wrong position or
jammed
SPN 3984 2 126
Active Engine SAE - Catalyst 1 System Monitor - Root cause not known SID 380 11 N/A
Inactive Engine SAE - Crankcase Pressure - Data valid but above normal
operational range - Most severe level
PID 101 0 N/A
Inactive Engine SAE - Crankcase Pressure - Data erratic, intermittent or
incorrect
PID 101 2 N/A";
static void Main()
{
new Program().Run();
}
private void Run()
{
foreach (var result in Parser.Parse(input))
Console.WriteLine(result);
}
}
and get:
Status: Active; Component: Body Controller; Description: Controller Heating Ventilation/Air Conditioning(HVAC) Control Head Air Inlet DM1.HVAC motor in wrong position or jammed; LookupCode: SPN 3984; FMI: 2; Count: 126
Status: Active; Component: Engine; Description: SAE - Catalyst 1 System Monitor - Root cause not known; LookupCode: SID 380; FMI: 11; Count: N/A
Status: Inactive; Component: Engine; Description: SAE - Crankcase Pressure - Data valid but above normal operational range - Most severe level; LookupCode: PID 101; FMI: 0; Count: N/A
Status: Inactive; Component: Engine; Description: SAE - Crankcase Pressure - Data erratic, intermittent or incorrect; LookupCode: PID 101; FMI: 2; Count: N/A
The solution is subject to optimizations.
Upvotes: 1