PDF Extractor SDK for Windows software developers: PDF to Text, PDF to XML, Images from PDF, Read PDF information, PDF to CSV for Excel.
Bytescout PDF Extractor SDK converts PDF to text, XML, and CSV, extracts images from PDF, and reads information about PDF files through .NET and ActiveX interfaces, with no additional software required.
- converts PDF to plain text (and can follow columns if you are converting a newspaper in PDF format!), including invisible text extraction;
- converts tables in PDF to Excel (CSV) by reading cells from a given rectangle;
- converts tables in PDF to XML files;
- extracts PDF file metadata (title, author, description) and gets other information about the file (number of pages, encrypted or not);
- extracts embedded images from a PDF document (in ASP.NET, VB.NET, C#, VB6 and VBScript);
- NEW: DocumentMerger and DocumentSplitter interfaces and classes to merge and split PDF documents;
- doesn't require Adobe Reader or any other PDF reader software to be installed;
- provides .NET and ActiveX interfaces;
- made with 100% managed C# code.
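
To illustrate what "reading cells from a given rectangle" can mean in practice, here is a minimal Python sketch. It does not use the SDK and is not its API: the `Word` type, the `cells_in_rectangle` and `to_csv` functions, the coordinate convention (points measured from the top-left of the page), and the sample data are all assumptions made for this example. It keeps only the words whose position falls inside the rectangle, groups them into rows by vertical position, and assigns them to columns using caller-supplied column edges.

```python
import csv
import io
from bisect import bisect_right
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical positioned word; x and y are measured from the top-left of the page."""
    text: str
    x: float
    y: float


def cells_in_rectangle(words, left, top, width, height, column_edges, row_tolerance=2.0):
    """Return a list of rows (lists of cell strings) for words inside the rectangle.

    column_edges is a sorted list of x positions separating the columns.
    Words whose y values differ by at most row_tolerance share a row.
    """
    inside = [w for w in words
              if left <= w.x < left + width and top <= w.y < top + height]
    inside.sort(key=lambda w: (w.y, w.x))

    rows, row_ys = [], []
    for w in inside:
        # Start a new row when the word is too far below the current row.
        if not rows or abs(w.y - row_ys[-1]) > row_tolerance:
            rows.append([[] for _ in range(len(column_edges) + 1)])
            row_ys.append(w.y)
        column = bisect_right(column_edges, w.x)
        rows[-1][column].append(w)

    # Join the words of each cell in left-to-right order.
    return [[" ".join(w.text for w in sorted(cell, key=lambda w: w.x)) for cell in row]
            for row in rows]


def to_csv(rows):
    """Serialize rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample data: a two-column table plus a page footer outside the rectangle.
sample_words = [
    Word("Item", 50, 100), Word("Price", 200, 100),
    Word("Green", 50, 120), Word("tea", 85, 120.5), Word("4.50", 200, 120),
    Word("Page", 50, 700), Word("1", 80, 700),
]
table = cells_in_rectangle(sample_words, left=40, top=90, width=300, height=50,
                           column_edges=[150])
csv_text = to_csv(table)
```

Passing the column edges explicitly keeps the sketch short; a real extractor would typically have to detect column boundaries itself.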
Added filtering of extracted content by font name, font size and color.
Updated the OCR engine to the latest version. Update the language files from the 'tessdata' folder.
Improved text extraction, line grouping in tabular data, performance, XFA form extraction, and TableDetector; fixed PDF parsing issues.
added TextComparer utility class (available in .NET 4.0 assemblies only) that compares text in two PDF documents and generates a report;
improved support of ICC color profiles;
improved handling of embedded fonts;
fixed XMLExtractor.SaveXMLToStream() method.
PDF to XML, PDF to CSV, PDF to Text functions improved
now supports text extraction from text controls
XML extractor now adds font style, size, name, text coordinates into <text> tags
ASP.NET sample for OCR usage added
new property OCRLanguageDataFolder to specify the location of OCR data
OCR (text from PDF images) functionality: you can now extract text from embedded images and repair damaged text
fixed an issue where the CSV and XML extractors omitted the last columns with some settings
improved support for damaged PDF files
multiline text search with word matching modes
improved stability of PDF to text
issue with the very last text line missing in some PDF files fixed
tables with empty cells are handled better now
issues fixed: incorrect extraction of overlapped text objects, missing spaces between words in some files, minor issues with text search
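
The release notes above mention filtering extracted content by font name, font size and color. The following Python sketch shows the general idea only. It is not the SDK's API: the `TextRun` type, the `filter_runs` function, its parameters, and the sample runs are hypothetical, and colors are assumed to be hex RGB strings.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TextRun:
    """Hypothetical piece of extracted text with its font attributes."""
    text: str
    font_name: str
    font_size: float
    color: str  # hex RGB string such as "#000000" (an assumption of this sketch)


def filter_runs(runs: Iterable[TextRun],
                font_name: Optional[str] = None,
                min_size: Optional[float] = None,
                max_size: Optional[float] = None,
                color: Optional[str] = None) -> List[TextRun]:
    """Keep the runs that match every criterion that was supplied.

    A criterion left as None is ignored. Font names and colors are
    compared case-insensitively.
    """
    selected = []
    for run in runs:
        if font_name is not None and run.font_name.lower() != font_name.lower():
            continue
        if min_size is not None and run.font_size < min_size:
            continue
        if max_size is not None and run.font_size > max_size:
            continue
        if color is not None and run.color.lower() != color.lower():
            continue
        selected.append(run)
    return selected


# Made-up sample content.
sample_runs = [
    TextRun("Quarterly Report", "Arial-Bold", 18, "#000000"),
    TextRun("Revenue grew.", "Times", 11, "#000000"),
    TextRun("DRAFT", "Arial-Bold", 18, "#FF0000"),
]
black_headings = filter_runs(sample_runs, font_name="arial-bold", color="#000000")
large_text = filter_runs(sample_runs, min_size=12)
```

Combining the criteria with a logical AND is a design choice of this sketch; it makes a query such as "black text in the heading font" a single call.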