pyPdf ignores line breaks in a PDF file

Question

I am trying to extract each page of a PDF as a string:

import pyPdf

pages = []
pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb'))
for i in range(0, pdf.getNumPages()):
    this_page = pdf.getPage(i).extractText() + "\n"
    this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split())
    pages.append(this_page.encode("ascii", "xmlcharrefreplace"))
for page in pages:
    print '*' * 80
    print page

But this script ignores newline characters, leaving me with jumbled strings such as "information concerning an individual which, because of name, identifyingnumber, mark or description" (that is, it should read "identifying number", not "identifyingnumber").

Here is an example of the kind of PDF I am trying to parse.

python string pdf unicode pypdf

user593956, Jun 13 '12 at 14:43

3 Answers

Solution

pyPdf is not really designed for this kind of text extraction. Try pdfminer, or use pdftotext or something similar if you do not mind spawning another process.
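A minimal sketch of the pdftotext route, assuming the pdftotext command-line tool (shipped with Poppler and Xpdf) is installed and on PATH. The helper names build_pdftotext_command and pdf_to_text are made up for this illustration. The -layout option asks the tool to keep the physical layout, -enc sets the output encoding, and "-" as the output file sends the text to standard output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical page layout
    command += [pdf_path, "-"]     # "-" as the output file means stdout
    return command


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext in a child process and return the decoded text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    raw = subprocess.check_output(build_pdftotext_command(pdf_path, keep_layout))
    return raw.decode("utf-8", "replace")
```

By default pdftotext separates pages with a form feed character, so something like pdf_to_text('g-reg-101.pdf').split("\f") should give roughly one string per page.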

user601581, Jun 26 '12 at 14:27

Expanding on DSM's answer: below is how you could implement it by subclassing a couple of classes.

import PyPDF2
from PyPDF2.generic import TextStringObject
from PyPDF2.pdf import ContentStream, IndirectObject, NameObject
from PyPDF2.utils import b_, u_

class PageObject2(PyPDF2.pdf.PageObject):
    def extractText(self, Tj_sep="", TJ_sep=""):
        """
        Locate all text drawing commands, in the order they are provided in the
        content stream, and extract the text.  This works well for some PDF
        files, but poorly for others, depending on the generator used.  This will
        be refined in the future.  Do not rely on the order of text coming out of
        this function, as it will change if this function is made more
        sophisticated.

        :return: a unicode string object.
        """
        text = u_("")
        content = self["/Contents"].getObject()
        if not isinstance(content, ContentStream):
            content = ContentStream(content, self.pdf)
        # Note: we check all strings are TextStringObjects.  ByteStringObjects
        # are strings where the byte->string encoding was unknown, so adding
        # them to the text here would be gibberish.
        for operands, operator in content.operations:
            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += Tj_sep
                    text += _text
            elif operator == b_("T*"):
                text += "\n"
            elif operator == b_("'"):
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == b_('"'):
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += TJ_sep
                        text += i
                text += "\n"
        return text


class PdfFileReader2(PyPDF2.PdfFileReader):
    def _flatten(self, pages=None, inherit=None, indirectRef=None):
        inheritablePageAttributes = (
            NameObject("/Resources"), NameObject(
                "/MediaBox"),
            NameObject("/CropBox"), NameObject("/Rotate")
        )
        if inherit is None:
            inherit = dict()
        if pages is None:
            self.flattenedPages = []
            catalog = self.trailer["/Root"].getObject()
            pages = catalog["/Pages"].getObject()

        t = "/Pages"
        if "/Type" in pages:
            t = pages["/Type"]

        if t == "/Pages":
            for attr in inheritablePageAttributes:
                if attr in pages:
                    inherit[attr] = pages[attr]
            for page in pages["/Kids"]:
                addt = {}
                if isinstance(page, IndirectObject):
                    addt["indirectRef"] = page
                self._flatten(page.getObject(), inherit, **addt)
        elif t == "/Page":
            for attr, value in list(inherit.items()):
                # if the page has its own value, it does not inherit the
                # parent's value:
                if attr not in pages:
                    pages[attr] = value
            pageObj = PageObject2(self, indirectRef)
            pageObj.update(pages)
            self.flattenedPages.append(pageObj)


# open the PDF in binary mode
pdf_file = open('travelers.pdf', 'rb')

# create a reader that builds PageObject2 pages
fileReader = PdfFileReader2(pdf_file)

# number of pages in the PDF file
no_of_pages = fileReader.numPages

# extract the first page, inserting a newline before each Tj string
page_no = 0
pageObj = fileReader.getPage(page_no)
page = pageObj.extractText(Tj_sep='\n')

user3842788, Jul 21 '21 at 10:46

user487339, Jun 19 '12 at 18:55 · Accepted Answer

I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what is going on:

def extractText(self):
    [...]
    for operands,operator in content.operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i

If the operator is Tj or TJ (it is Tj in your example PDF), the text is simply appended and no newline is added. Now, you would not necessarily want to add a newline, at least if I am reading the PDF reference correctly: Tj and TJ are simply the single and multiple show-string operators, and a separator of any kind is not required between them.
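To make that concrete, here is a toy model of the (operands, operator) pairs the loop above walks over. Plain Python strings and numbers stand in for pyPdf's objects, and the sample values are invented for illustration; only the shape matters. Tj carries a single string, while TJ carries an array that mixes strings with numeric position adjustments, and neither implies any whitespace.

```python
# Stand-in for content.operations: a list of (operands, operator) pairs.
# Real pyPdf operands are TextStringObject / NumberObject instances; plain
# str and int are used here only to keep the sketch self-contained.
operations = [
    (["because of name, identifying"], "Tj"),   # one show-string call
    (["number, mark or description"], "Tj"),    # a separate call for the next text
    ([["read", -15, "ily"]], "TJ"),             # strings mixed with kerning numbers
]


def shown_strings(operations):
    """Collect the strings drawn by Tj and TJ, one list entry per string."""
    strings = []
    for operands, operator in operations:
        if operator == "Tj":
            strings.append(operands[0])
        elif operator == "TJ":
            # Skip the numeric adjustments and keep only the text pieces.
            strings.extend(item for item in operands[0] if isinstance(item, str))
    return strings


pieces = shown_strings(operations)
fused = "".join(pieces)  # plain concatenation, as in the original method
```

Joining the pieces with no separator is what glues "identifying" and "number" together in the output.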

Anyway, if you change this code to something like

def extractText(self, Tj_sep="", TJ_sep=""):

[...]

        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += Tj_sep
                text += _text

[...]

        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += TJ_sep
                    text += i

then the default behaviour should stay the same:

In [1]: pdf.getPage(1).extractText()[1120:1250]
Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'

but you can change it when you want to:

In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '

or

In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '

Alternatively, you could add the separators yourself by modifying the operands in place, but that might break something else (methods like get_original_bytes make me nervous).

Finally, you don't need to edit pdf.py itself if you don't want to: you could simply pull this method out into a function.
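A rough sketch of that last suggestion: the same loop as a free function, so pdf.py stays untouched. The name extract_text_with_separators and its text_type parameter are assumptions made for this sketch. With pyPdf you would pass the list from content.operations and TextStringObject as text_type; the demonstration here uses plain str stand-ins and invented sample data. Operator names are compared as str, so they may need adapting for libraries that store them as bytes.

```python
def extract_text_with_separators(operations, text_type=str, Tj_sep="", TJ_sep=""):
    """Rebuild text from (operands, operator) pairs, adding optional separators."""
    parts = []
    for operands, operator in operations:
        if operator == "Tj":
            if isinstance(operands[0], text_type):
                parts.append(Tj_sep + operands[0])
        elif operator == "T*":
            parts.append("\n")
        elif operator == "'":
            parts.append("\n")
            if isinstance(operands[0], text_type):
                parts.append(operands[0])
        elif operator == '"':
            if isinstance(operands[2], text_type):
                parts.append("\n" + operands[2])
        elif operator == "TJ":
            for item in operands[0]:
                if isinstance(item, text_type):
                    parts.append(TJ_sep + item)
    return "".join(parts)


# Invented sample data standing in for content.operations.
sample = [
    (["identifying"], "Tj"),
    (["number, mark"], "Tj"),
    ([], "T*"),
    ([["or ", -20, "description"]], "TJ"),
]

default_text = extract_text_with_separators(sample)
spaced_text = extract_text_with_separators(sample, Tj_sep=" ")
```

With the default empty separators the function reproduces the fused output, and passing Tj_sep=" " puts a space in front of every Tj string, including the first one, so a final strip() may be wanted.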