Is there any solution for extracting a borderless table from a PDF to CSV?

Question

Is there any solution for extracting a borderless table from a PDF to CSV?

This is a sample image from my PDF file, which has 75 pages.

python tabula pymupdf

user13695629, Jun 8 '20 at 10:49

2 answers

Answer 1 · user7445528, Jun 8 '20 at 11:02

You can do this with Python and the tabula module. Since the table has no borders, you can first locate its area dynamically with my get_area function (change the page number, etc. to suit your file):

from tabula import convert_into, convert_into_by_batch, read_pdf
from tabulate import tabulate

def get_area(file):
    """Set and return the area from which to extract data from within a PDF page
    by reading the file as JSON, extracting the locations
    and expanding these.
    """
    tables = read_pdf(file, output_format="json", pages=2, silent=True)
    top = tables[0]["top"]
    left = tables[0]["left"]
    bottom = tables[0]["height"] + top
    right = tables[0]["width"] + left
    # print(f"{top=}\n{left=}\n{bottom=}\n{right=}")
    return [top - 20, left - 20, bottom + 10, right + 10]
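
The padding arithmetic in get_area can be tried out without a PDF. Below is a minimal sketch, assuming each JSON entry carries the "top", "left", "width" and "height" keys used above. The name expand_area and its pad parameters are hypothetical, and the clamp to zero is an added safeguard that is not part of the original function.

```python
def expand_area(table, pad_before=20, pad_after=10):
    """Return [top, left, bottom, right] for a table entry, padded outward.

    `table` is assumed to be a dict with "top", "left", "width" and "height"
    keys, like the entries read in get_area.
    """
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    # Clamp at zero so a table near the page edge never yields a negative coordinate.
    return [
        max(top - pad_before, 0),
        max(left - pad_before, 0),
        bottom + pad_after,
        right + pad_after,
    ]


area = expand_area({"top": 100.0, "left": 15.0, "width": 400.0, "height": 250.0})
print(area)
```

The returned list has the same [top, left, bottom, right] order that get_area passes as the area argument.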

Before converting, check that the first table is parsed the way you expect:

def inspect_1st_table(file: str):
    """Print the first rows of the first table found in the PDF"""
    df = read_pdf(
        file,
        # output_format="dataframe",
        multiple_tables=True,
        pages="all",
        area=get_area(file),
        silent=True,  # Suppress all stderr output
    )[0]
    print(tabulate(df.head()))

Then use this area to extract the table from the PDF to CSV:

def convert_pdf_to_csv(file: str):
    """Output all the tables in the PDF to a CSV"""
    convert_into(
        file,
        file[:-3] + "csv",
        output_format="csv",
        pages="all",
        area=get_area(file),
        silent=True,
    )
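
One caveat: file[:-3] + "csv" relies on the input name ending in a three-letter extension. A possible alternative is to derive the output name with pathlib. This is only a sketch, and csv_path_for is a hypothetical helper name, not part of tabula.

```python
from pathlib import Path


def csv_path_for(pdf_file):
    """Return the path of `pdf_file` with its extension replaced by .csv."""
    return str(Path(pdf_file).with_suffix(".csv"))


print(csv_path_for("reports/sample.pdf"))
```

The result could be passed as the second argument of convert_into in place of the slice expression.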

If you need to extract more than one table, again start by inspecting them:

def show_tables(file: str):
    """Read pdf into list of DataFrames"""
    tables = read_pdf(
        file, pages="all", multiple_tables=True, area=get_area(file), silent=True
    )
    for df in tables:
        print(tabulate(df))

And to batch-convert the tables of every PDF in a directory to CSV:

def convert_batch(directory: str):
    """convert all PDFs in a directory"""
    convert_into_by_batch(directory, output_format="csv", pages="all", silent=True)

Answer 2 · user13632128, Jun 8 '20 at 11:20

Camelot is a great option for extracting borderless tables. You can use the flavor='stream' option for the extraction.

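import camelot

tables = camelot.read_pdf('sample.pdf', flavor='stream', edge_tol=500, pages='1-end')

# Tables from all your pages are stored in the tables object
df = tables[0].df

# Write the first table to CSV (the output file name here is only an example)
df.to_csv('sample.csv')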
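
With a 75-page file you will probably want every table rather than just tables[0]. Here is a rough sketch of one way to append all extracted tables to a single CSV using the standard csv module. The helper names write_rows_to_csv and camelot_tables_to_csv are hypothetical, and the sketch assumes every table has a compatible column layout, which may not hold for your PDF.

```python
import csv


def write_rows_to_csv(tables_rows, out_path):
    """Write the rows of several tables, one after another, into one CSV file.

    `tables_rows` is a list of tables, each given as a list of rows
    (lists of cell values). Returns the number of rows written.
    """
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for rows in tables_rows:
            for row in rows:
                writer.writerow(row)
                written += 1
    return written


def camelot_tables_to_csv(pdf_path, out_path):
    """Extract stream-flavor tables with Camelot and write them to one CSV."""
    import camelot  # imported lazily so write_rows_to_csv works without Camelot

    tables = camelot.read_pdf(pdf_path, flavor="stream", pages="1-end")
    # Each table exposes a pandas DataFrame as `.df`; convert it to plain rows.
    return write_rows_to_csv([t.df.values.tolist() for t in tables], out_path)
```

Calling camelot_tables_to_csv('sample.pdf', 'sample.csv') would then write all pages into one file. Header rows repeated on each page are not removed by this sketch.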
