Unable to read the whole PDF with Camelot

I used camelot to read a PDF file, but I can only get part of it.

How can I read the whole page?

import camelot
import pandas as pd
tables = camelot.read_pdf('data.pdf', pages='all', flavor='stream')
df = tables[0].df

The resulting df is:

                                              0            1  \
0                                                               
1   Land Parcel                                   City          
2                                                               
3                                                               
4   Land Parcel No. CTP-1813                      Cangzhou 滄州   
5   .\n.\n.\n.\n.\n.\n.\n.\n.\n.\nCTP-1813 號地塊 .                
6   Land Parcel No. 2018GC22026                   Beihai 北海     
7   .\n.\n.\n.\n.\n.\n.\n2018GC22026 號地塊.                       
8                                                               
9                                                               
10                                                              
11                                                              
12  Land parcels A, B, C and D for                Guigang 貴港    
13  the commercial and residential                              
14  project\nin Station Plaza at                                

                      2          3          4  
0                                   Land       
1   Land Use             Site Area  Premium    
2                                   (RMB       
3                        (sq.m.)    thousand)  
4   Commercial and       97,407.3   759,400    
5   residential                                
6   Wholesale,\nretail,  159,878.4  1,067,260  
7   residential,                               
8   catering,                                  
9   commercial and                             
10  financial and                              
11  residential                                
12  Commercial and       139,600.2  631,870    
13  residential                                
14                               

I also tried Tabula, which returned more of the content, but still not all of it.

2 Answers

You can try the following code, which uses the table_areas parameter to specify the table boundaries:

tables = camelot.read_pdf('data.pdf', pages='1', flavor='stream', table_areas=['0,800,800,0'])

More details at https://camelot-py.readthedocs.io/en/master/user/advanced.html.
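
To make the coordinate order less error-prone, here is a minimal sketch. Each table_areas entry is a string of the form x1,y1,x2,y2, where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner in PDF coordinates (origin at the bottom-left of the page). The helper names area_string and full_page_area are hypothetical, and the 595 x 842 page size (A4 in points) is only an assumption about data.pdf.

```python
def area_string(x1, y1, x2, y2):
    """Format one table area: top-left (x1, y1), bottom-right (x2, y2)."""
    # PDF coordinates grow upward, so the top edge has the larger y value.
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected x1 < x2 and y1 > y2 (PDF origin is bottom-left)")
    return f"{x1},{y1},{x2},{y2}"


def full_page_area(width, height):
    """Area string covering a whole page of the given size (in points)."""
    return area_string(0, height, width, 0)


# Assumed A4 page; replace with the real page size of your PDF.
print(full_page_area(595, 842))
```

The resulting string can then be passed as table_areas=[full_page_area(595, 842)] in the camelot.read_pdf call above.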

I'm not sure why Camelot doesn't work. Try pdfminer instead; it works well on your sample:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

pdf_rm = PDFResourceManager()
with StringIO() as s:
    with TextConverter(pdf_rm, s, laparams=LAParams()) as d:
        with open('data.pdf', 'rb') as f:
            interpreter = PDFPageInterpreter(pdf_rm, d)
            # Run every page through the interpreter; the text accumulates in s.
            for page in PDFPage.get_pages(f):
                interpreter.process_page(page)
            # Read the buffer while it is still open; the with block closes it.
            text = s.getvalue()

print(text)

Output:

Land Parcel

City

Land Use

Site Area

Land Parcel No. CTP-1813

CTP-1813 號地塊 . . . . . . . . . . .

Land Parcel No. 2018GC22026

2018GC22026 號地塊. . . . . . . .

Land parcels A, B, C and D for the commercial and residential project in Station Plaza at Guigang City 貴港市高鐵站前廣場商住項目 A、B、C及D地塊 . . . . . . . . . . .

Land Parcel No. 201821 and

No. 201822 為201821號及201822 號地塊. . Land Parcel No. QZ(18)049 and

No. QZ(18)050 QZ(18)049號和QZ(18)050號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land Parcel

No. 630102102006GB00321 630102102006GB00321 號地塊 . . . . . . . . . . . . . . . . . . . .

Land Parcel No. Xing Zheng

Chu (2018)45-1 滎政儲(2018)45-1號地塊 . . . . .

Land Parcel

No. XH2018GC012-1, No. XH2018GC012-2 and No. XH2018GC012-3 XH2018GC012-1號、 XH2018GC012-2號和 XH2018GC012-3號地塊. . . . . .

Land Parcel No. 2018-52

2018-52號地塊 . . . . . . . . . . . . .

Land Parcel B No. Yan

J[2018]Z003 of the Xikou Old Residence Renovation 煙J[2018]Z003號西口舊居改造 B地塊. . . . . . . . . . . . . . . . . . . . .

of Guihuang Road in Chengxin District 靈川縣城新區桂黃公路東側地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land Parcel No. BS18-1J-307

BS18-1J-307號地塊 . . . . . . . . .

Land Parcel No. Jing Tu Zheng

Chu Gua (Shun) [2018]043 京土整儲掛(順)[2018]043號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land

Premium

(RMB

thousand) 759,400

Cangzhou 滄州 Commercial and

(sq.m.) 97,407.3

Beihai 北海

residential

Wholesale, retail,

159,878.4

1,067,260

residential, catering, commercial and financial and residential

Guigang 貴港 Commercial and

residential

139,600.2

631,870

Yancheng 鹽城 Commercial and

167,738.0

339,400

residential

Guiyang 貴陽 Commercial and

117,023.0

342,050

residential

Xining 西寧

Commercial and

77,075.5

404,635

residential

Xingyang 滎陽 Commercial

72,351.7

260,400

Taizhou 泰州

Commercial and

217,681.3

728,520

residential

Xuzhou 徐州

Residential

74,448.6

1,203,000

Yantai 煙臺

Residential,

107,015.1

205,776

commercial service, public management and public service

Commercial and

63,442.7

62,820

residential

Chongqing 重慶 Residential

136,246.3

238,700

Beijing 北京

Class-2

69,856.0

2,330,000

residential, institutional pension facilities and basic educational

– 4 –

Land Parcel located to the east

Guilin 桂林
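
The text above comes back as blocks separated by blank lines, so a possible next step is to split it into a list of blocks before any further processing. This is only a sketch: split_blocks is a hypothetical helper, and it does not rebuild the table rows or columns, it just makes the blocks easier to work with.

```python
import re


def split_blocks(text):
    """Split extracted text into blocks separated by one or more blank lines."""
    blocks = []
    for raw in re.split(r"\n\s*\n", text):
        # Collapse line breaks and repeated spaces inside a block.
        cleaned = " ".join(raw.split())
        if cleaned:
            blocks.append(cleaned)
    return blocks


sample = "Land Parcel\n\nCity\n\n\nLand  Use\n"
print(split_blocks(sample))
```

With the real output you would call split_blocks(text) on the string produced by the pdfminer code.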