Matching a "byte span" to a text document in Python

Question

I'm working with an annotated corpus that contains two sets of .txt files. The first set contains the annotated documents (e.g., articles, blog posts, etc.), and the second set contains the actual annotations. An annotation is matched to the text it annotates through a "byte span". From the readme file:

"The span is the starting and ending byte of the annotation in 
the document.  For example, the annotation listed above is from 
the document, temp_fbis/20.20.10-3414.  The span of this annotation 
is 730,740.  This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation."

So the question is: how do I index the starting and ending byte in the document so that I can match the annotation to the text in the original document? Any ideas? I'm working in Python on this...

python nlp tagged-corpus

user641688 Oct 28 '11 at 20:21

2 answers

Answer 1 · user176569 · Oct 28 '11 at 20:34

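# Open the file in binary mode, seek to the start byte, read up to the end byte.
start, end = 730, 740
with open("myfile", "rb") as f:  # "rb" so that offsets count bytes, not characters
    f.seek(start)
    # The end byte is exclusive: it is the byte after the annotation.
    while start < end:
        byte = f.read(1)
        # Do stuff with byte.
        start += 1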
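
Since the end byte is excluded from the span, the same bytes can also be fetched with a single read of `end - start` bytes instead of a byte-by-byte loop. The following is only a sketch: `read_span` is a hypothetical helper name, the sample file is made up for the demo, and UTF-8 is an assumption because the readme does not say which encoding the documents use.

```python
import os
import tempfile


def read_span(path, start, end, encoding="utf-8"):
    """Return the text stored in bytes [start, end) of the file at path.

    Hypothetical helper; the encoding is an assumption, not from the readme.
    """
    with open(path, "rb") as f:  # binary mode, so seek() counts bytes
        f.seek(start)
        raw = f.read(end - start)  # end is exclusive
    return raw.decode(encoding, errors="replace")


# Demo on a throwaway file with made-up content.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"blah, blah, blah, example annotation, blah, blah, blah")
    demo_path = tmp.name

span_text = read_span(demo_path, 18, 36)
os.remove(demo_path)
print(span_text)
```

For the real corpus the call would presumably look like `read_span("docs/temp_fbis/20.20.10-3414", 730, 740)`. Decoding with `errors="replace"` is a defensive choice in case a span happens to cut through a multi-byte character.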

Answer 2 · user176569 · Oct 28 '11 at 20:27

"This means that the start of this annotation is 
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 
is the character after the last character of the annotation.

     blah, blah, blah, example annotation, blah, blah, blah
                       |                 |
                  start byte          end byte

The data_type of all annotations should be 'string'."
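
The diagram describes a half-open interval: the start byte is included and the end byte is not, which is the same convention Python slices use. The sketch below uses made-up sample strings and assumes UTF-8 text. It also shows why the slice should be taken on the raw bytes rather than on the decoded string: once a document contains non-ASCII characters, byte offsets and character offsets no longer line up.

```python
# Half-open span on bytes: doc[start:end] excludes the end byte.
doc = "blah, blah, blah, example annotation, blah, blah, blah".encode("utf-8")
start = doc.find(b"example")
end = start + len(b"example annotation")
annotation = doc[start:end].decode("utf-8")

# With non-ASCII text, byte and character offsets differ.
text = "caf\u00e9: example annotation"  # the accented e takes 2 bytes in UTF-8
data = text.encode("utf-8")
b_start = data.find(b"example")  # offset counted in bytes
c_start = text.find("example")   # offset counted in characters

from_bytes = data[b_start:b_start + 18].decode("utf-8")  # correct slice
from_chars = text[b_start:b_start + 18]                  # shifted by one

print(annotation)
print(b_start, c_start)
print(repr(from_bytes), repr(from_chars))
```

Applying the byte offset to the decoded string drops the first letter here, so it is safer to read the file in binary mode, slice, and decode only the slice.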
