PyTorch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows"
I have sentences that I vectorize with the sentence_vector() method of the Python module BiobertEmbedding (https://pypi.org/project/biobert-embedding/). For some groups of sentences I have no problem, but for others I get the following error message:
File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector
    encoded_layers = self.eval_fwdprop_biobert(tokenized_text)
File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 82, in eval_fwdprop_biobert
    encoded_layers, _ = self.model(tokens_tensor, segment_tensors)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 268, in forward
    position_embeddings = self.position_embeddings(position_ids)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237
I found that for some groups of sentences the problem was related to tags such as <tb>. But for others, even with the tags removed, the error message is still there.
(Unfortunately, I can't share the code for confidentiality reasons.)
Do you have any idea what the problem could be?
Thanks in advance
EDIT: You are right, cronoik, it will be better with an example.
from biobert_embedding.embedding import BiobertEmbedding
sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."]
biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')
vectors = [biobert.sentence_vector(doc) for doc in sentences]
As far as I can tell, it is this last line of code that raises the error message.
2 Answers
The problem is that the biobert-embedding module does not take care of the maximum sequence length of 512 (tokens, not words!). This is the relevant source code. Have a look at the example below to reproduce the error you received:
from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#this call raises the error from the question
biobert.sentence_vector(longersentence)
Output:
sentence has 512 tokens
longersentence has 513 tokens
#your error message....
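To find out which inputs will trigger the error before the model is run, one option is to count tokens up front. The sketch below assumes a `tokenize` callable that returns a list of tokens (in this thread that would be `biobert.process_text`). `find_too_long` and its parameters are names made up for illustration, and a plain whitespace split stands in for the real tokenizer so that the snippet runs on its own.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 512  # number of positions BERT can embed


def find_too_long(texts: Sequence[str],
                  tokenize: Callable[[str], List[str]],
                  max_tokens: int = MAX_TOKENS) -> List[int]:
    """Return the indices of the texts whose token count exceeds max_tokens."""
    return [i for i, text in enumerate(texts)
            if len(tokenize(text)) > max_tokens]


# Stand-in tokenizer for the demo; swap in biobert.process_text for real data.
texts = ["short text", "word " * 600]
too_long = find_too_long(texts, str.split)
print(too_long)
```

With a real word-piece tokenizer the counts will be higher than a whitespace split suggests, so the check should always use the same tokenizer as the model.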
What you need to do is implement a sliding window approach to process these texts:
import torch
from biobert_embedding.embedding import BiobertEmbedding
maxtokens = 512
startOffset = 0
docStride = 200
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)
    # `encoded_layers` has shape [12 x 1 x n_tokens x 768]
    # `token_vecs` is a tensor with shape [n_tokens x 768]
    token_vecs = encoded_layers[11][0]
    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding

for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)
    while startOffset < len(docTokens):
        length = min(len(docTokens) - startOffset, maxtokens)
        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
            docTokens[startOffset:startOffset + length]
            , biobert))
        #stop when the whole document is processed (document has fewer than 512
        #tokens or the last document slice was processed)
        if startOffset + length == len(docTokens):
            break
        #move the window forward by the stride
        startOffset += min(length, docStride)
    #reset the offset for the next document
    startOffset = 0
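The window arithmetic can be checked without loading the model. The following is a minimal sketch of the same offset logic as a standalone function; `window_spans` is a hypothetical helper name, not part of biobert-embedding, and the defaults simply mirror `maxtokens` and `docStride` above.

```python
from typing import List, Tuple


def window_spans(n_tokens: int,
                 max_tokens: int = 512,
                 stride: int = 200) -> List[Tuple[int, int]]:
    """Return (start, end) slices that cover n_tokens with overlapping windows."""
    if max_tokens <= 0 or stride <= 0:
        raise ValueError("max_tokens and stride must be positive")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # this window reached the end of the document
            break
        start += min(end - start, stride)
    return spans


print(window_spans(513))
print(window_spans(1000))
```

Each span can then be used as `docTokens[start:end]`. With these defaults, consecutive windows overlap by 312 tokens, so a smaller overlap (a larger stride) means fewer forward passes per document.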
PS: Your partial success with removing <tb> was possible because removing <tb> removes 4 tokens ('<', 't', '##b', '>').
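To see why a short tag costs several tokens: BERT-style tokenizers first split off punctuation and then break each remaining piece into sub-words by greedy longest-match against the vocabulary. The sketch below is a simplified illustration with a tiny made-up vocabulary (`TOY_VOCAB`); it is not the real BioBERT tokenizer or vocabulary, and all function names are hypothetical.

```python
import string
from typing import List, Set

# Tiny made-up vocabulary for illustration only.
TOY_VOCAB: Set[str] = {"<", ">", "t", "##b", "the", "tag"}


def split_punctuation(text: str) -> List[str]:
    """Split on whitespace and make every punctuation character its own piece."""
    pieces, current = [], ""
    for ch in text:
        if ch.isspace() or ch in string.punctuation:
            if current:
                pieces.append(current)
                current = ""
            if not ch.isspace():
                pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word; '[UNK]' if no split exists."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens


def tokenize(text: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    return [tok for piece in split_punctuation(text)
            for tok in wordpiece(piece, vocab)]


print(tokenize("the <tb> tag"))
```

A text that sits just over the limit can therefore drop below it once a few such tags are removed, while a much longer text stays over the limit regardless.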
Since the original BERT has a positional encoding of size 512 (0 to 511) and bioBERT is derived from BERT, it is no surprise to get an index error for 512. However, it is a bit strange that you are able to access 512 for some sentences, as you mentioned.
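The off-by-one is easy to reproduce with any 512-row table: valid row indices run from 0 to 511, so a 513th token asks for row 512, which does not exist. The sketch below uses a plain Python list as a stand-in for the position-embedding table. It is not PyTorch's embedding layer, `lookup_positions` is a hypothetical helper, and the hidden size is shrunk to keep the example small.

```python
from typing import List

MAX_POSITIONS = 512   # rows in the stand-in position table
HIDDEN_SIZE = 4       # tiny stand-in for BERT's 768 dimensions

# One row per position, indices 0..511.
position_table: List[List[float]] = [
    [0.0] * HIDDEN_SIZE for _ in range(MAX_POSITIONS)
]


def lookup_positions(n_tokens: int) -> List[List[float]]:
    """Fetch one row per token position, as an embedding lookup would."""
    return [position_table[pos] for pos in range(n_tokens)]


print(len(lookup_positions(512)))   # every position 0..511 has a row

try:
    lookup_positions(513)           # position 512 has no row
except IndexError as err:
    print("IndexError:", err)
```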