UnicodeDecodeError when using Biopython to fetch abstracts with efetch

Question

I have recently been using Biopython to retrieve some abstracts from PubMed. My code is written in Python 3, as shown below:

from Bio import Entrez

Entrez.email = "myemail@example.com"    # Always tell NCBI who you are


def get_number():    #Get the total number of abstract available in Pubmed
    handle = Entrez.egquery(term="allergic contact dermatitis ")
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"]=="pubmed":
            return int(row["Count"])


def get_id():    #Get all the ID of the abstract available in Pubmed
    handle = Entrez.esearch(db="pubmed", term="allergic contact dermatitis ", retmax=200)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    return idlist

idlist = get_id()

for ids in idlist:    #Download the abstract based on their ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")    # Retmode Can Be txt / json / xml / csv
    f = open("{}.txt".format(ids), "w")    # Create a TXT file with the name of ID
    f.write(handle.read())    #Write the abstract to the TXT file

I want to fetch 200 abstracts, but only three or four are downloaded before this error occurs:

UnicodeDecodeError: 'cp950' codec can't decode byte 0xc5 in position 288: illegal multibyte sequence

The problem seems to be in handle.read(), for abstracts that contain certain characters or words. I tried using print to find out the class of handle:

handle = Entrez.efetch(db="pubmed", id=idlist, rettype="abstract", retmode="text")
print(handle)

Output:

<_io.TextIOWrapper encoding='cp950'>
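The encoding shown in that output is the codec the text wrapper uses to turn the downloaded bytes into a string. cp950 is the Windows code page for Traditional Chinese, so it is presumably the locale default on this machine. A minimal offline sketch of the same kind of failure, assuming the response is UTF-8 encoded (the sample string and the helper `try_decode` are made up for illustration and are not part of Biopython):

```python
# UTF-8 bytes for a sample string; "ł" is encoded as the two bytes 0xC5 0x82.
raw = "Wrocław".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: {}".format(exc)


utf8_result = try_decode(raw, "utf-8")    # the codec the bytes were written in
cp950_result = try_decode(raw, "cp950")   # the codec reported by the handle

# ascii() keeps the output printable on any console encoding.
print(ascii(utf8_result))
print(ascii(cp950_result))
```

Decoding with the matching codec succeeds, while decoding the same bytes as cp950 is expected to fail in the same way as the traceback above.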

I have already searched many pages for a solution, but none of them work. Can anyone help?

python-3.x biopython pubmed

user7744979, Mar 21 '17 at 11:34

1 answer

Answer 1 · user7657104, Mar 22 '17 at 14:38

Your code works fine for me. This is an encoding problem on your side. You can open the file in binary mode and encode the text as UTF-8. You could try a workaround like this:

for ids in idlist:    # Download each abstract by its ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")    # Retmode can be txt / json / xml / csv
    with open("{}.txt".format(ids), "wb") as f:    # Binary mode; the file is closed automatically
        f.write(handle.read().encode("utf-8"))    # Encode the text explicitly as UTF-8
    handle.close()    # Release the network connection
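Since the reported error is a decode error, it may also help to make the decoding step explicit instead of relying on the locale default. The following is only a sketch under stated assumptions: it assumes the response is UTF-8 and that its raw bytes are available as a binary file-like object. How to obtain such an object from Entrez.efetch depends on the Biopython version and is not shown here, so an in-memory stand-in with made-up content is used. The helpers `read_as_utf8` and `save_text` are hypothetical names, not Biopython functions.

```python
import io
import os
import tempfile


def read_as_utf8(binary_handle):
    """Decode a binary file-like object as UTF-8, ignoring the locale default."""
    wrapper = io.TextIOWrapper(binary_handle, encoding="utf-8")
    return wrapper.read()


def save_text(text, path):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# In-memory stand-in for the bytes a server might send (hypothetical content).
fake_response = io.BytesIO("Contact allergy study from Wrocław.".encode("utf-8"))
abstract = read_as_utf8(fake_response)

# Write to a temporary directory so the example leaves the working directory clean.
out_path = os.path.join(tempfile.mkdtemp(), "example.txt")
save_text(abstract, out_path)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

Passing the encoding explicitly on both the read and the write side means the result no longer depends on the console or system code page.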