Using BeautifulSoup to find an HTML tag that contains certain text

Question

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would look something like this:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I can get all the text that matches (see the line above). But I want the parent element of the matching text, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to be returned, not the text matches.

Any ideas?

python regex beautifulsoup html-content-extraction

user85271, May 14 '09 at 21:46

3 Answers

Solution

BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Inspect the object's __dict__ to see the attributes available to you. Of these attributes, parent is preferred over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
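
The listing above targets Python 2 and BeautifulSoup 3. As a rough sketch of the same idea under Beautiful Soup 4 on Python 3, assuming bs4 4.4.0 or later (where the text= argument is also available under the name string=) and the built-in "html.parser" backend, one could search for the strings first and then step up to their parents:

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# "html.parser" is an assumption made here to avoid extra dependencies.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# With only string= given, find_all returns NavigableString objects.
matches = soup.find_all(string=pattern)

# .parent climbs from each matched string to the tag that contains it.
parents = [s.parent for s in matches]

for tag in parents:
    print(tag.name, tag.get_text())
```

Each item in parents is a Tag, so it can serve as the starting point for further navigation of the tree.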


user117471, Nov 12 '12 at 18:05

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
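
One caveat, offered as a hedged sketch rather than as part of the answer above: when a tag name is combined with a string filter, Beautiful Soup 4 tests the pattern against the tag's .string, which is None when the tag has more than one child. A heading with nested markup is then skipped. The sample HTML and the helper name h2_with_id below are hypothetical, and "html.parser" is an assumed parser choice; the function filter is just one possible workaround.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is <b>bold</b> #12345678902</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Name plus string filter: the pattern is tested against tag.string,
# which is None for the second <h2> because it has several children.
simple = soup.find_all("h2", string=pattern)


def h2_with_id(tag):
    """Hypothetical filter: match <h2> tags on their full text content."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None


# A function filter is called for every tag and can inspect the combined text.
robust = soup.find_all(h2_with_id)

print(len(simple), len(robust))
```

The function filter costs a get_text() call per h2, which is usually acceptable for documents of ordinary size.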


user3358599, Jan 20 '18 at 20:17

Accepted Answer · user17160, May 14 '09 at 21:53

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
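
A side note on the pattern itself: \S{11} is not anchored at its end, so it also matches IDs longer than eleven characters, which is why the 12-digit samples in this thread match. If exactly eleven characters are wanted, a negative lookahead is one option. The sketch below uses only the standard re module; the names loose, strict, and samples are illustrative and do not come from the answers.

```python
import re

loose = re.compile(r" #\S{11}")         # 11 or more non-space characters
strict = re.compile(r" #\S{11}(?!\S)")  # exactly 11, then whitespace or end

samples = [
    "this is cool #12345678901",          # 11 characters after '#'
    "this is interesting #126666678901",  # 12 characters after '#'
    "this is nothing",
]

for text in samples:
    print(bool(loose.search(text)), bool(strict.search(text)), text)
```

Either compiled pattern can be passed to the BeautifulSoup searches shown above in place of the original one.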