Newspaper (Python): get all CNN news URLs

Question

For example, take this URL: https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555

In the HTML file I can find this link (HTML tag):

<div class="cnn-search__result-thumbnail">
  <a href="https://www.cnn.com/2018/03/27/asia/north-korea-kim-jong-un-china-visit/index.html">
    <img src="./Search CNN - Videos, Pictures, and News - CNN.com_files/180328104116china-xi-kim-story-body.jpg">
  </a>

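but with this code

    import newspaper

    url = "https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555"
    # memoize_articles=False disables newspaper's cache of already-seen articles
    cnn_paper = newspaper.build(url, memoize_articles=False)
    for article in cnn_paper.articles:
        print(article.url)

I cannot find the link to the news article.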

https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555
https://edition.cnn.com/search/?q=%20news&size=10&from=5550&page=556

Both of these URLs return the same links.

python html python-newspaper

user8269694, Aug 02 '18 at 02:20

2 Answers

Answer 1 · May 28 '21 at 00:54

The search results are rendered dynamically from JSON returned by a separate request: https://search.api.cnn.io/content?q=news&size=50&from=0

The size parameter can be at most 50.

    import requests

    res = requests.get("https://search.api.cnn.io/content?q=news&size=50&from=0")
    # Each entry in the 'result' list carries the article URL under 'url'
    links = [x['url'] for x in res.json()['result']]
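
Since one request returns at most 50 results, collecting more URLs means stepping the from offset. The following is only a sketch of that idea, assuming the endpoint still answers with a 'result' list whose entries have a 'url' key, as described above. The names build_query_url, fetch_page and collect_urls are hypothetical helpers, and the standard library's urllib is used in place of requests so the sketch has no dependencies.

```python
import json
import urllib.parse
import urllib.request

# Endpoint quoted in the answer above; its response shape is an assumption.
API_URL = "https://search.api.cnn.io/content"


def build_query_url(query, size, offset):
    """Return the search API URL for one page of results."""
    params = urllib.parse.urlencode({"q": query, "size": size, "from": offset})
    return f"{API_URL}?{params}"


def fetch_page(query, size, offset):
    """Fetch one page and return its list of result dicts (network call)."""
    url = build_query_url(query, size, offset)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp).get("result", [])


def collect_urls(query, max_results, size=50, fetch=fetch_page):
    """Step the 'from' offset by 'size' and gather up to max_results unique URLs."""
    urls = []
    seen = set()
    for offset in range(0, max_results, size):
        page = fetch(query, size, offset)
        if not page:  # an empty page means there is nothing left to read
            break
        for item in page:
            url = item.get("url")
            if url and url not in seen:  # skip entries without a URL and repeats
                seen.add(url)
                urls.append(url)
    return urls[:max_results]
```

Calling collect_urls("news", 200) would issue up to four requests with from=0, 50, 100 and 150. The fetch parameter exists so the paging logic can be exercised with a stub instead of the live API.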

Answer 2 · user7420623, Aug 07 '18 at 03:36

Does this do what you want?

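from bs4 import BeautifulSoup
import urllib.request

# The original wrapped this in a loop whose variable was never used,
# so the same page was fetched twice; a single request is enough.
resp = urllib.request.urlopen("https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555")
# Name the parser explicitly and pass the charset from the HTTP headers
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))

# Print the target of every anchor that has an href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])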

Or maybe this?

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

resp = requests.get("https://edition.cnn.com/search/?q=%20news&size=10&from=5540&page=555")
# Trust the HTTP encoding only when the Content-Type header declares a charset
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
# Otherwise fall back to an encoding declared inside the HTML itself
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, "html.parser", from_encoding=encoding)

# Print the href of each anchor rather than the whole tag
for link in soup.find_all('a', href=True):
    print(link['href'])
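
Both snippets print every link on the page. If the rendered HTML is available, for example a page saved from the browser as in the question, the output could be narrowed to the search results only. The following is a rough sketch of that, assuming result links always sit inside a div with the class cnn-search__result-thumbnail shown in the question. ThumbnailLinkParser and extract_result_links are hypothetical names, and the standard library's html.parser stands in for BeautifulSoup here.

```python
from html.parser import HTMLParser


class ThumbnailLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside result-thumbnail <div>s."""

    # Class name taken from the HTML snippet in the question.
    TARGET_CLASS = "cnn-search__result-thumbnail"

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0  # greater than 0 while inside a target <div>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Count nested divs too, so the matching </div> is tracked correctly
            if self._depth or self.TARGET_CLASS in classes:
                self._depth += 1
        elif tag == "a" and self._depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


def extract_result_links(html):
    """Return the hrefs found inside result-thumbnail divs of an HTML string."""
    parser = ThumbnailLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

This only helps when the result markup is actually present in the HTML being parsed; links outside the thumbnail divs, such as navigation, are ignored.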