Очистка поиска Google с помощью BeautifulSoup
Я хотел очистить несколько страниц поиска Google. До сих пор мне удавалось скрести только первую страницу, но как я мог сделать это для нескольких страниц.
from bs4 import BeautifulSoup
import requests
import urllib.request
import re
from collections import Counter
def search(query):
url = "http://www.google.com/search?q="+query
text = []
final_text = []
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text,"html.parser")
for desc in soup.find_all("span",{"class":"st"}):
text.append(desc.text)
for title in soup.find_all("h3",attrs={"class":"r"}):
text.append(title.text)
for string in text:
string = re.sub("[^A-Za-z ]","",string)
final_text.append(string)
count_text = ' '.join(final_text)
res = Counter(count_text.split())
keyword_Count = dict(sorted(res.items(), key=lambda x: (-x[1], x[0])))
for x,y in keyword_Count.items():
print(x ," : ",y)
search("girl")
3 ответа
url = "http://www.google.com/search?q=" + query + "&start=" + str((page - 1) * 10)
Код ниже выполняет фактическую разбивку на страницы через ссылку кнопки «Далее».
from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml
def print_extracted_data_from_url(url):
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'lxml')
print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
print(f'Current URL: {url}')
print()
for container in soup.findAll('div', class_='tF2Cxc'):
head_text = container.find('h3', class_='LC20lb DKV0Md').text
head_sum = container.find('div', class_='IsZvec').text
head_link = container.a['href']
print(head_text)
print(head_sum)
print(head_link)
print()
return soup.select_one('a#pnnext')
def scrape():
next_page_node = print_extracted_data_from_url(
'https://www.google.com/search?hl=en-US&q=coca cola')
while next_page_node is not None:
next_page_url = urllib.parse.urljoin('https://www.google.com', next_page_node['href'])
next_page_node = print_extracted_data_from_url(next_page_url)
scrape()
Часть вывода:
Results via beautifulsoup
Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola
The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.Careers · Contact Us · Jobs at Coca-Cola · Our Company
https://www.coca-colacompany.com/home
Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/
Кроме того, вы можете сделать это с помощью API результатов поисковой системы Google от SerpApi. Это платный API с бесплатной пробной версией 5000 поисковых запросов.
Код для интеграции:
import os
from serpapi import GoogleSearch
def scrape():
params = {
"engine": "google",
"q": "coca cola",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
print(f"Current page: {results['serpapi_pagination']['current']}")
for result in results["organic_results"]:
print(f"Title: {result['title']}\nLink: {result['link']}\n")
while 'next' in results['serpapi_pagination']:
search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
results = search.get_dict()
print(f"Current page: {results['serpapi_pagination']['current']}")
for result in results["organic_results"]:
print(f"Title: {result['title']}\nLink: {result['link']}\n")
scrape()
Часть вывода:
Results from SerpApi
Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola
The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.Careers · Contact Us · Jobs at Coca-Cola · Our Company
https://www.coca-colacompany.com/home
Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/
Отказ от ответственности, я работаю на SerpApi.
Как комментарий выше, вам нужно URL следующей страницы и поместить код в цикл
def search(query):
url = "https://www.google.com/search?hl=en&q=" + query
while url:
text = []
....
....
for x,y in keyword_Count.items():
print(x ," : ",y)
# get next page url
url = soup.find('a', id='pnnext')
if url:
url = 'https://www.google.com/' + url['href']
else:
print('no next page, loop ended')
break
Делать soup.find('a', id='pnnext')
для работы вам может потребоваться установить user-agent для запросов