Python: how to create nested dictionaries from lists with specific values

Apologies in advance for the long post, but I have made sure it is clear and easy to follow.

My question is:

How can I create a nested dictionary from lists with specified duplicate keys?

Here is an example of what I would like to do, using data for a fictional news article:

{'http://www.SomeNewsWebsite.com/Article12345': 
 {'Title': 'Trump Does Another Ridiculous Thing', 
  'Source': 'Some News Website', 
  'Thumbnail': 'SomeNewsWebsite.com/image12345'}} 

Reading similar posts, I have seen people do similar things, but I have struggled to carry those ideas over into my own work.

That is the end of my question. Below I have posted my code and the sample lists it generates, which I would use to create this nested dictionary. It is also available on my Github.

So far, I can use the following code to fetch the data, cut out the important bits, and then build two lists: one for URLs, one for headlines. It then uses zip to combine them into a tidy dictionary.

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.reuters.com"

source = "Reuters"

thumbnail = "http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png"


def soup():
    """ Fetches HTML from site and turns it into a bs4 object. """
    get_html = requests.get(url)
    html = get_html.text
    make_soup = BeautifulSoup(html, 'html.parser')
    return make_soup


# Tell bs4 where to find the important information (headlines, URLs)
important_data = (soup().select(".story-content > .story-title > a"))


# Turn that important data into a string so it may be parsed using RegEx
stringed_data = ' || '.join(str(v) for v in important_data)


def get_headline():
    """ Uses Regular Expressions to find headlines. Returns a list. """
    headline = re.findall(r'(?<=">)(.*?)(?=</a>)', stringed_data)
    return headline


def get_link():
    """ Uses Regular Expressions to find links. Returns a list. """
    link = re.findall(r'(?<=<a href=")(.*?)(?=")', stringed_data)
    return link

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = dict(zip(get_headline(), full_urls))
    return full_urls

get_link()
get_headline()
soup()
build_dict()

When run, this code creates two lists and then a dictionary. Sample data is shown below:

List of titles:(29 items long)
['Trump strikes defiant tone ahead of debate', 'Matthew swamps North Carolina, still dangerous as it heads out to sea', "Tesla's Musk says will not have to raise funds in fourth-quarter", 'Suspect arrested in fatal shooting of two California police officers', 'Russia says U.S. actions threaten its national security', 'Western-backed coalition under pressure over Yemen raid', "Fed's Fischer says job gains solid, expects growth to pick up", "Thai king's condition unstable after hemodialysis treatment: palace", 'Pope names new group of cardinals, adding to potential successors', 'Palestinian kills two people in Jerusalem, then shot dead: police', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'", 'Earnings season begins as White House race heats up', 'Russia expects OPEC to ask non members to consider joining output curb', 'Banks ponder the meaning of life as Deutsche agonizes', 'IMF says still engaged with Greece, no decision yet on bailout role', 'Pound slump exacerbates Brexit impact for German exporters: DIHK', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources', 'Ukraine military postpones withdrawal from town, cites rebel shelling', 'German police make new raid in hunt for refugee planning bomb attack', "South African President Zuma's rape accuser dies: family", 'Xi says China must speed up plans for domestic network technology', 'UberEats to expand to Berlin in 2017: Tagesspiegel', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services', 'Pressure on Trump likely to be intense at second debate with Clinton', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.", 'Evangelical leaders stick with Trump, focus on defeating Clinton', 'Citi sells its Argentinian consumer business to Banco Santander', "Itaú to pay $220 million for Citigroup's Brazil assets", 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico']


List of URLs: (29 items long)
['/article/us-usa-election-idUSKCN1290JZ', '/article/us-storm-matthew-idUSKCN129063', '/article/us-tesla-equity-solarcity-idUSKCN1290QW', '/article/us-california-police-shooting-idUSKCN1280YH', '/article/us-russia-usa-idUSKCN1290DP', '/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', '/article/us-usa-fed-fischer-idUSKCN1290JB', '/article/us-thailand-king-idUSKCN1290R8', '/article/us-pope-cardinals-idUSKCN1290C9', '/article/us-israel-palestinians-violence-idUSKCN129070', '/article/us-society-entertainment-film-idUSKCN127229', '/article/us-usa-stocks-weekahead-idUSKCN1272HS', '/article/us-oil-opec-russia-idUSKCN1290KD', '/article/us-imf-g20-banks-idUSKCN1290DX', '/article/us-imf-g20-greece-idUSKCN1290R6', '/article/us-britain-eu-germany-idUSKCN1290TZ', '/article/us-oil-opec-istanbul-idUSKCN1290N2', '/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', '/article/us-germany-bomb-idUSKCN1290D2', '/article/us-safrica-zuma-idUSKCN1290SX', '/article/us-china-internet-security-idUSKCN1290LA', '/article/us-uber-germany-eats-idUSKCN1290OB', '/article/us-china-regulations-ride-hailing-idUSKCN1280EL', '/article/us-usa-election-debate-idUSKCN1290AS', '/article/us-usa-election-clinton-idUSKCN1280Z9', '/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', '/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', '/article/us-citibank-brasil-m-a-itau-unibco-hldg-idUSKCN1280HM', '/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU']

Dictionary of titles and URLs: (29 items long)
{'Banks ponder the meaning of life as Deutsche agonizes': 'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX', 'German police make new raid in hunt for refugee planning bomb attack': 'http://www.reuters.com/article/us-germany-bomb-idUSKCN1290D2', 'Suspect arrested in fatal shooting of two California police officers': 'http://www.reuters.com/article/us-california-police-shooting-idUSKCN1280YH', 'Evangelical leaders stick with Trump, focus on defeating Clinton': 'http://www.reuters.com/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', 'Xi says China must speed up plans for domestic network technology': 'http://www.reuters.com/article/us-china-internet-security-idUSKCN1290LA', "Australia's Rinehart and China's Shanghai CRED agree on deal for Kidman cattle empire": 'http://www.reuters.com/article/us-australia-china-landsale-dakang-p-f-idUSKCN12908O', 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico': 'http://www.reuters.com/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU', 'Citi sells Argentinian consumer unit a day after Brazil sale': 'http://www.reuters.com/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services': 'http://www.reuters.com/article/us-china-regulations-ride-hailing-idUSKCN1280EL', 'Pope names new group of cardinals, adding to potential successors': 'http://www.reuters.com/article/us-pope-cardinals-idUSKCN1290C9', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'": 'http://www.reuters.com/article/us-society-entertainment-film-idUSKCN127229', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources': 'http://www.reuters.com/article/us-oil-opec-istanbul-idUSKCN1290N2', "South African President Zuma's rape accuser dies: family": 'http://www.reuters.com/article/us-safrica-zuma-idUSKCN1290SX', 'Palestinian kills two people in Jerusalem, then shot dead: police': 
'http://www.reuters.com/article/us-israel-palestinians-violence-idUSKCN129070', 'Matthew swamps North Carolina, still dangerous as it heads out to sea': 'http://www.reuters.com/article/us-storm-matthew-idUSKCN129063', 'Western-backed coalition under pressure over Yemen raid': 'http://www.reuters.com/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', 'Trump strikes defiant tone ahead of debate': 'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ', 'Russia says U.S. actions threaten its national security': 'http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP', 'Pressure on Trump likely to be intense at second debate with Clinton': 'http://www.reuters.com/article/us-usa-election-debate-idUSKCN1290AS', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.": 'http://www.reuters.com/article/us-usa-election-clinton-idUSKCN1280Z9', "Tesla's Musk says will not have to raise funds in fourth-quarter": 'http://www.reuters.com/article/us-tesla-equity-solarcity-idUSKCN1290QW', "Fed's Fischer says job gains solid, expects growth to pick up": 'http://www.reuters.com/article/us-usa-fed-fischer-idUSKCN1290JB', 'Ukraine military postpones withdrawal from town, cites rebel shelling': 'http://www.reuters.com/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', "Thai king's condition unstable after hemodialysis treatment: palace": 'http://www.reuters.com/article/us-thailand-king-idUSKCN1290R8', 'Earnings season begins as White House race heats up': 'http://www.reuters.com/article/us-usa-stocks-weekahead-idUSKCN1272HS', 'IMF says still engaged with Greece, no decision yet on bailout role': 'http://www.reuters.com/article/us-imf-g20-greece-idUSKCN1290R6', 'Pound slump exacerbates Brexit impact for German exporters: DIHK': 'http://www.reuters.com/article/us-britain-eu-germany-idUSKCN1290TZ', 'Russia expects OPEC to ask non members to consider joining output curb': 'http://www.reuters.com/article/us-oil-opec-russia-idUSKCN1290KD', 'UberEats to 
expand to Berlin in 2017: Tagesspiegel': 'http://www.reuters.com/article/us-uber-germany-eats-idUSKCN1290OB'}

To be clear, I would like to use this data to create a dictionary for each headline and URL pair, for example:

{'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX': 
 {'Title': 'Banks ponder the meaning of life as Deutsche agonizes',
  'Source': 'Reuters', 
  'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'}}

Thank you very much for taking the time to read this, and thanks in advance for any help.

2 answers

Consider a dictionary comprehension:

newsdict = {v: {'Title': k, 
                'Source': 'Reuters', 
                'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'} 
           for k, v in reuters_dictionary.items()}
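
A small self-contained sketch of the same comprehension, using two entries from the question's sample data as a stand-in for the scraped `reuters_dictionary` (the `source` and `thumbnail` names mirror the question's globals):

```python
# Stand-in for the scraped {headline: url} dictionary from the question.
reuters_dictionary = {
    'Banks ponder the meaning of life as Deutsche agonizes':
        'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX',
    'Trump strikes defiant tone ahead of debate':
        'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ',
}
source = 'Reuters'
thumbnail = 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'

# Swap the roles: each URL becomes the outer key, the headline moves inside.
newsdict = {
    link: {'Title': headline, 'Source': source, 'Thumbnail': thumbnail}
    for headline, link in reuters_dictionary.items()
}

print(newsdict['http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ']['Title'])
```

Because this route goes through a dictionary keyed by headline, two articles with an identical headline would already have collapsed into one entry before the comprehension runs.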

This should give you the result you want:

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = {}
    # Loop variable is named full_url so it does not shadow the global url
    for headline, full_url in zip(get_headline(), full_urls):
        reuters_dictionary[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail
        }
    return reuters_dictionary  # the question's version returned full_urls

However, there is nothing about duplicate keys here. Why do you feel you need duplicate keys?
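
For context on why duplicate keys are not an option: a Python dictionary holds one value per key, so assigning to an existing key replaces the old value. If several records really had to share a key, one common workaround is to store a list under it. A minimal sketch with `collections.defaultdict` and made-up example URLs:

```python
from collections import defaultdict

# A plain dict keeps only the last value assigned to a key.
plain = {}
plain['http://example.com/a'] = {'Title': 'First'}
plain['http://example.com/a'] = {'Title': 'Second'}  # replaces 'First'
print(len(plain))  # 1

# Storing a list per key keeps every record that shares the key.
grouped = defaultdict(list)
grouped['http://example.com/a'].append({'Title': 'First'})
grouped['http://example.com/a'].append({'Title': 'Second'})
print(len(grouped['http://example.com/a']))  # 2
```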

You should also probably refactor to remove those global variables.
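
One possible shape for that refactoring, as a sketch rather than the asker's actual code: the function receives the headlines, links, and site details as arguments. The name `build_news_dict` and its parameters are hypothetical, and `urllib.parse.urljoin` is swapped in for the `startswith('http')` check to resolve relative links.

```python
from urllib.parse import urljoin


def build_news_dict(headlines, links, base_url, source, thumbnail):
    """Return {full_url: {'Title', 'Source', 'Thumbnail'}} without globals."""
    news = {}
    for headline, link in zip(headlines, links):
        # urljoin leaves absolute URLs alone and resolves relative ones.
        full_url = urljoin(base_url, link)
        news[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail,
        }
    return news


result = build_news_dict(
    ['Trump strikes defiant tone ahead of debate'],
    ['/article/us-usa-election-idUSKCN1290JZ'],
    base_url='http://www.reuters.com',
    source='Reuters',
    thumbnail='http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png',
)
print(result)
```

Passing everything in explicitly also makes the function easy to test with hand-written lists, without fetching the page.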

Finally, if you are already using BeautifulSoup, why do you then fall back to regular expressions? I think using BeautifulSoup throughout should be more robust.
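
A sketch of what that could look like, run against a small hand-written HTML snippet rather than the live site. The markup below is an assumption modelled on the selector in the question, and the example needs the `beautifulsoup4` package installed:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; the real Reuters markup may differ.
html = """
<div class="story-content">
  <h3 class="story-title">
    <a href="/article/us-usa-election-idUSKCN1290JZ">Trump strikes defiant tone ahead of debate</a>
  </h3>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
anchors = soup.select('.story-content > .story-title > a')

# Read the text and href straight from each tag; no regex needed.
headlines = [a.get_text(strip=True) for a in anchors]
links = [a['href'] for a in anchors]

print(headlines)
print(links)
```

Reading both values from the same tag also keeps each headline paired with its own link, which two separate regex passes over a joined string cannot guarantee.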
