Label groups from Tukey test results according to significance

I have a Tukey test table produced by pairwise_tukeyhsd from Python's statsmodels.stats.multicomp:

    group1  group2  meandiff    lower   upper   reject
0   101     102     0.2917    -0.0425   0.6259  False
1   101     103     0.1571    -0.1649   0.4792  False
2   101     104     -0.1333   -0.4675   0.2009  False
3   101     105     0.0833    -0.2509   0.4175  False
4   101     106     -0.0500   -0.3626   0.2626  False
5   102     103     -0.1345   -0.4566   0.1875  False
6   102     104     -0.4250   -0.7592  -0.0908  True
7   102     105     -0.2083   -0.5425   0.1259  False
8   102     106     -0.3417   -0.6543  -0.0290  True
9   103     104     -0.2905   -0.6125   0.0316  False
10  103     105     -0.0738   -0.3959   0.2482  False
11  103     106     -0.2071   -0.5067   0.0924  False
12  104     105     0.2167    -0.1175   0.5509  False
13  104     106     0.0833    -0.2293   0.3960  False
14  105     106     -0.1333   -0.4460   0.1793  False

I have this table as a pandas DataFrame, and I would like to label the groups (101-106) with letters that reflect their statistical relationships. For this particular example the desired result would be the following (I don't mind whether the result is a DataFrame, a list, or a dictionary):

group   label
101      ab
102      a
103      ab
104      b
105      ab
106      b

As you can see, groups that share a letter do not differ significantly in their means (reject column = False), while groups with no letter in common (reject column = True) have significantly different means. For example, the mean of group 101 is not significantly different from that of any other group, because group 101 has the label ab and every other group has a, b, or ab. Group 106, on the other hand, has only the letter b, which indicates that it is similar to all groups except group 102, which has only the letter a.
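
The rule described above can be written down as a small check. This is only a sketch in plain Python: the helper name labelling_is_consistent is made up for illustration, and the two significant pairs are copied by hand from the reject column of the table.

```python
from itertools import combinations

# Pairs with reject == True in the Tukey table (copied by hand).
significant = {(102, 104), (102, 106)}

# The desired labelling from the question.
labels = {101: "ab", 102: "a", 103: "ab", 104: "b", 105: "ab", 106: "b"}


def labelling_is_consistent(labels, significant):
    """True if two groups share a letter exactly when they are NOT significantly different."""
    for g1, g2 in combinations(sorted(labels), 2):
        share_letter = bool(set(labels[g1]) & set(labels[g2]))
        differ = (g1, g2) in significant or (g2, g1) in significant
        if share_letter == differ:
            # Either a shared letter on a significant pair,
            # or no shared letter on a non-significant pair.
            return False
    return True


print(labelling_is_consistent(labels, significant))  # True
```

A check like this is handy for validating whatever labelling a script produces against the reject column.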

I could not find an automated Python solution for this. I have seen that R has multcompLetters for this (in the multcompView package). Is there something similar in Python?

Answers

Solution

So, after spending a couple of days on this, and with no answers or comments from other users, I think I have figured it out. Suppose the table from my question is called df. The following script was written for my own needs, but I hope it can help others. I added comments to make it easier to follow.

    df_True = df.loc[df.reject==True,:]

    letters = list(string.ascii_lowercase)
    n = 0

    group1_list = df_True.group1.tolist() #get the groups from the df with only True (True df) to a list
    group2_list = df_True.group2.tolist()
    group3 = group1_list+group2_list #concat both lists
    group4 = list(set(group3)) #get unique items from the list
    group5 = [str(i) for i in group4 ] #convert unicode to a str
    group5.sort() #sort the list

    gen = ((i, 0) for i in group5) #create dict with 0 so the dict won't be empty when starts
    dictionary = dict(gen)

    group6 = [(group5[i],group5[j]) for i in range(len(group5)) for j in range(i+1, len(group5))] #get all combination pairs
    for pairs in group6: #check for each combination if it is present in df_True

        print n
        print dictionary

        # rows of df_True matching this pair; empty when the pair is not significantly different
        a = df_True.loc[(df_True.group1==pairs[0])&(df_True.group2==pairs[1]),:]

        if a.shape[0] == 0: # empty: the pair does not appear in df_True, so the two groups are treated as equal

            print 'equal'

            if dictionary[pairs[0]] != 0 and dictionary[pairs[1]] == 0: #if the 1st is populated but the 2nd is not populated
                print "1st is populated and 2nd is empty"      
                dictionary[pairs[1]] = dictionary[pairs[0]]

            elif dictionary[pairs[0]] != 0 and dictionary[pairs[1]] != 0: #if both are populated, check for matching labels
                print "both are populated"
                if len(list(set([c for c in dictionary[pairs[0]] if c in dictionary[pairs[1]]]))) >0: #check if they have a common label
                        print "they have a shared character"
                else:
                    print "equal but have different labels"
                    # check whether either label is shared with any other group; if neither is, the 2nd group can simply take the 1st group's letter
                    m = 0 # number of other groups that share no character with the 1st group
                    j = 0 # number of other groups that share no character with the 2nd group
                    for key, value in dictionary.iteritems():
                        if key != pairs[0] and len(list(set([c for c in dictionary[pairs[0]] if c in value])))==0:
                            m+=1
                    for key, value in dictionary.iteritems():
                        if key != pairs[1] and len(list(set([c for c in dictionary[pairs[1]] if c in value])))==0:
                            j+=1
                    if m == len(dictionary)-1 and j == len(dictionary)-1: #it means that this value is unique because it has no shared char with another group
                        print "unique"
                        dictionary[pairs[1]] = dictionary[pairs[0]][0]

                    else:
                        print "there is at least one group in the dict that shares a char with the 1st group"                                 
                        dictionary[pairs[1]] = dictionary[pairs[1]] + dictionary[pairs[0]][0]


            else:  # if it equals 0, meaning if the 1st is empty (which means that the 2nd must be also empty)
                print "both are empty"
                dictionary[pairs[0]] = letters[n]
                dictionary[pairs[1]] = letters[n]

        else:

            print "not equal"

            if dictionary[pairs[0]] != 0: # if the first one is populated (has a value) then give a value only to the second 

                print '1st is populated'            
                # if the 2nd is not empty and they don't share a character then no change is needed as they already have different labels
                if dictionary[pairs[1]] != 0 and len(list(set([c for c in dictionary[pairs[0]] if c in dictionary[pairs[1]]]))) == 0: 
                    print "no change"

                elif dictionary[pairs[1]] == 0: #if the 2nd is not populated give it a new letter
                    dictionary[pairs[1]] = letters[n+1]

                # if the 2nd is populated and shares a letter with the 1st, give the 2nd a new letter and add that new letter to every other group that shared its original label
                elif  dictionary[pairs[1]] != 0 and len(list(set([c for c in dictionary[pairs[0]] if c in dictionary[pairs[1]]]))) > 0:
                    # the two labels share a character, so they must be separated
                    print "need to add a letter"
                    original_value = dictionary[pairs[1]]
                    dictionary[pairs[1]] = letters[n]
                    for key, value in dictionary.iteritems():
                        if key != pairs[0] and len(list(set([c for c in original_value if c in value])))>0: #for any given value, check if it had a character from the group that will get a new letter, if so, it means  that they are equal and thus the new letter should also appear in the value of the "old" group 
                            dictionary[key] = original_value + letters[n]  #add the original letter of the group to all the other groups it was similar to               

            else:

                print '1st is empty'
                dictionary[pairs[0]] = letters[n]
                dictionary[pairs[1]] = letters[n+1]
                print dictionary
            n+=1

    # get the letter out the dictionary

    labels = list(dictionary.values())
    labels1 = list(set(labels))
    labels1.sort()
    final_label = ''.join(labels1)

    for GroupName in group_names:
        if GroupName in dictionary:
            print "already exists"
        else:
            dictionary[GroupName] = final_label

    for key, value in dictionary.iteritems(): # keep only the unique letters of each group, in alphabetical order
        dictionary[key] =  ''.join(sorted(set(value)))

Thanks for your contribution. I had to modify your code slightly to fill in some missing pieces and adapt it to Python 3. The main changes were:

  • added the import lines (they were missing)
  • changed dictionary.iteritems to dictionary.items (Python 3)
  • converted every print "..." to print("...") (Python 3)
  • defined the group_names variable, which was missing
  • forced GroupName to be a str in the group_names loop
  • sorted the final dictionary into dict2

Your original data is in a CSV file, now called input2.csv.

,group1,group2,meandiff,lower,upper,reject
0,101,102,0.2917,-0.0425,0.6259,False
1,101,103,0.1571,-0.1649,0.4792,False
2,101,104,-0.1333,-0.4675,0.2009,False
3,101,105,0.0833,-0.2509,0.4175,False
4,101,106,-0.0500,-0.3626,0.2626,False
5,102,103,-0.1345,-0.4566,0.1875,False
6,102,104,-0.4250,-0.7592,-0.0908,True
7,102,105,-0.2083,-0.5425,0.1259,False
8,102,106,-0.3417,-0.6543,-0.0290,True
9,103,104,-0.2905,-0.6125,0.0316,False
10,103,105,-0.0738,-0.3959,0.2482,False
11,103,106,-0.2071,-0.5067,0.0924,False
12,104,105,0.2167,-0.1175,0.5509,False
13,104,106,0.0833,-0.2293,0.3960,False
14,105,106,-0.1333,-0.4460,0.1793,False
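
One thing worth checking when the table comes from a CSV: read_csv will most likely parse group1 and group2 as integers, while the script builds its dictionary keys and pair tuples with str(i). Comparing an integer column with a string label then matches no rows, so every pair would look "equal". The following is a minimal sketch of the issue, assuming pandas and using a three-row excerpt of the table inlined as a string instead of input2.csv.

```python
import io

import pandas as pd

# A three-row excerpt of the table, inlined so the sketch does not depend on input2.csv.
csv_text = """,group1,group2,meandiff,lower,upper,reject
5,102,103,-0.1345,-0.4566,0.1875,False
6,102,104,-0.4250,-0.7592,-0.0908,True
8,102,106,-0.3417,-0.6543,-0.0290,True
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.dtypes)  # inspect how the group columns and reject were parsed

# Comparing the parsed (integer) column with a string label matches no rows.
matches_before = int((df.group1 == "102").sum())

# Casting the group columns to str makes string labels comparable.
df = df.astype({"group1": str, "group2": str})
matches_after = int((df.group1 == "102").sum())

print(matches_before, matches_after)
```

Casting once, right after loading, keeps the rest of the script unchanged.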
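
import string

import pandas as pd

df = pd.read_csv('input2.csv', index_col=0)
# read_csv parses the group columns as integers; cast them to str so they
# compare equal to the string keys built below
df = df.astype({'group1': str, 'group2': str})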

df_True = df.loc[df.reject==True,:]
letters = list(string.ascii_lowercase)
n = 0

group1_list = df_True.group1.tolist() #get the groups from the df with only True (True df) to a list
group2_list = df_True.group2.tolist()
group3 = group1_list+group2_list #concat both lists
group4 = list(set(group3)) #get unique items from the list
group5 = [str(i) for i in group4 ] #convert unicode to a str
group5.sort() #sort the list

gen = ((i, 0) for i in group5) #create dict with 0 so the dict won't be empty when starts
dictionary = dict(gen)

group6 = [(group5[i],group5[j]) for i in range(len(group5)) for j in range(i+1, len(group5))] #get all combination pairs
for pairs in group6: #check for each combination if it is present in df_True

    print(n)
    print(dictionary)

    # rows of df_True matching this pair; empty when the pair is not significantly different
    a = df_True.loc[(df_True.group1==pairs[0])&(df_True.group2==pairs[1]),:]

    if a.shape[0] == 0: # empty: the pair does not appear in df_True, so the two groups are treated as equal

        print ('equal')

        if dictionary[pairs[0]] != 0 and dictionary[pairs[1]] == 0: #if the 1st is populated but the 2nd is not populated
            print ("1st is populated and 2nd is empty")
            dictionary[pairs[1]] = dictionary[pairs[0]]

        elif dictionary[pairs[0]] != 0 and dictionary[pairs[1]] != 0: #if both are populated, check for matching labels
            print ("both are populated")
            if len(list(set([c for c in dictionary[pairs[0]] if c in dictionary[pairs[1]]]))) >0: #check if they have a common label
                    print ("they have a shared character")
            else:
                print ("equal but have different labels")
                # check whether either label is shared with any other group; if neither is, the 2nd group can simply take the 1st group's letter
                m = 0 # number of other groups that share no character with the 1st group
                j = 0 # number of other groups that share no character with the 2nd group
                for key, value in dictionary.items():
                    if key != pairs[0] and len(list(set([c for c in dictionary[pairs[0]] if c in value])))==0:
                        m+=1
                for key, value in dictionary.items():
                    if key != pairs[1] and len(list(set([c for c in dictionary[pairs[1]] if c in value])))==0:
                        j+=1
                if m == len(dictionary)-1 and j == len(dictionary)-1: #it means that this value is unique because it has no shared char with another group
                    print ("unique")
                    dictionary[pairs[1]] = dictionary[pairs[0]][0]

                else:
                    print ("there is at least one group in the dict that shares a char with the 1st group")
                    dictionary[pairs[1]] = dictionary[pairs[1]] + dictionary[pairs[0]][0]


        else:  # if it equals 0, meaning if the 1st is empty (which means that the 2nd must be also empty)
            print ("both are empty")
            dictionary[pairs[0]] = letters[n]
            dictionary[pairs[1]] = letters[n]

    else:

        print ("not equal")

        if dictionary[pairs[0]] != 0: # if the first one is populated (has a value) then give a value only to the second 

            print ('1st is populated')
            # if the 2nd is not empty and they don't share a character then no change is needed as they already have different labels
            if dictionary[pairs[1]] != 0 and len(list(set([c for c in dictionary[pairs[0]] if c in dictionary[pairs[1]]]))) == 0:
                print ("no change")

            elif dictionary[pairs[1]] == 0: #if the 2nd is not populated give it a new letter
                dictionary[pairs[1]] = letters[n+1]

            # if the 2nd is populated and shares a letter with the 1st, give the 2nd a new letter and add that new letter to every other group that shared its original label
            elif  dictionary[pairs[1]] != 0 and len(list(set([c for c in dictionary[pairs[0]] if c in dictionary[pairs[1]]]))) > 0:
                # the two labels share a character, so they must be separated
                print ("need to add a letter")
                original_value = dictionary[pairs[1]]
                dictionary[pairs[1]] = letters[n]
                for key, value in dictionary.items():
                    if key != pairs[0] and len(list(set([c for c in original_value if c in value])))>0: #for any given value, check if it had a character from the group that will get a new letter, if so, it means  that they are equal and thus the new letter should also appear in the value of the "old" group 
                        dictionary[key] = original_value + letters[n]  #add the original letter of the group to all the other groups it was similar to               

        else:

            print ('1st is empty')
            dictionary[pairs[0]] = letters[n]
            dictionary[pairs[1]] = letters[n+1]
            print (dictionary)
        n+=1

# get the letter out the dictionary

labels = list(dictionary.values())
labels1 = list(set(labels))
labels1.sort()
final_label = ''.join(labels1)

df2=pd.concat([df.group1,df.group2])
group_names=df2.unique()
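for GroupName in group_names:
    if str(GroupName) in dictionary:  # the dictionary keys are strings, so compare as str
        print ("already exists")
    else:
        # groups that never appear in a significant pair get every letter
        dictionary[str(GroupName)] = final_label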

for key, value in dictionary.items(): # keep only the unique letters of each group, in alphabetical order
    dictionary[key] =  ''.join(sorted(set(value)))

dict2 = dict(sorted(dictionary.items())) # the final output
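
For comparison, here is a much shorter alternative sketch. It is not the script above and not the algorithm used by R's multcompLetters; it simply treats "not significantly different" as an edge between two groups and gives one letter to each maximal clique of that graph, found with a basic Bron-Kerbosch search. The function name letters_from_tukey is hypothetical, and the significant pairs are assumed to have been extracted already from the rows with reject == True.

```python
import string


def letters_from_tukey(groups, significant_pairs):
    """Assign letters so that two groups share a letter iff they are not significantly different."""
    groups = sorted(groups)
    different = {frozenset(pair) for pair in significant_pairs}
    # Neighbours of g: every other group that is NOT significantly different from g.
    neighbours = {
        g: {h for h in groups if h != g and frozenset((g, h)) not in different}
        for g in groups
    }

    cliques = []

    def bron_kerbosch(r, p, x):
        # r: current clique, p: candidates that can extend it, x: already explored vertices
        if not p and not x:
            cliques.append(sorted(r))  # r is a maximal clique
            return
        for v in sorted(p):
            bron_kerbosch(r | {v}, p & neighbours[v], x & neighbours[v])
            p = p - {v}
            x = x | {v}

    bron_kerbosch(set(), set(groups), set())
    cliques.sort()  # deterministic letter order

    labels = {g: "" for g in groups}
    # zip stops after 26 letters; more cliques than that would need a longer alphabet.
    for letter, clique in zip(string.ascii_lowercase, cliques):
        for g in clique:
            labels[g] += letter
    return labels


print(letters_from_tukey(range(101, 107), [(102, 104), (102, 106)]))
# {101: 'ab', 102: 'a', 103: 'ab', 104: 'b', 105: 'ab', 106: 'b'}
```

Maximal-clique enumeration can be slow for large graphs, but a Tukey table rarely has more than a few dozen groups, so this should be fine in practice.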
