Data slicing operation
I have a dataset in the form of a list of dictionaries named _deFV that I am trying to split, but I get an error message and I am not sure what it means.
Basically, I first split the data into 2 groups: initial_data (20%) and the remainder (80%). Then I want to split initial_data into two equal groups: _deTrain (50%) and _deEval (50%). After that, I want to split the remaining data into 4 equal groups. This is what _deFV looks like:
[{'indexID': 0, 'fv': [922, 547, 941, 560, 870, 553, 895, 571, 931, 550, 947, 563, 696, 557, 726, 578, 886, 553, 910, 569, 863, 569, 873, 588, 767, 545, 782, 566, 733, 544, 748, 566, 839, 553, 851, 573, 871, 559, 881, 578], 'out': 0}, {'indexID': 1, 'fv': [922, 547, 941, 560, 870, 553, 895, 571, 931, 550, 947, 563, 696, 557, 726, 578, 886, 553, 910, 569, 757, 541, 765, 565, 720, 540, 729, 565, 603, 539, 613, 568, 676, 552, 686, 580, 685, 542, 694, 567], 'out': 0}, {'indexID': 2, 'fv': [922, 547, 941, 560, 870, 553, 895, 571, 931, 550, 947, 563, 696, 557, 726, 578, 886, 553, 910, 569, 1371, 577, 1388, 593, 1474, 589, 1492, 606, 1301, 566, 1317, 582, 949, 545, 964, 560, 942, 545, 957, 560], 'out': 0}, {'indexID': 3, 'fv': [922, 547, 941, 560, 870, 553, 895, 571, 931, 550, 947, 563, 696, 557, 726, 578, 886, 553, 910, 569, 1347, 578, 1362, 588, 1448, 590, 1464, 601, 1277, 568, 1292, 578, 923, 549, 938, 560, 916, 549, 931, 559], 'out': 0}, {'indexID': 4, 'fv': [874, 550, 886, 572, 890, 550, 901, 573, 874, 550, 885, 573, 877, 555, 888, 575, 864, 551, 874, 571, 863, 569, 873, 588, 767, 545, 782, 566, 733, 544, 748, 566, 839, 553, 851, 573, 871, 559, 881, 578], 'out': 0}, {'indexID': 5, 'fv': [874, 550, 886, 572, 890, 550, 901, 573, 874, 550, 885, 573, 877, 555, 888, 575, 864, 551, 874, 571, 757, 541, 765, 565, 720, 540, 729, 565, 603, 539, 613, 568, 676, 552, 686, 580, 685, 542, 694, 567], 'out': 0}, {'indexID': 6, 'fv': [874, 550, 886, 572, 890, 550, 901, 573, 874, 550, 885, 573, 877, 555, 888, 575, 864, 551, 874, 571, 1371, 577, 1388, 593, 1474, 589, 1492, 606, 1301, 566, 1317, 582, 949, 545, 964, 560, 942, 545, 957, 560], 'out': 0}, {'indexID': 7, 'fv': [874, 550, 886, 572, 890, 550, 901, 573, 874, 550, 885, 573, 877, 555, 888, 575, 864, 551, 874, 571, 1347, 578, 1362, 588, 1448, 590, 1464, 601, 1277, 568, 1292, 578, 923, 549, 938, 560, 916, 549, 931, 559], 'out': 0}, {'indexID': 8, 'fv': [998, 548, 1006, 572, 1063, 553, 1071, 577, 973, 547, 981, 570, 1594, 584, 1604, 
609, 1539, 569, 1549, 593, 1794, 588, 1805, 610], 'out': 0}, {'indexID': 9, 'fv': [777, 585, 860, 645, 894, 565, 933, 595, 901, 551, 922, 568, 802, 578, 868, 627, 1111, 566, 1129, 582, 822, 576, 882, 621, 883, 564, 924, 596, 1289, 581, 1308, 596, 900, 546, 928, 568, 878, 551, 899, 568], 'out': 1}, {'indexID': 10, 'fv': [914, 584, 937, 605, 815, 557, 865, 601, 746, 605, 881, 696, 838, 560, 878, 595, 920, 565, 950, 592, 1804, 574, 1830, 589, 1784, 575, 1812, 590, 1804, 578, 1822, 592, 1783, 578, 1808, 592, 1825, 574, 1845, 588], 'out': 0}, {'indexID': 11, 'fv': [562, 623, 577, 677, 584, 608, 600, 688, 846, 554, 858, 577, 900, 566, 916, 605, 1077, 566, 1090, 598, 1804, 574, 1830, 589, 1784, 575, 1812, 590, 1804, 578, 1822, 592, 1783, 578, 1808, 592, 1825, 574, 1845, 588], 'out': 0}, {'indexID': 12, 'fv': [1222, 697, 1270, 759, 1138, 632, 1209, 764, 1717, 871, 1920, 1080, 1068, 664, 1153, 810, 261, 708, 499, 911, 1804, 574, 1830, 589, 1784, 575, 1812, 590, 1804, 578, 1822, 592, 1783, 578, 1808, 592, 1825, 574, 1845, 588], 'out': 0}, {'indexID': 13, 'fv': [838, 582, 852, 607, 773, 571, 791, 615, 851, 554, 860, 579, 830, 591, 844, 619, 641, 597, 672, 670, 1804, 574, 1830, 589, 1784, 575, 1812, 590, 1804, 578, 1822, 592, 1783, 578, 1808, 592, 1825, 574, 1845, 588], 'out': 0}, {'indexID': 14, 'fv': [692, 591, 716, 641, 633, 604, 662, 667, 245, 635, 317, 773, 669, 576, 697, 629, 610, 616, 645, 683, 1804, 574, 1830, 589, 1784, 575, 1812, 590, 1804, 578, 1822, 592, 1783, 578, 1808, 592, 1825, 574, 1845, 588], 'out': 0}, {'indexID': 15, 'fv': [447, 510, 457, 529, 262, 488, 273, 508, 21, 462, 36, 485, 1614, 583, 1625, 603, 1556, 566, 1567, 586, 1794, 588, 1805, 610], 'out': 0}, {'indexID': 16, 'fv': [1203, 561, 1220, 576, 1239, 551, 1250, 575, 1212, 551, 1224, 575, 1101, 553, 1117, 573, 1341, 555, 1355, 579, 863, 569, 873, 588, 767, 545, 782, 566, 733, 544, 748, 566, 839, 553, 851, 573, 871, 559, 881, 578], 'out': 0}, {'indexID': 17, 'fv': [1203, 561, 1220, 576, 1239, 551, 
1250, 575, 1212, 551, 1224, 575, 1101, 553, 1117, 573, 1341, 555, 1355, 579, 757, 541, 765, 565, 720, 540, 729, 565, 603, 539, 613, 568, 676, 552, 686, 580, 685, 542, 694, 567], 'out': 0}, {'indexID': 18, 'fv': [1203, 561, 1220, 576, 1239, 551, 1250, 575, 1212, 551, 1224, 575, 1101, 553, 1117, 573, 1341, 555, 1355, 579, 1371, 577, 1388, 593, 1474, 589, 1492, 606, 1301, 566, 1317, 582, 949, 545, 964, 560, 942, 545, 957, 560], 'out': 0}, {'indexID': 19, 'fv': [1203, 561, 1220, 576, 1239, 551, 1250, 575, 1212, 551, 1224, 575, 1101, 553, 1117, 573, 1341, 555, 1355, 579, 1347, 578, 1362, 588, 1448, 590, 1464, 601, 1277, 568, 1292, 578, 923, 549, 938, 560, 916, 549, 931, 559], 'out': 0}, {'indexID': 20, 'fv': [808, 547, 821, 567, 842, 548, 855, 568, 857, 569, 868, 587, 845, 551, 857, 571, 662, 556, 677, 579, 1371, 577, 1388, 593, 1474, 589, 1492, 606, 1301, 566, 1317, 582, 949, 545, 964, 560, 942, 545, 957, 560], 'out': 0}, {'indexID': 21, 'fv': [808, 547, 821, 567, 842, 548, 855, 568, 857, 569, 868, 587, 845, 551, 857, 571, 662, 556, 677, 579, 1347, 578, 1362, 588, 1448, 590, 1464, 601, 1277, 568, 1292, 578, 923, 549, 938, 560, 916, 549, 931, 559], 'out': 0}, {'indexID': 22, 'fv': [1713, 536, 1745, 557, 470, 579, 506, 611, 386, 562, 464, 611, 1766, 552, 1783, 567, 1785, 545, 1821, 567, 1877, 565, 1901, 582, 1912, 558, 1920, 578, 1880, 562, 1903, 578, 1888, 548, 1910, 565, 1851, 548, 1871, 564], 'out': 0}, {'indexID': 23, 'fv': [495, 524, 506, 555, 123, 478, 138, 515, 676, 547, 685, 573, 637, 550, 646, 578, 656, 552, 665, 580, 1371, 577, 1388, 593, 1474, 589, 1492, 606, 1301, 566, 1317, 582, 949, 545, 964, 560, 942, 545, 957, 560], 'out': 0}, {'indexID': 24, 'fv': [495, 524, 506, 555, 123, 478, 138, 515, 676, 547, 685, 573, 637, 550, 646, 578, 656, 552, 665, 580, 1347, 578, 1362, 588, 1448, 590, 1464, 601, 1277, 568, 1292, 578, 923, 549, 938, 560, 916, 549, 931, 559], 'out': 0}]
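For reference, the intended split (20% / 80%, then 50/50, then four groups) can be sketched roughly as below. This is only a simplified illustration, not the code from my class: the function name split_ids, its parameters, and the seed argument are hypothetical, it ignores the clean/dirty grouping, and it assumes that groups differing in size by one element are acceptable when the counts do not divide evenly.

```python
import random


def split_ids(ids, initial_ratio=0.2, n_groups=4, seed=None):
    """Hypothetical helper: split IDs into train, eval and n_groups remainder groups."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)

    # First cut: initial data (20% by default) versus the remainder.
    n_initial = int(len(shuffled) * initial_ratio)
    initial, remaining = shuffled[:n_initial], shuffled[n_initial:]

    # Second cut: initial data into two halves (eval gets the extra item if odd).
    half = len(initial) // 2
    train, eval_ = initial[:half], initial[half:]

    # Third cut: deal the remainder into n_groups groups by striding,
    # which always yields exactly n_groups lists.
    groups = [remaining[i::n_groups] for i in range(n_groups)]
    return train, eval_, groups


# 25 IDs, matching the number of records shown in _deFV above.
train_ids, eval_ids, remain_groups = split_ids(range(25), seed=0)
```

With 25 records this gives 5 initial IDs (2 for training, 3 for evaluation) and 20 remaining IDs in four groups of 5.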
These are the functions I wrote to split the data:
def split_data_new(self, fvList, meta, remaining_initialtest=0.8, de_eval=0.5, remain_split=4):
    # group the record IDs by their output label
    _uniqueOut = {}
    for f in fvList:
        if f['out'] == -1:
            _key = 'unknown'
        else:
            _key = meta['userClean'][f['out']]
        if _key not in _uniqueOut:
            _uniqueOut[_key] = []
        _uniqueOut[_key].append(f['indexID'])
    # split remaining data and initial test data
    _remainingID, _initialtestID = [], []
    _remainingID, _initialtestID = self.__split_data_operation(_uniqueOut['clean'], remaining_initialtest, _remainingID, _initialtestID)
    _remainingID, _initialtestID = self.__split_data_operation(_uniqueOut['dirty'], remaining_initialtest, _remainingID, _initialtestID)
    # split initial test data into train and eval
    _trainID, _evalID = [], []
    _trainID, _evalID = self.__split_data_operation(_initialtestID, de_eval, _trainID, _evalID)
    # split remaining data into groups
    _remainID1 = []
    _remainID2 = []
    _remainID3 = []
    _remainID4 = []
    _remainID1, _remainID2, _remainID3, _remainID4 = self.__spit_list_into_group(_remainingID, remain_split, _remainID1, _remainID2, _remainID3, _remainID4)
    meta['trainID'] = _trainID
    meta['evalID'] = _evalID
    meta['_remainID1'] = _remainID1
    meta['_remainID2'] = _remainID2
    meta['_remainID3'] = _remainID3
    meta['_remainID4'] = _remainID4
    if 'unknown' in _uniqueOut:
        meta['remainID'] = _uniqueOut['unknown']
    return meta
def __split_data_operation(self, IDList, ratio, list1, list2):
    # requires `import random` at module level
    if len(list1) == 0:
        list1 = random.sample(IDList, int(len(IDList)*ratio))
    else:
        _list1 = random.sample(IDList, int(len(IDList)*ratio))
        list1 += _list1
    # everything not sampled into list1 goes to list2
    for d in IDList:
        if d not in list1 and d not in list2:
            list2.append(d)
    return list1, list2
def __spit_list_into_group(self, IDList, n, list1, list2, list3, list4):
    newlist = [IDList[i:i + n] for i in xrange(0, len(IDList), n)]
    # unzip list of lists into 4 lists
    list1, list2, list3, list4 = map(list, zip(*newlist))
    return list1, list2, list3, list4
This is the function call:
_deMeta = _detector.split_data_new(_deFV, _deMeta)
This is the error message I got:
Traceback (most recent call last):
  File "main.py", line 90, in <module>
    _deMeta = _detector.split_data_new(_deFV, _deMeta)
  File "/home/ruven/Documents/Data Cleansing/Data Cleansing algo/Test with data set 1/deModel.py", line 374, in split_data_new
    _remainID1, _remainID2, _remainID3, _remainID4 = self.__spit_list_into_group(_remainingID, remain_split, _remainID1, _remainID2, _remainID3, _remainID4)
  File "/home/ruven/Documents/Data Cleansing/Data Cleansing algo/Test with data set 1/deModel.py", line 110, in __spit_list_into_group
    list1, list2, list3, list4 = map(list, zip(*newlist))
ValueError: need more than 3 values to unpack
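The failing line can be reproduced in isolation. The sketch below assumes a stand-in list of 19 IDs (a length not divisible by 4, chosen only for illustration) in place of _remainingID, and uses range so that it also runs on Python 3, where the same failure is reported as "not enough values to unpack (expected 4, got 3)".

```python
# Hypothetical stand-in for _remainingID: 19 IDs, a length not divisible by 4.
ids = list(range(19))
n = 4

# Same chunking as in __spit_list_into_group: consecutive slices of length n.
chunks = [ids[i:i + n] for i in range(0, len(ids), n)]
chunk_sizes = [len(c) for c in chunks]  # the last chunk is shorter than n

# zip stops at the shortest chunk, so it yields fewer than 4 tuples here.
columns = [list(c) for c in zip(*chunks)]

try:
    list1, list2, list3, list4 = columns
    error = None
except ValueError as exc:
    # Unpacking fails because there are only 3 lists on the right-hand side.
    error = str(exc)
```

Here the chunks have sizes 4, 4, 4, 4 and 3, so zip(*chunks) produces only 3 lists and the four-way unpacking raises ValueError.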