Azure / R Server: create an XDF from a TSV/CSV file
I am trying to build a predictive model on Azure / R Server, but I can't figure out how to get the data into the right format, in particular how to convert a character variable into a factor variable. My data is currently in a CSV file, which I read as an RxTextData source. When I try to convert that to an XDF file (so that I can change the variable from character to factor), I get the following error message:
rxSetFileSystem(RxHdfsFileSystem())
# Point to the ADL store
myNameNode <- "adl://mydatalake.net"
myPort <- 0
# Location of the data
dataRoot <- "/result_output"
# Define Spark compute context
sparkCluster <- rxSparkConnect(consoleOutput=TRUE, nameNode=myNameNode, port=myPort, reset = T)
Parameter 'reset' is set to TRUE. Shutting down existing Spark applications (scaleR-spark-won2r0FHOA-sargow-60879-7D2545A13EC347E393FD0EDEA85F43B1).
It may take 1 to 2 minutes to launch a new Spark application.
# Set compute context
rxSetComputeContext(sparkCluster)
# Define HDFS file system
hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)
#---------------------------------------------------
inTxtFile <- file.path(dataRoot,"result.csv")
df <- RxTextData(inTxtFile, fileSystem = hdfsFS, firstRowIsColNames = T)
rxGetInfo(df, getVarInfo = TRUE)
====== ed0-myRserverr (Master HPA Process) has started run at Tue Nov 21 21:43:48 2017 ======
Picked up JAVA_TOOL_OPTIONS: -Xss4m
17/11/21 21:43:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
====== ed0-myRserverr (Master HPA Process) has completed run at Tue Nov 21 21:43:52 2017 ======
File name: /result_output/result.csv
Data Source: Text
Number of variables: 4
Variable information:
Var 1: 1, Type: integer
Var 2: 2010-01-01, Type: character
Var 3: TagName1, Type: character
Var 4: 30, Type: integer
head(df)
====== ed0-myRserverr (Master HPA Process) has started run at Tue Nov 21 21:44:00 2017 ======
Picked up JAVA_TOOL_OPTIONS: -Xss4m
17/11/21 21:44:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
====== ed0-myRserverr (Master HPA Process) has completed run at Tue Nov 21 21:44:16 2017 ======
1 2010-01-01 TagName1 30
1 2 2010-01-01 TagName2 5
2 2 2010-01-02 TagName2 7
3 2 2010-01-02 TagName3 6
4 3 2010-01-03 TagName2 15
5 1 2010-01-01 TagName2 2
6 1 2010-01-01 TagName3 1
#---------------------------------------------------
outXdfFile <- file.path(dataRoot,"result_xdf.xdf")
dfOUT <- RxXdfData(outXdfFile, fileSystem = hdfsFS)
#---------------------------------------------------
rxImport(inData = df, outFile = dfOUT)
Error: /result_output/result_xdf.xdf has extension '.xdf', which is considered as single XDF and not supported in RxHadoopMR and RxSpark compute context
EDIT: The source data is generated here:
@t = SELECT *
FROM(
VALUES
( 1, "2010-01-01","TagName1", 30 ),
( 2, "2010-01-01","TagName2", 5 ),
( 2, "2010-01-02","TagName2", 7 ),
( 2, "2010-01-02","TagName3", 6 ),
( 3, "2010-01-03","TagName2", 15 ),
( 1, "2010-01-01","TagName2", 2 ),
( 1, "2010-01-01","TagName3", 1),
( 3, "2010-01-04","TagName1", 2 ),
( 3, "2010-01-04","TagName2", 4 )
) AS T(DeviceID, Date, TagName, dv);
OUTPUT @t
TO "result_output/result.csv"
USING Outputters.Csv(quoting:false);
EDIT_2:
> rxTextToXdf(inFile = df, outFile = dfOUT)
Error in rxClusterStop() :
The computeContext option is currently set to a distributed computing context. 'rxTextToXdf' calls
are not inherently distributable. You can, however, use 'rxExec' to process a call in the distributed
context. Please type '?rxExec' for more information. To execute the function locally, enter
rxOptions(computeContext = 'local')
EDIT_3:
> rxDataStep(inFile = df, outFile = dfOUT)
Error: /result_output/result_xdf.xdf has extension '.xdf', which is considered as single XDF and not supported in RxHadoopMR and RxSpark compute context.
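Going by the wording of the errors above, the '.xdf' extension itself seems to be what marks the output as a single-file XDF. A minimal sketch of one possible workaround follows, assuming that an output path without the extension is treated as a composite XDF directory under RxSpark, and that the factor type can be requested through colInfo when the text file is read. The names compositeXdfPath and importAsCompositeXdf are hypothetical helpers written for this illustration, not RevoScaleR functions, and the sketch has not been run against this cluster.

```r
# Hypothetical helper (not part of RevoScaleR): drop a trailing ".xdf" so the
# path names a directory, which is how a composite XDF set is stored.
compositeXdfPath <- function(path) {
  sub("\\.xdf$", "", path, ignore.case = TRUE)
}

# Hypothetical wrapper: read the text file with one column typed as a factor
# and import it into a composite XDF. It needs RevoScaleR and an active
# RxSpark compute context; nothing below runs until the function is called.
importAsCompositeXdf <- function(inTxtFile, outXdfFile, fileSystem, factorVar) {
  # Build colInfo = list(<factorVar> = list(type = "factor")).
  # factorVar must match the column name as the text reader sees it.
  colInfo <- list(list(type = "factor"))
  names(colInfo) <- factorVar

  txt <- RevoScaleR::RxTextData(inTxtFile,
                                fileSystem = fileSystem,
                                colInfo = colInfo)

  # Assumption: a path with no ".xdf" extension plus createCompositeSet = TRUE
  # is accepted as a composite XDF in a distributed compute context.
  xdf <- RevoScaleR::RxXdfData(compositeXdfPath(outXdfFile),
                               fileSystem = fileSystem,
                               createCompositeSet = TRUE)

  RevoScaleR::rxImport(inData = txt, outFile = xdf, overwrite = TRUE)
}

# Intended call with the variables defined earlier in this question:
# importAsCompositeXdf(inTxtFile, outXdfFile, hdfsFS, factorVar = "TagName")
```

The rxGetInfo output above lists the first data row (1, 2010-01-01, TagName1, 30) as variable names, which suggests the CSV has no header row, so "TagName" in the commented call would have to be replaced by whatever name the reader actually assigns to the third column.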