CSV to Cloudant Import Large data set

I am working on Nodejs based script to import large csv file into ibm cloudant database. My approach for this script is as following:

  • Script1: Download ZIP files from remote server using FTP connection.
  • Script1: Iterate over downloaded zip files and extract CSV files out of it.
  • Script1: Upload each CSV files to AWS S3 bucket using file read and write streams.
  • Script2: Connect to S3 bucket and iterate over each CSV file found over there at specific bucket location.
  • Script2: Read CSV file using stream and pipe it to convert CSV data in JSON format.
  • Script2: Connect to IBM cloudant database.
  • Script2: Take each JSON object one by one check for existing cloudant document by making GET request.
  • Script2: If document is found then update with JSON object, else create entry into cloudant as new document with separate API request for same.
  • Script2: Remove CSV file from S3 bucket if all CSV data got migrated to cloudant database successfully.

What works so far:

  • Downloading ZIP file from FTP extract CSV file from ZIP and upload it to S3 bucket.
  • Reading CSV file as stream and converting it to JSON object using npm package "csvtojson" works like charm if we just print each converted JSON object on terminal.
  • Connection to IBM cloudant database successful.

Problem being faced:

  • In a loop trying to check for existing records at cloudant for each JSON object, it just does not go beyond few records and ends up with memory / heap related error also it has issues with cloudant rate limit which gets saturated very quickly as the look up and import operations happens inside loop.

Exact error:

  • FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory.
  • Too Many Requests on IBM Cloudant.

Notes:

  • It is possible there will be multiple ZIP files on FTP server.
  • It is possible there will be multiple CSV files within ZIP files.
  • One of the CSV file that I have been trying to import is comprising of about 1.3 million records and files size is about 300MB.
  • There is no option for me to think of increasing rate limit at IBM Cloudant for this requirement.

Can someone please suggest best way to tackle this requirement without facing issues as i have mentioned above? Any alternative npm package to tackle this requirement is also most welcomed. Any suggestion to make change in my development approach is also most welcomed.

1 ответ

Предполагая, что ваши строки CSV отсортированы по идентификатору, вы можете сначала выполнить массовое получение:

Запрос:

POST /db/_all_docs HTTP/1.1
Accept: application/json
Content-Length: 70
Content-Type: application/json
Host: localhost:5984

{
    "keys" : [
        "a_id123",
        "b_id123",
        "c_id123",
        "d_id123",
    ]
}

Вы получите что-то вроде следующего:Ответ

{
    "total_rows" : 2666,
    "rows" : [
        {
            "value" : {
                "rev" : "1-a3544d296de19e6f5b932ea77d886942"
            },
            "id" : "a_id123",
            "key" : "a_id123"
        },
        {
            "value" : {
                "rev" : "1-91635098bfe7d40197a1b98d7ee085fc"
            },
            "id" : "b_id123",
            "key" : "b_id123"
        }
    ],
    "offset" : 0
}

Теперь вы можете просмотреть ответ и посмотреть, какие документы существуют, а какие нет.

После этого вы можете массово добавить несуществующие документы:

Запрос:

POST /db/_bulk_docs HTTP/1.1
Accept: application/json
Content-Length: 109
Content-Type:application/json
Host: localhost:5984

{
    "docs": [
        {
            "_id": "c_id123",
            "foo": "bar",
        },
        {
            "_id": "d_id123",
            "foo": "bar c",
        },
        {
            "_id": "a_id123",
            "_rev": "1-a3544d296de19e6f5b932ea77d886942",
            "foo": "updated",
        }
    ]
}

Документация для массового получения: https://docs.couchdb.org/en/stable/api/database/bulk-api.html

Документация для массового создания / обновления: https://docs.couchdb.org/en/stable/api/database/bulk-api.html

Другие вопросы по тегам