
Insert Large Data To Metadata Collection

Hi,


How can I insert a large amount of data into a metadata collection?


For example:

There is a list of words containing 100,000 or more word objects in different languages.

var word =  { "active":true, "word":"apple", "info":{}  }

var list (metadata) = [ { "active":true, "word":"apple", "info":{} }, { "active":false, "word":"orange", "info":{} }, { "active":true, "word":"banana", "info":{} }  .... ]

This list (the metadata collection) will be used for word search. If a word is found in the search and it is active, its info property can be used.
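The intended lookup could be sketched like this (a minimal Python sketch; the field names come from the example list above, while the contents of "info" are made-up placeholders):

```python
# Minimal sketch of the word lookup described above. Field names come from
# the example list; the contents of "info" here are made-up placeholders.
word_list = [
    {"active": True,  "word": "apple",  "info": {"definition": "a fruit"}},
    {"active": False, "word": "orange", "info": {"definition": "a fruit"}},
    {"active": True,  "word": "banana", "info": {}},
]

# Index the list by word once, so each search is a dict lookup rather
# than a scan over 100,000 entries.
index = {entry["word"]: entry for entry in word_list}

def lookup(word):
    """Return a word's info only if the word exists and is active."""
    entry = index.get(word)
    if entry is not None and entry["active"]:
        return entry["info"]
    return None

print(lookup("apple"))   # {'definition': 'a fruit'}
print(lookup("orange"))  # None: present but inactive
```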


Thanks.


Best Answer

Hi Adam,


Just to add to this, I've written a small Python program that can take JSON documents containing arrays of objects and insert them into a collection. I copied your process: I created 50 JSON files, filled them with some data, and executed the script. Within ~7 minutes I had inserted 18,000 documents into the collection.


Here's a sample of the data I inserted:

 

{
 "serverUsed": "gsp-aeu000-mo06.gsp-aeu:27017",
 "ns": "295886-game-preview.script.dictionary",
 "count": 252695,
 "size": 12129456,
 "avgObjSize": 48,
 "storageSize": 22507520,
 "numExtents": 7,
 "nindexes": 1,
 "lastExtentSize": 11325440,
 "paddingFactor": 1,
 "systemFlags": 1,
 "userFlags": 1,
 "totalIndexSize": 12125008,
 "indexSizes": {
  "_id_": 12125008
 },
 "ok": 1
}

 

Each JSON document had 360 entries in it.


Here is the Python script; it's not perfect, but it gets the job done:

 

import requests
#pip install requests
#http://docs.python-requests.org/en/latest/user/install/#install
import json
import os
from base64 import b64encode


#this is the array we'll use to store all the entries from the files
arrayOfDocuments = []

#Get a list of all files in our directory
files = os.listdir()
for jsonDocument in files:
	#We only want the json files
	if jsonDocument.endswith(".json"):
		print(jsonDocument)
		#open the json file, load it to a variable and append it to the array of documents we want to process
		with open(jsonDocument) as data_file:
			data = json.load(data_file)
			arrayOfDocuments.append(data)


#GameSparks User and pass
userPass = b64encode(b"username:password").decode("ascii")
#The gameId we wish to insert documents into
gameId = "295886pTJ8Xm"
#The stage we wish to insert our documents into
stage = "preview"
#The collection we wish to insert these documents into
collection = "script.dictionary"
#This is the URL for the mongo REST endpoint
postURL = "https://portal.gamesparks.net/rest/games/" + gameId + "/mongo/" + stage + "/" + collection + "/insert"
#Our generated auth token
auth =  {"Authorization" : "Basic " + userPass}

#We'll use this function to insert our documents
def batchInsert(documents):
	for document in documents:
		response = requests.post(postURL, headers = auth, data = {"document" : json.dumps(document)})
		#Only report success when the server actually accepted the insert
		if response.ok:
			print("insert successful")
		else:
			print("insert failed:", response.status_code)

print("starting")
batchInsert(arrayOfDocuments)
print("finished inserting")

 

So let's say you run the script as I've sent it over: put a batch of JSON documents in one folder along with the script, then execute it. While that's doing its thing, place another batch of files and a copy of the script in a second folder and execute that too; in 10 minutes you'll have significantly increased your throughput. I have personally tested this with 5 instances of the script running, each processing 50 JSON files (18,000 documents per batch of 50 files), and in ~7 minutes I had 90,000 documents inserted into the collection. That ain't half bad!
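Rather than copying the script into several folders by hand, the same fan-out could be sketched with a thread pool. This is a hypothetical illustration: insert_document stands in for the requests.post() call in the script above, and the batch size of 5 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the requests.post() call in the script above.
def insert_document(document):
    # In the real script this would be:
    # requests.post(postURL, headers=auth, data={"document": json.dumps(document)})
    return document["word"]

def chunk(items, size):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def insert_batch(batch):
    """Insert one batch of documents, like one running copy of the script."""
    return [insert_document(doc) for doc in batch]

documents = [{"word": "w%d" % i, "active": True, "info": {}} for i in range(20)]
batches = chunk(documents, 5)

# Run the batches concurrently, like running several copies of the script at once.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(insert_batch, batches))

print(len(batches), "batches inserted")  # 4 batches inserted
```

Because the work is network-bound, threads give roughly the same speed-up as the multiple-folder trick without the manual copying.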


Obviously this will vary based on the size of your documents and internet speed, but it's very doable!


I'm also attaching the exact files/folder I used for this so you can test it for yourself. 


Hope this helps,

Shane



No problem at all. Happy to help!


Regards, Patrick and Shane.

Hi Shane,


Patrick's first answer solves the main problem via the GameSparks portal, and your detailed answer speeds up the process by using the GameSparks REST API.


Thanks to both of you for your help.


Adam


Hi Adam,
That's to be expected; for security, we must have controls in place when it comes to uploading and downloading data. Collections of this nature are typically either grown over time or part of some of the larger projects here on GameSparks. As you can imagine, we deal with thousands of requests daily, so this is in place to ensure quality for all users, both paying and non-paying.
Just for reference, here is our fair usage policy, which might explain things better: http://www.gamesparks.com/fair-usage-policy/
To answer your question on an alternative:

GameSparks does not like large dumps of data (security controls). However, we can handle a large number of small requests no problem, so maybe you could do some parallel processing. Try multiple small uploads, for instance.
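Splitting one large JSON array into small, upload-sized files could be sketched like this (the filenames and the chunk size of 1,000 are assumptions, not anything GameSparks prescribes):

```python
import json

def split_json_array(entries, chunk_size, prefix="part"):
    """Write `entries` out as a series of small JSON files, chunk_size entries each."""
    filenames = []
    for i in range(0, len(entries), chunk_size):
        name = "%s_%03d.json" % (prefix, i // chunk_size)
        with open(name, "w") as f:
            json.dump(entries[i:i + chunk_size], f)
        filenames.append(name)
    return filenames

# 2,500 placeholder entries split into files of 1,000 gives three files,
# the last holding the 500-entry remainder.
entries = [{"word": "w%d" % i, "active": True, "info": {}} for i in range(2500)]
files = split_json_array(entries, 1000)
print(files)  # ['part_000.json', 'part_001.json', 'part_002.json']
```

Each resulting file is small enough to upload on its own, and several can be uploaded in parallel.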


Regards, Patrick.

Thanks Patrick, it is solved.


However, it is a very slow process. I have to partition the data into smaller parts. I used 1,000 entries in each partition because more than that failed by throwing an exception in the Every Minute script. For example, for 50,000 words I used 50 partitions and waited for 50 minutes. When the size of the entries becomes higher, the number of partitions will increase and the waiting time will be longer. Do you have any alternative, faster solutions for this issue?


Best regards.


Adam 

Yes, and the timeout is there to protect our infrastructure server side. If 40,000 is your upper limit, then paginate using that figure.


Make sure to update me, or if necessary feel free to ask further questions until this is resolved. But I'm pretty sure this is what you are looking for.

Regards, Patrick.

Hi Patrick,


The format of this data is JSON (a list of objects) kept in a .json file.

I uploaded that file to 'GameSparks Downloadables' and I can get that JSON (list of objects) by using Spark.downloadableJson(file) in Cloud Code.

If I try to insert each item in the list using the .insert(item) function, yes, it causes a timeout exception after a while (approximately 40,000 items are inserted successfully).


I'll try your suggestions (the Every Minute script and SparkCache) as soon as possible.


Thanks.

Hi Adam, 


First of all, what format is this data in? GameSparks uses JSON; anything stored in, for example, CSV or XML will first have to be converted.


Secondly, this is quite a substantial amount of data. The usual practice for adding this data to a collection is to use our Every Minute script (found in Cloud Code -> System -> Every Minute).

You can use this to paginate your entries, i.e. entering all 100,000 objects at once would cause a timeout, but it is possible to enter 1,000 a minute or thereabouts (you can test the upper limit and see what works for you).


The Cloud Code itself should use the .findAndModify() function to enter the data into the collection: https://docs.mongodb.org/manual/reference/method/db.collection.findAndModify/


Also, for the purpose of pagination, use SparkCache: https://docs.gamesparks.net/documentation/cloud-code-api/utils-cloud-code-api/sparkcache

This will allow you to reference the index of your last entry and continue from there in the next iteration of the loop.
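GameSparks Cloud Code is JavaScript, but the pagination pattern Patrick describes can be illustrated with a short Python sketch: a cursor persisted between scheduled runs, so each run inserts the next page and records where it stopped. The dict `cache` stands in for SparkCache and the list `collection` for the mongo collection; the page size of 1,000 matches the per-minute limit Adam found.

```python
# Illustration (in Python, not Cloud Code) of the pagination pattern described
# above: a cursor is persisted between scheduled runs so each run picks up
# where the last one stopped.
cache = {}        # stand-in for SparkCache
collection = []   # stand-in for the target mongo collection
entries = [{"word": "w%d" % i} for i in range(3500)]
PAGE_SIZE = 1000  # roughly the per-minute upper limit reported in this thread

def every_minute_run():
    """One scheduled run: insert the next page and advance the cursor."""
    start = cache.get("cursor", 0)
    page = entries[start:start + PAGE_SIZE]
    collection.extend(page)            # Cloud Code would insert each entry here
    cache["cursor"] = start + len(page)
    return len(page)

# Four simulated minutes drain 3,500 entries at 1,000 a minute;
# the fifth run finds nothing left and the loop stops.
while every_minute_run():
    pass
print(len(collection), cache["cursor"])  # 3500 3500
```

Because the cursor survives between runs, an upload interrupted partway simply resumes from the last recorded index on the next minute.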


Does that answer your question Adam? 


Regards, Patrick.



