Jun 8th 2013, 23:05:47
So I'm looking at starting a little project. It's very data intensive: it involves scraping a lot of websites, downloading a bunch of CSVs off SSL sites that require client-side certs, scraping and parsing various KML files, and storing all of this data in a database.
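To give a concrete picture of the client-cert part, here's roughly the shape of what I mean (a minimal sketch in Python with the requests library; the URL and cert paths are made-up placeholders):

```python
import csv
import io

import requests

# Placeholder URL and cert paths -- the real endpoints require a
# client-side certificate, so requests is given the cert/key pair.
CSV_URL = "https://example.com/data/export.csv"
CLIENT_CERT = ("/path/to/client.crt", "/path/to/client.key")

def fetch_csv_rows(url):
    """Download a CSV over HTTPS with client-cert auth and parse it."""
    resp = requests.get(url, cert=CLIENT_CERT, timeout=30)
    resp.raise_for_status()
    reader = csv.DictReader(io.StringIO(resp.text))
    return list(reader)

if __name__ == "__main__":
    rows = fetch_csv_rows(CSV_URL)
    print("fetched %d rows" % len(rows))
```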
There will be some data that is inserted just once, but the majority will be scraped repeatedly. Some sources change as often as every 5 seconds; most change every 5 minutes.
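To make that cadence concrete, something like this is what I have in mind for the polling loop (just a sketch; scrape_fast and scrape_slow stand in for the real jobs):

```python
import time

# Stand-ins for the real scrape jobs.
def scrape_fast():
    print("polling 5-second sources")

def scrape_slow():
    print("polling 5-minute sources")

FAST_INTERVAL = 5        # seconds
SLOW_INTERVAL = 5 * 60   # seconds

def run():
    last_slow = float("-inf")  # run the slow scrape on the first pass too
    while True:
        start = time.monotonic()
        scrape_fast()
        if start - last_slow >= SLOW_INTERVAL:
            scrape_slow()
            last_slow = start
        # Sleep out the remainder of the 5-second tick.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, FAST_INTERVAL - elapsed))

if __name__ == "__main__":
    run()
```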
I then have a few reports and models that will run on top of this data.
My questions are:
1) Any recommendations for web hosts I could use for this?
2) Any recommendations for a framework/language/database that would be ideal for this?
3) Anything you think I should know heading into this?
Keep in mind the volume of data isn't small: the initial load will be in excess of half a billion rows, with over half a million rows added daily.
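And on that initial load, I'm assuming row-at-a-time INSERTs won't cut it at that scale, so I'm picturing something like a bulk COPY (sketch below assumes Postgres via psycopg2; the connection string, table, and file name are all made up):

```python
import psycopg2

# Hypothetical connection string and table -- just to show the shape
# of a COPY-based bulk load instead of row-at-a-time INSERTs.
conn = psycopg2.connect("dbname=scraper user=scraper")

with conn, conn.cursor() as cur, open("initial_load.csv") as f:
    # copy_expert streams the file straight to the server, which is
    # orders of magnitude faster than issuing individual INSERTs.
    cur.copy_expert(
        "COPY observations FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
```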
Thanks!