EDge Game profile

Member
81

Jun 8th 2013, 23:05:47

So I'm looking at starting a little project. It is very data intensive: it involves scraping a lot of websites, downloading a bunch of CSVs off SSL sites requiring client-side certs, scraping and parsing various KML files, and storing all of this data in a database.

There will be some data that is inserted just once, but the majority of the data will be scraped frequently. Some sources change as often as every 5 seconds; the majority change every 5 minutes.

I then have a few reports and models that will run based on this data.

My questions are:

1) Any recommendations for webhosts that I can use for this?
2) Any recommendations for which framework/language/database would be ideal for this?
3) Anything you think I should know heading into this?

Keep in mind the volume of data isn't small. The initial load will include in excess of half a billion rows of data, with over half a million rows added daily.

Thanks!

ericownsyou5 Game profile

Member
1262

Jun 8th 2013, 23:51:56

Good Luck EDge


YESSSS

Syko_Killa Game profile

Member
5118

Jun 9th 2013, 0:17:59

good luck
Do as I say, not as I do.

TheORKINMan Game profile

Member
1305

Jun 9th 2013, 3:41:54

I'm just a Sr CS student but....the fluff are you trying to do?
Smarter than your average bear.

grimjoww Game profile

Member
961

Jun 9th 2013, 4:46:59

Good Luck EDge

whooze Game profile

Member
EE Patron
955

Jun 9th 2013, 7:50:48

EDge, wtf are you trying to do here?

That's fluffloads of data, man, and it will eat HD space...

A project this big I'd run on my own boxes: set up one machine as a database server, another as a webserver.

You really want to rent hosting for this? It ain't gonna be cheap, that's for sure.

When it comes to framework/database, it's impossible to say with this little info.

Cerberus Game profile

Member
EE Patron
3849

Jun 9th 2013, 8:45:12

Your best bet would be to set up your own server to handle the data load. Either Oracle or SQL Server should handle it easily.

Now, as for scraping data, that's outside my area of expertise. I figure you'd have a client machine exposed to the internet to gather the data and pass it over to the server on your own internal network. So there should also be some kind of firewall there.
I don't need anger management, people need to stop pissing me off!

EDge Game profile

Member
81

Jun 9th 2013, 9:10:29

The current setup that is working uses a separate SQL Server box and an app server. The high-frequency data is scraped by C# Windows processes, and the less frequent data by C# console apps launched with Windows Task Scheduler.

I would like to migrate all of this to a remote host, though, and was thinking I could use MySQL as the database, with Python scripts for the scrapes executed by crontab. I'm not all that familiar with any of this technology, but I figure basic data scrapes shouldn't require any real advanced programming in Python. The only thing I'll need to become familiar with is tuning the MySQL database, since it will have so much data.
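
For what it's worth, the CSV-over-SSL part shouldn't need anything beyond Python's standard library. A minimal sketch, where the client-cert paths, sample data, and column names are all made up for illustration:

```python
import csv
import io
import ssl

# Client-side cert setup for the SSL downloads (paths are placeholders):
ctx = ssl.create_default_context()
# ctx.load_cert_chain(certfile="client.pem", keyfile="client.key")
# ...then pass ctx to urllib.request.urlopen(url, context=ctx)

# Parsing a downloaded CSV body (sample data, hypothetical columns):
body = ("time,id,value\n"
        "2013-06-09T09:00:00,1,42.5\n"
        "2013-06-09T09:05:00,1,42.7\n")
reader = csv.DictReader(io.StringIO(body))
records = [(row["time"], int(row["id"]), float(row["value"]))
           for row in reader]
```

Each tuple then maps straight onto an INSERT into the target table.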

EDge Game profile

Member
81

Jun 9th 2013, 9:11:55

Also, I'm not sure how the 5-second scrapes will work with crontab, or if there's a better approach to take for those.
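
Since cron's finest scheduling granularity is one minute, the usual workaround is a single long-running process (or a cron job launched once a minute that loops internally). A sketch of such a loop, with the clock and sleep functions injectable so it can be dry-run without real waiting; `task` here is a stand-in for the actual scrape:

```python
import time

def run_polling_loop(task, interval=5.0, duration=60.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call task() every `interval` seconds for `duration` seconds,
    aligning each tick to start + n*interval so drift never accumulates."""
    start = clock()
    tick = 0
    while tick * interval < duration:
        next_run = start + tick * interval
        now = clock()
        if next_run > now:          # sleep only the remaining gap
            sleep(next_run - now)
        task()
        tick += 1
    return tick

# Dry run with a simulated clock (no real waiting):
t = [0.0]
calls = []
ticks = run_polling_loop(lambda: calls.append(t[0]),
                         interval=5.0, duration=60.0,
                         clock=lambda: t[0],
                         sleep=lambda d: t.__setitem__(0, t[0] + d))
```

Launched once a minute by cron with `duration=60`, this gives 5-second ticks without cron itself having to fire that often.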

iScode Game profile

Member
5720

Jun 9th 2013, 10:56:52

Good Luck EDge
iScode
God of War


DEATH TO SOV!

fazer Game profile

Member
630

Jun 9th 2013, 13:39:28

Edge, it would really depend on the data type / content type you are trying to scrape and where it is stored. You could run PHP scrapers which store directly to MySQL. Again, it's very little info; with that amount of data and frequency you'd want separate servers running SSDs, on the same network, for storing data to the MySQL DB.

- -

Fazer - MGP

"if somethings not fun, why do it?"


http://www.boxcarhosting.com/...pplication.php?clanID=MGP

fazer Game profile

Member
630

Jun 9th 2013, 13:40:07

Also, secondly: is the data being overwritten, or is it being added as new data?
- -

Fazer - MGP

"if somethings not fun, why do it?"


http://www.boxcarhosting.com/...pplication.php?clanID=MGP

EDge Game profile

Member
81

Jun 9th 2013, 14:44:26

Data will only be inserted. The majority of the tables will have 3 columns: datetime, int, decimal.

EDge Game profile

Member
81

Jun 9th 2013, 14:44:55

The primary key will be the datetime + int combo.
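
That schema is simple enough to sketch. An illustrative version using Python's built-in sqlite3 (the MySQL DDL would be nearly identical, with DATETIME and DECIMAL column types); the table and column names are made up, and the composite primary key enforces the insert-only, no-duplicates pattern described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        ts      TEXT    NOT NULL,  -- observation datetime (ISO 8601)
        node_id INTEGER NOT NULL,  -- hypothetical int key
        value   REAL    NOT NULL,  -- the decimal measurement
        PRIMARY KEY (ts, node_id)  -- the datetime + int combo
    )
""")
rows = [
    ("2013-06-10T14:00:00", 1, 27.35),
    ("2013-06-10T14:00:00", 2, 27.41),
    ("2013-06-10T14:05:00", 1, 27.39),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# Re-inserting the same (ts, node_id) pair is rejected by the PK:
try:
    conn.execute(
        "INSERT INTO readings VALUES ('2013-06-10T14:00:00', 1, 99.0)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```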

EDge Game profile

Member
81

Jun 9th 2013, 15:28:23

I'm looking at Dreamhost and Hostgator right now. Either shared hosting from either provider, since they offer unlimited disk space and bandwidth, or the VPS on Dreamhost, since it allows unlimited bandwidth/disk space.

Anyone have any experience with either provider, or see any issues with their plans vs what I'm planning to do?

The current servers both have 24-core AMD Opteron processors with 16 GB of RAM and RAID hard drives. I know performance-wise it will be a significant step down with these hosting plans, but I won't be doing nearly as much computing or as many queries off the webhost.

Azz Kikr Game profile

Wiki Mod
1520

Jun 9th 2013, 15:35:09

dreamhost has been pretty damn solid for me. haven't heard of hostgator.

EDge Game profile

Member
81

Jun 9th 2013, 22:46:08

Speed's pretty good with Dreamhost, Azz Kikr? How is their support team? Do you use them for apps, or plain HTML sites?

Xinhuan Game profile

Member
3728

Jun 10th 2013, 2:43:06

I've never heard of Hostgator myself.

My company uses Dreamhost and we haven't had a problem in 2 years. We use it for our website(s), and also as a DNS redirection to our actual game servers (Rackspace).

You can also consider Rackspace for the database. They have price plans up to 150 GB (of DB space), or alternatively mounted drives (1 TB) if you really need that much more DB space.

I think Amazon is a lot cheaper though.

Edited By: Xinhuan on Jun 10th 2013, 2:47:12
See Original Post

EDge Game profile

Member
81

Jun 10th 2013, 16:58:13

So, looking over one of the tables for 2012, there are about 270M rows of data using 17 GB of space in MS SQL Server. Based on that, I'm starting to think I might be able to get away without an "unlimited" plan.

Xin, why are you guys using Rackspace for production instead of Dreamhost? Better performance? Reliability? Speed? Or do they support products that Dreamhost doesn't?

Azz Kikr Game profile

Wiki Mod
1520

Jun 10th 2013, 20:20:14

edge: i haven't had need to deal with their support, so i don't really know on that one.
the speeds that i've encountered are pretty good, but i'm on a shared host, not a private host, so i don't know how that affects things, and the site i have hosted isn't a high throughput site.
i'm running a mysql db with a php website.

EDge Game profile

Member
81

Jun 10th 2013, 20:53:49

Thanks, I think I'll just sign up for a Dreamhost account.

Anyone have an idea for how I can get a scrape working that checks a site every few seconds for changes? Right now I have Windows system processes take care of that, but I need an alternative that will work on these Linux servers. Having a cron job that executes every few seconds doesn't seem optimal to me.
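
Whatever the scheduler, one pattern for the "check for changes" part is to keep a hash of the last response and only parse/store when it differs. A minimal sketch with the fetch stubbed out as a list of canned bodies (in practice it would be an HTTP GET, ideally a conditional one via ETag/If-Modified-Since where the server supports it):

```python
import hashlib

def content_changed(body, last_hash):
    """Return (changed?, new_hash) for a fetched response body."""
    digest = hashlib.sha256(body).hexdigest()
    return digest != last_hash, digest

# Simulated polling: only the third response differs from the second.
responses = [b"<data>1</data>", b"<data>1</data>", b"<data>2</data>"]
stored = []
last = ""
for body in responses:
    changed, last = content_changed(body, last)
    if changed:
        stored.append(body)   # parse + DB insert would happen here
```

This keeps the per-poll cost down to a fetch and a hash when nothing has changed.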

Azz Kikr Game profile

Wiki Mod
1520

Jun 10th 2013, 21:05:01

dreamhost has an online interface to set up cron jobs. they also allow you to create shell accounts if you prefer command line

beyond that, i don't know much about page scraping

Pang Game profile

Administrator
Game Development
5731

Jun 10th 2013, 22:08:06

You should probably reconsider using any shared hosting for this kind of thing, given the nature of how those sites/servers are managed. The host is apt to stop you from crawling, so make sure to check the TOS before you pay any money. I've run into issues putting content management systems onto budget shared plans (something like EE would never go on there, despite falling well under what should be required). If you're going to go with a provider like Dreamhost, Hostgator or Siteground (my hosting recommendation at the moment), then you're likely to need your own VPS, which will start at around $50/month and go up from there. But with that you have the autonomy and resources to actually set up your scraper.

As others have said, scraping is incredibly expensive in terms of space, bandwidth and CPU operations, and unless you go to some place like Amazon or Google that lets you create "use what you need" kinds of setups, you're likely going to run into major performance issues on any budget host. I've done a lot of scraping in a bootstrap sort of way (research), and I ended up building my own server to avoid having to go down that road with no budget.

In terms of writing the scraper.... that's a broad question :p
But here is some advice:
- Stay away from anything Microsoft
- MySQL will scale fine
- Consider making the system distributed (or at least distributable for scalability)
- I'd offer you language suggestions, but it's more about your comfort. EE does process handling (including running processes constantly) via PHP, but there's no shortage of ways to do it.
- Watch for asynchronously loaded content on crawled pages; consider crawling via a headless browser with JS so you get accurate content
- What are you actually trying to crawl?

Good Luck, EDge.
-=Pang=-
Earth Empires Staff
pangaea [at] earthempires [dot] com

Boxcar - Earth Empires Clan & Alliance Hosting
http://www.boxcarhosting.com

Pang Game profile

Administrator
Game Development
5731

Jun 10th 2013, 22:08:58

Originally posted by EDge:
Anyone have an idea for how I can get a scrape working that checks a site every few seconds for changes? Right now I have Windows system processes take care of that, but I need an alternative that will work on these Linux servers. Having a cron job that executes every few seconds doesn't seem optimal to me.


How many sites do you plan to be checking every few seconds? You're going to hit a bottleneck fast...
-=Pang=-
Earth Empires Staff
pangaea [at] earthempires [dot] com

Boxcar - Earth Empires Clan & Alliance Hosting
http://www.boxcarhosting.com

EDge Game profile

Member
81

Jun 11th 2013, 0:26:53

I was planning on using the shared unlimited hosts, or the Dreamhost VPS with unlimited disk/bandwidth, to begin with, just so I can gauge what kind of numbers I'll be looking at. After everything is up and running, I should have enough usage data to look for other options that provide better performance while still allowing enough bandwidth/disk space.

In terms of making the system distributed, the current setup has a different console app/windows process for each logically different scrape.

I have worked with PHP, Java, C/C++, C#, and VB in the past, but was planning to go with Python due to some libraries available for it which I plan to use for some of the modelling. However, if a specific language would be more efficient than another, I'm okay having the scrapes in one language and the models in another.

There will be maybe a dozen pages that are of the high frequency type; but if it does become a significant issue, I can just run something on one of my boxes at home to handle those. I was just hoping to have everything on a hosted server for security and accessibility.

Here's an example of something that I scrape:
www dot spp dot org/XML/LIP-Pricing dot xml

This is not one of the frequently updated sources, though; it's only revised every five minutes. And yes, I did replace . with dot, lol. I just don't need this thread showing up in a Google search result if a competitor searches the same URL.
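
For an XML feed like that, Python's standard library is again enough; no advanced programming needed. A sketch against a made-up snippet (the real feed's element and attribute names will differ, so treat these as placeholders):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's schema will differ.
sample = """<Prices>
  <Price node="NODE_A" time="2013-06-11T00:05:00">27.35</Price>
  <Price node="NODE_B" time="2013-06-11T00:05:00">27.41</Price>
</Prices>"""

root = ET.fromstring(sample)
# Each element maps onto the (datetime, key, decimal) row layout.
rows = [(p.get("time"), p.get("node"), float(p.text))
        for p in root.iter("Price")]
```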

galleri Game profile

Game Moderator
Primary, Express, Tourney, & FFA
14,315

Jun 11th 2013, 3:17:32

Good Luck EDge!


https://gyazo.com/...b3bb28dddf908cdbcfd162513

Kahuna: Ya you just wrote the fkn equation, not helping me at all. Lol n I hated algebra.