Matei Zaharia is a Romanian-Canadian computer scientist specializing in big data, distributed systems, and cloud computing. He is a co-founder and CTO of Databricks, and an assistant professor of computer science at the Massachusetts Institute of Technology. He created the Apache Spark project and co-created the Apache Mesos project during his PhD at UC Berkeley, and also designed the core scheduling algorithms used in Apache Hadoop, including the most widely used fair scheduler.
My short bio:
I started the Spark project during my PhD and have also worked closely with other open source projects in large-scale computing, including Apache Hadoop and Mesos. I'm now an assistant professor at MIT as well as CTO of Databricks, the startup company created by the Spark team. I plan to continue doing research and open source work in big data, so ask me anything on the topic. I'll be around until at least 11 AM Pacific.
My Proof: https://twitter.com/matei_zaharia/status/584022507159556096, https://twitter.com/matei_zaharia/status/584023965120749568
Bonus: In honor of Spark's fifth birthday, O'Reilly is also offering a discount code for Strata + Hadoop World in London: 20% off with code REDDIT20. You can see O'Reilly's other Spark materials at oreilly.com/go/spark.
Update 11:02 PT: I have to stop answering questions for this morning, but thanks everyone for participating. I'll come back to the thread later this afternoon and answer any remaining questions that people might add. Cheers!
Hi Matei, thank you for doing this AMA
I recently came across the Hadoop Summit talk proposal about G-OLA:
A couple of questions concerning this:
Is G-OLA the next evolution of BlinkDB, or will they both continue on as separate projects?
If separate, can you say much about what is next for BlinkDB? Or at least when we’ll get an update?
Is G-OLA owned by AMPlab or by Databricks?
Somewhat relatedly, what do you see as the future of the methods that return approximate results within org.apache.spark.rdd.RDD? Will Spark expand upon these?
G-OLA is the next step in the research that started with BlinkDB, and I believe the plan is to merge it into Spark SQL eventually. It's not "owned" by anyone since it's open source, but people from AMPLab and Databricks are both contributing to it.
Right now there's no plan to expand a lot on approximate methods in RDD but it's easier to do more of that in SQL since we understand the semantics of SQL expressions and operators better than user-defined functions.
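For readers who haven't used them, the approximate methods mentioned above already live directly on RDD. A minimal sketch (assuming a spark-shell session where `sc` is already defined; the data is made up):

```scala
// Approximate answers on an RDD; numbers and timings are illustrative.
val nums = sc.parallelize(1 to 1000000)

// countApprox returns a PartialResult whose bounds tighten as tasks finish;
// the first argument is a timeout in milliseconds, the second a confidence level.
val partial = nums.countApprox(200, 0.95)
println(partial.initialValue)

// countApproxDistinct estimates the number of distinct values (HyperLogLog);
// the argument is the target relative standard deviation of the estimate.
println(nums.map(_ % 1000).countApproxDistinct(0.05))
```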
How important was it to create Spark in Scala? Would it have been feasible / realistic to write it in Java or was Scala fundamental to Spark?
At the time we started, I really wanted a PL that supports a language-integrated interface (where people write functions inline, etc), because I thought that was the way people would want to program these applications after seeing research systems that had it (specifically Microsoft's DryadLINQ). However, I also wanted to be on the JVM in order to easily interact with the Hadoop filesystem and data formats for that. Scala was the only somewhat popular JVM language then that offered this kind of functional syntax and was also statically typed (letting us have some control over performance), so we chose that. Today there might be an argument to make the first version of the API in Java with Java 8, but we also benefitted from other aspects of Scala in Spark, like type inference, pattern matching, actor libraries, etc.
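As a rough illustration of that language-integrated style (a sketch, assuming a spark-shell session where `sc` exists; the path is a placeholder):

```scala
// Inline functions plus static typing: the compiler infers and checks the
// element type of each RDD in the pipeline.
val lines  = sc.textFile("hdfs:///logs/app.log")      // RDD[String]
val errors = lines.filter(_.contains("ERROR"))        // RDD[String]
val counts = errors.map(line => (line.split("\t")(0), 1))
                   .reduceByKey(_ + _)                // RDD[(String, Int)]
counts.take(10).foreach(println)
```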
Hey Matei,
What specific papers/internships/industrial experiences were crucial in the creation of Spark? Did you initially come up with the framework to demonstrate a native Mesos app?
I spent a lot of time talking with early Hadoop users, specifically Facebook (where I did an internship and then continued doing work part-time), Yahoo, and groups that wanted to use Hadoop at UC Berkeley. This is how we were able to see some common problems across these users (e.g. the need for iterative applications or interactive queries).
Initially, we designed Spark to demonstrate how to build a new, specialized framework on Mesos (specifically one that did machine learning). The idea was that by writing just a few hundred lines of code, and using all the scheduling and communication primitives in Mesos, you could get something that ran 100x faster than generic frameworks for a particular workload. After we did this though, we quickly moved to a goal that was in some ways the opposite, which was to design a multi-purpose framework that was a generalization of MapReduce, as opposed to something you'd only run alongside it. The reason is that we saw we could do a lot of other workloads beyond machine learning and that it was much easier for users to have a single API and engine to combine them in.
I have yet to learn Spark but plan to do so in the coming months, with the goal of working with large (well, duh) astronomical datasets. I plan on following the edX MOOC. Have any advice for a novice like myself?
You should try running Spark on your laptop even before the course using the steps at http://spark.apache.org/docs/latest/quick-start.html. There's also a book on it now that might be worth a look.
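For reference, the quick-start flow is roughly this (a sketch; run `./bin/spark-shell` from the Spark download, where `sc` is provided for you):

```scala
// Count and filter lines of the README that ships with Spark.
val readme = sc.textFile("README.md")
println(readme.count())                                // total lines
println(readme.filter(_.contains("Spark")).count())    // lines mentioning Spark
readme.cache()                                         // keep it in memory for repeated queries
```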
I love Spark--I think the computational model is fantastic--but I'm wondering: why is the Spark project not using Impala as its SQL engine?
One of our goals in Spark is to have a unified engine where you can combine SQL with other types of processing (e.g. custom functions in Scala / Python or machine learning libraries), so we can't use a separate engine just for SQL. However, Spark SQL uses a lot of the optimizations in modern analytical databases and Impala, such as columnar storage, code generation, etc.
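A small sketch of that kind of mixing, using the Spark 1.3-era SQLContext API (the case class, table name, and data here are made up for illustration):

```scala
import org.apache.spark.sql.SQLContext

case class Visit(user: String, durationSec: Int)

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
import sqlContext.implicits._

// Turn ordinary Scala objects into a table...
val visits = sc.parallelize(Seq(Visit("ann", 30), Visit("bob", 95))).toDF()
visits.registerTempTable("visits")

// ...then register a plain Scala function and call it from SQL.
sqlContext.udf.register("isLong", (d: Int) => d > 60)
sqlContext.sql("SELECT user FROM visits WHERE isLong(durationSec)")
          .collect().foreach(println)
```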
If you were to create a system like Apache Spark from scratch today, what would you do differently? For example would you still use a JVM language like Scala?
This is a pretty tough question. The JVM adds some performance overhead, but it also makes the software much easier to deploy (no nasty build process for each platform) and to debug. There are also many ways to call native code from the JVM and still get good performance for hot pieces of the code (e.g. matrix math). I'd probably still use the JVM but design things from the beginning to allow native code for many areas (we are doing this in Apache Spark but it's happened more organically as various things came up).
Some other things we didn't plan for a lot initially, but that came up after, were long-running operation (same app running for many months, which was important for Spark Streaming) and data sharing between users. We did a lot of work to enable the former, and for the latter there are some projects such as Tachyon and HDFS caching but we still need deeper integration with Spark.
When can we expect to get access to Databricks Cloud as announced at Spark Summit East?
If you were at Spark Summit East, you should've gotten an email about it already. If you weren't though, or somehow didn't get an email, fill in the form at http://databricks.com/registration, since we're giving people access using that too.
As someone who does most of my data science work in the "medium data" world, where it can be reasonably batched in-memory on one or a few machines, but whose work is tending toward bigger and bigger data sets, one of my big concerns about big data technology is fragmentation. When I was a grad student, you had maybe one or two big, popular ways to parallelize things across multiple machines - run a Beowulf cluster or maybe Condor, and either distribute batch processes which save to files or use MPI to communicate.
Now, though, it seems like new technologies, all doing slightly different versions of the same thing, pop up nearly weekly. What are your thoughts on what this fragmentation means to the industry, and how it will affect companies who are trying to navigate the buzzwords and settle on a technology that will provide meaningful benefit in the medium-long term? Obviously, you would think Spark would be a good choice for a lot of people, but how do you see things shaking out in the industry, and does Databricks have any strategies for combating the fragmentation?
The big data space is indeed one with lots of software projects, which can make it confusing to pick the tool(s) to use. I personally do think efforts will consolidate and I hope many will consolidate around Spark :). But one of the reasons why there were so many projects is that people built new computing engines for each problem they wanted to parallelize (e.g. graph problems, batch processing, SQL, etc). In Spark we sought to instead design a general engine that unifies these computing models and captures the common pieces that you'd have to rebuild for each one (e.g. fault tolerance, locality-aware scheduling, etc). We were fairly successful, in that we can generally match the performance of specialized engines but we save a lot of programming effort by reusing 90% of our code across these. So I do think that unified / common engines will reduce the amount of fragmentation here.
Apart from Databricks Cloud, what are your plans to monetize Databricks?
Our main source of revenue is Databricks Cloud, but we also have partnerships with Hadoop and NoSQL companies (e.g. MapR, Hortonworks, Datastax) to support Spark for their customers or integrate it better with their environments.
Hi Matei, thanks for doing this. How would you describe your experience in grad school? I finished my master's, and my research area is in the same field as your PhD, so I've been thinking about whether to continue in grad school or go work in industry, especially given the salaries CS graduates get paid in industry these days. Is a PhD worth it if someone wants to spend their career working on large-scale distributed systems?
P.S. Big fan of Apache Spark and made a few patches to PySpark!
I really liked grad school myself, but it's not necessarily for everyone. The best part is that you get to work on projects of your own choosing with really smart people. The downside is that you get paid very little and still have to convince people that your projects are worth pursuing. I'd say that if you are really passionate about research or about working at the boundary of a field, you should try a PhD, but if you aren't sure there is also a lot to learn in industry. Some people also go to industry for a few years and then come back.
Do you think there is room for direct competitors to MLLib on Spark? Because there are so many approaches to training, tuning, serving, etc machine learning, IR, and AI models, it seems like it would be very difficult to create one paradigm to fit everyone's ML/IR needs. How do you see machine learning on Spark evolving?
There might definitely be competing ML libraries, but more generally we're also happy to include more algorithms in MLlib itself (which has the benefit that your algorithm will be maintained and updated with the rest of Spark). The problem is that there are a lot of algorithms and ways to implement the same algorithm, so it's unlikely that we'll be able to cover everything. Right now we only put in algorithms that are very commonly used and whose parallel implementations are well-understood, so that we're sure we can maintain this algorithm for a long time into the future.
Edited to add: the ML pipeline API is also designed to let people plug in new algorithms, so I hope that many third-party libraries use this API to plug into existing MLlib apps.
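A hedged sketch of what plugging into that pipeline API looks like (spark.ml, 1.3-era; the column names and the training DataFrame `trainingDF` are illustrative):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Each stage implements the shared Transformer/Estimator interface, which is
// also the plug-in point for third-party algorithms.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(trainingDF)   // trainingDF: DataFrame with "text" and "label" columns
```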
What are the plans for Spark 2.0? (The real 2.0, not the mobile April 1st thing)
How about making the computation model optionally less lazy?
I write MapReduce jobs that run up to 3 days each. If I wrote them using Spark primitives they might run in only 1 day (because of partitioning). But the problem is that RDD laziness is not particularly ideal when you have really big data - you want to be sure that each piece of data is read from disk only once, because you are I/O bound.
The lazy API actually helps reduce work because it lets us see all the operations you plan to run on a dataset before creating an execution plan for it, and potentially pipeline them together into fewer map tasks, avoid loading some of the data, etc.
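A minimal sketch of that pipelining point (the path and parsing are placeholders): nothing below touches the input until the final action, so Spark can fuse the map and filter into a single pass over each partition.

```scala
val raw      = sc.textFile("hdfs:///events")       // no I/O happens here
val parsed   = raw.map(_.split(","))               // still no I/O
val filtered = parsed.filter(_(0) == "click")      // still no I/O
val clicks   = filtered.count()                    // one pass over the data, map + filter pipelined
```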
Personally I really hate changing public APIs, so I wouldn't want Spark 2.0 to break existing Spark programs. However one thing we might do is have the same API (RDD classes, etc) for Scala and Java, which would really simplify the work we have to do to maintain it eventually. And I'd also like to see DataFrames from Spark SQL be even more of a first-class concept because they are a much better way to represent data than raw Java/Python objects (they let us do a lot more compression, etc by understanding the format of the data).
Is your vision that Spark might eventually replace Hadoop (e.g. doing everything Hadoop does but better), or that it would be better to see it as a complement of Hadoop (e.g. improve on Hadoop in some areas but there is still room for Hadoop)?
Depends what you mean by Hadoop. The MapReduce part of Hadoop is likely to be replaced by new engines, and has already been replaced in various domains. However, "Hadoop" as a whole generally means an entire ecosystem of software, including HDFS and HBase for storage, data formats like Parquet and ORC, cluster schedulers like YARN, etc. Spark doesn't come with its own variants of those (e.g. it doesn't come with a distributed filesystem), so it's not meant to replace those, but rather it interoperates with them. We designed it from the start to fit into this ecosystem.
Felicitari (congratulations), Matei, for what you've done so far!
Questions:
How do you debug & profile Spark applications? Is there any tooling that we should be aware of (other than "ganglia" and the logs)?
Are you not spreading too thin with Spark? I mean, GraphX, MLLib, Streaming, BlinkDB [...] it's hard to know where Spark will really shine. I get it that you want to make it a "general-purpose" thing, but how do you handle the risk of making it eventually "usable by anyone, therefore useless to everyone"? I mean, I love the concept, but gosh, too many loose ends....
Hi, thanks for the questions.
> How do you debug & profile Spark applications? Is there any tooling that we should be aware of (other than "ganglia" and the logs)?
The "stack trace" button on the UI since 1.2 is very helpful for seeing at a glance where your application is spending time. Apart from that, I'd recommend running on subsets of the data on your laptop while building an app. Debugging distributed apps is unfortunately very hard, though this is one of the areas we're working most heavily on in Spark Core (in terms of enhancing the application UI visible at http://<driver>:4040).
> Are you not spreading too thin with Spark? I mean, GraphX, MLLib, Streaming, BlinkDB [...] it's hard to know where Spark will really shine. I get it that you want to make it a "general-purpose" thing, but how do you handle the risk of making it eventually "usable by anyone, therefore useless to everyone"? I mean, I love the concept, but gosh, too many loose ends....
I actually don't think we're spreading too thin because a) the community of people contributing to Spark is many times bigger than it used to be and b) building these things on Spark is much easier than building them from scratch. To give you a sense of the community size, for the first 3 years of Spark's development we had around 90 contributors total. Today the project gets 90 contributors per month, and it had close to 400 total contributors in 2014. Many people are contributing only to specific areas (e.g. the machine learning library) based on their needs, and so I think the community working on each one is large enough to tackle it. And the major benefit of having this type of large standard library for users is that they can often find algorithms for their problems out of the box.
In terms of my second point there, building a ML algorithm on Spark Core for example is much easier than writing a new distributed system that would have to handle parallelization, so this is why it's not as much effort to build these libraries as one might think. A lot of computing systems in the past have built much larger standard libraries (e.g. R, Matlab), and we'd like to offer a similar library for distributed processing.
Hi Matei. Thanks for taking the time to do this AMA!
I was wondering what the development plan is for Spark GraphX. I have been trying to learn how to optimize the GraphX Pregel API in hopes of speeding up large sparse matrix multiplication. Can you comment on what to keep in mind when optimizing applications built on top of GraphX? Thanks a lot again.
I'd suggest asking on the dev list for help on how to optimize specific algorithms. In GraphX, the lowest-level API is actually mapReduceTriplets, so using that might lead to somewhat better performance here.
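For context, a hedged sketch of what mapReduceTriplets looks like in GraphX 1.x (later releases steer people toward aggregateMessages instead); this toy example computes in-degrees by sending a 1 along each edge and summing at the destination:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
val graph = Graph.fromEdges(edges, 0)   // 0 is the default vertex attribute

val inDegrees: VertexRDD[Int] = graph.mapReduceTriplets[Int](
  triplet => Iterator((triplet.dstId, 1)),   // map: one message per edge
  (a, b) => a + b                            // reduce: sum messages per vertex
)
inDegrees.collect().foreach(println)
```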
Yes by Hadoop I mostly meant MapReduce/Hive/Pig, and basically the whole data processing layer. In your opinion, in a few years there would be no place for MapReduce anymore (aside maybe for some very specialized use cases), and the Spark processing layer would become the de facto standard?
In that case, I do think that Spark has a chance to replace those layers, though we'll have to see how it plays out. The main benefits of Spark over the current layers are 1) a unified programming model (you don't need to stitch together many different APIs and languages, you can call everything as functions in one language) and 2) performance optimizations that you can get from seeing the code for a complete pipeline and optimizing across the different processing functions. People also like 1) because it means far fewer tools to learn.
It seems there is a bit of confusion between Spark SQL, Hive on Spark, and Shark. They seem to have very different performance profiles, even though they all use Spark underneath. Would you be able to comment on the main differences, and whether some are better suited to certain use cases, or whether they can be interchanged easily?
Basically the major difference is that the Hive community wanted to change the existing Hive engine to run on Spark (in addition to MapReduce, Tez and other runtimes), whereas Spark SQL was an attempt to make a new SQL engine optimized only for Spark that can also access data sources beyond Hive. Spark SQL provides richer integration with the Spark API (e.g. through DataFrames, http://spark.apache.org/docs/latest/sql-programming-guide.html) and can also access non-Hive data sources (e.g. Cassandra, JDBC). In the long run though, it would be nice if Hive-on-Spark called into Spark SQL. In the short run, it was hard to refactor Hive to do that and the people there wanted to keep the existing code for query execution in Hive and just map it down at a lower level.
Your research has been stellar!
What do you have to say about systems research for an aspiring undergrad who wishes to go to grad school? Any suggestions on how to identify important problems to solve, or just research in general?
It seems to me MIT'S CSAIL has been focusing on lower-level systems work (multi-core, commutativity, file systems, PL, etc.) than what you've been up to (cloud, big data analytics). Why did you choose to join MIT? Do you plan to continue to innovate by combining these two styles?
If you'd like to go to grad school, I definitely suggest doing some kind of research in undergrad so that some professors get to know you. Find someone at your university whose research you like and talk to them. For systems research, internships where you do a lot of programming on some real systems are also good, as is working on open source stuff.
In terms of MIT, people work on all kinds of areas here, so even though there is some low-level OS work there is also great work for example in databases or programming models. I felt that there were a lot of people I could collaborate with, which has turned out to be true. It would indeed be awesome to do some projects that combine cloud computing with lower-level OS or optimization stuff.
Hi Matei,
I am looking forward to contributing to Spark. I have experience working with Spark and am currently digging into the depths of the codebase. Since the codebase has grown so much, I'm having a difficult time figuring out where to start. Can you please point out some bugs I could start with right away? I have gone through the JIRAs as well, but it looks confusing. It would be really helpful if you could assign something to me.
Great to hear that you're interested in contributing! I'd suggest looking through the starter JIRAs at https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) to find short tasks to do. Looking at new ones opened each day might also be a good way to find timely issues (as opposed to something 2-3 years old).
What is one thing that you know about Silicon Valley and startups that you did not know going in, and that could be helpful to people looking to start a tech company?
I guess I sort of knew this, but it's the importance of a team and the huge amount of time you'll have to spend recruiting (as you grow your company). If you do plan to hire people, expect to spend a lot of your time doing that as you ramp up, and really try to see which people from your network could help you. Having a great team is crucial to getting over any challenges the company will face. Luckily at Databricks we already had a great team that worked together in the past at Berkeley, so we were able to ramp up fairly quickly.
What is your favorite undocumented / least-known feature of Spark?
With Spark Streaming, is there an easy way to time values out of updateStateByKey? I.e., I want to use it as a hot cache of the values last seen in the past few hours - is that a good use case?
If I understand updateStateByKey correctly, it is per stream. If I wanted to have a kind of global hot cache, could I use an accumulator to build something like an add-only set (CRDT-like)?
Any tips for tuning Spark Streaming for high loads? I have watched various YouTube videos talking about Spark internals, and they say the biggest thing is having the operations in the right order to avoid data shuffles, etc. Anything specific for streams?
Thank you so much for this awesome AMA!!!!!
RDD.toDebugString gives you a nice printable version of your lineage graph. It is documented in Scaladoc and such but not that many people seem to use it.
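For example (in spark-shell, with a placeholder input file):

```scala
val words  = sc.textFile("README.md").flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString)   // prints the lineage: the chain of RDDs and shuffle boundaries
```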
For updateStateByKey, you can return None as the new value to drop a key's state. It should work for the use case you're mentioning. If you also wanted something like an accumulator, I'd suggest using DStream.transform instead to write custom code that operates on it. But if you have an API you'd like to propose for it, it would be great to hear about.
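A hedged sketch of that expire-by-returning-None pattern (the stream source, timeout, and state shape are all illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("checkpoint/")   // updateStateByKey requires a checkpoint directory

// Pretend each incoming line is a key; keep (lastValue, lastSeenMs) per key.
val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, line))

val hotCache = pairs.updateStateByKey[(String, Long)] {
  (newValues: Seq[String], state: Option[(String, Long)]) =>
    val now = System.currentTimeMillis()
    if (newValues.nonEmpty) Some((newValues.last, now))              // refresh on new data
    else state.filter { case (_, seen) => now - seen < 3600 * 1000 } // drop keys idle for an hour
}
hotCache.print()

ssc.start()
ssc.awaitTermination()
```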
Hi Matei, what are your expectations for the next generation of large-scale data processing platforms?
I think a few pieces of technology will play an important role:
What would be the next gen. of Spark?
One of the largest areas of focus now is rich standard libraries that match the ease of use of data processing libraries for "small data" (e.g. pandas, R, scikit-learn). We had some cool ones in this space recently added to Spark, including DataFrames and ML pipelines. The SparkR project for calling Spark from R is also likely to be merged soon.
Apart from that, I think there's still a bunch of interesting stuff to do in the core engine that will help all the libraries. One of the main areas we're working on at Databricks there is better debugging tools to let you see what your application is doing. We've also spent a bunch of effort on memory efficiency and network performance, and we plan to continue working on those.
Matei - thank you for your and Databricks' commitment to open-source Spark. Can you outline some of the fundamental challenges/limitations that still need to be solved for Spark?
From my point of view, there's always a lot of stuff that can be improved :). I think debugging tools for distributed processing systems in general could be a lot better, and I hope to see some work there. Basically, it should be easier to tell at a glance what the nodes in your program are doing, where they are spending the bulk of the time, how large each dataset you created is, whether there is skew in data sizes, etc. Another area we are working on is defining APIs that lead to better automatic optimizations and more compact storage of data. The new DataFrame API is a nice step towards this because it's still pretty general (you can easily use it to put together a computation) but it tells Spark more about the structure of your data and of the operations you want to run on it, so that Spark can do better optimizations (similar to the optimizations we do for SQL).
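To make that concrete, a small hedged sketch of the DataFrame style (assuming a `sqlContext` and a registered "events" table; the column names are made up). Because the expressions are built from column objects rather than opaque closures, Spark can see the structure of the computation and optimize it much like SQL.

```scala
val events  = sqlContext.table("events")
val summary = events.filter(events("status") === "error")   // structured predicate, not a closure
                    .groupBy("service")
                    .count()
summary.show()
```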
Recently there was a blog post about Spark 2.0 on Mobile devices.
Do you really see Spark coming on Mobile devices?
Stay tuned next April for an update on this project.
Are you a musician?
No, unfortunately I never learned to play anything!
Hi,
1. Do you have plans to release a Datacloud version that works on premises, or at least with CloudStack?
We're definitely looking at ways to deploy on private clouds, but we haven't announced anything in that area yet.
What is your daily routine? Specifically the first 3-4 hours of your day?
Wake up, have breakfast (the specific items vary but almost always include Nutella and some kind of tea), shower, check email, walk to work, and then it all depends on meetings and such during the day. In both a professor job and management roles at a startup (past a certain size) you spend a lot of time talking with people, but I also really like to code, so I like to sneak in intervals of time when I can do that.
(I know, efforts I've made at work are blocked by legal - there are advertisement/endorsement clauses which prevent use on our end. :( Are there any workarounds/suggestions in cases such as this?)
This is really hard to predict because a lot of open source contributors are technical and are more likely to contribute to technical tools (e.g. libraries, compilers, parallel processing engines, etc). Usually it is companies that take the final step of polishing some software for ordinary users and marketing it. However, OpenOffice does show that you can build really widely used open source consumer software, which is awesome. And many non-open-source apps do use open source libraries or server software (if they have a service) so in that way open source is having an impact on them.
Is it possible to separate the web part of Datacloud from the cluster management part and offer them as two separate products? Does it make sense?
This might be possible in a future version designed to run on other infrastructure (e.g. private clouds).
Why have you chosen Java serialization as the default serialization model? It's so inefficient that nobody is using it. It brings a lot of problems.
I agree, this might be worth changing in Spark 2.0. The reason was that it worked "out of the box" with more types of objects (especially the ones you might reference in closures, where the speed of serializing the closure doesn't matter much). There wasn't a super clear alternative we could force people to use but maybe we should've just chosen one and asked them to use it.
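For reference, the usual way around the default is to switch to Kryo via configuration; a sketch (the class here is a made-up example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, payload: Array[Byte])

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))   // optional, but makes serialized data smaller

val sc = new SparkContext(conf)
```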
What drew you to teach at MIT, specifically? Are there particular researchers you plan to collaborate with there?
There were just a lot of great people to potentially collaborate with. I'm sitting next to the PDOS group for operating systems and distributed systems, and on the same floor as Mike Stonebraker and Sam Madden for databases, among others. One other thing I liked was the proximity to medical researchers, which is interesting because I think it's a good application area for big data.
I am an R user and I'm aware of the SparkR development. What is its status?
The current plan is to merge SparkR into Apache Spark (the mainline project) this summer, ideally in time for Spark 1.4. SparkR already integrates fairly nicely with the core API and with DataFrames and part of MLlib. Once it's merged into the mainline project, it will be easier to keep it in sync and to make new APIs work on it. At Databricks at least we're very excited to add R in there and we expect to continue doing quite a bit of development on it. There are also several other companies and organizations contributing (UC Berkeley is leading it, but Intel and Alteryx are two other companies contributing).
[deleted]
They might be necessary as long as there are >2b people in the world.
Hi Matei, did you check out PrestoDB by Facebook? It seems it's a direct competitor to Spark SQL and has almost the same set of features.
The main difference between Spark SQL and systems like Presto is that Spark SQL is integrated into the full Spark engine, so it can also call more complex non-SQL code written in Spark (e.g. the machine learning library), and likewise you can call it inside a normal Spark program (e.g. run SQL on an RDD of objects you have there). The goal is to enable much richer integration between SQL and complex analytics. As far as I know none of the other SQL engines for Hadoop are doing that yet.
BTW this paper on Spark SQL talks a bit about the motivation of integrating SQL with the traditional Spark API.
Hi Matei, I have a couple different questions:
What are some specific research problems you're hoping to tackle during your first year at MIT?
When do you think we might see a "Parameter Server" feature in Spark? There's a Jira issue and a nice google doc comparing existing solutions, but it's hard to tell ... will we see this in 2015? Will the Spark team be exploring new Allreduce techniques (e.g. butterfly Allreduce from John Canny's group)?
How are you hoping to balance your time between Databricks and MIT? ... Do you think you'll be able to give more than 1-2 days per week of time to Databricks?
Thanks again for creating Spark! Very fond user.
> What are some specific research problems you're hoping to tackle during your first year at MIT?
I'm still figuring this out, but one interesting problem seen from the evolution of Spark is how to design composable higher-level libraries for big data. That is, when someone writes a program that combines, say, SQL queries and a machine learning algorithm, can the runtime apply optimizations across both of those.
> When do you think we might see a "Parameter Server" feature in Spark? There's a Jira issue and a nice google doc comparing existing solutions, but it's hard to tell ... will we see this in 2015? Will the Spark team be exploring new Allreduce techniques (e.g. butterfly Allreduce from John Canny's group)?
Yeah, people are looking at both things, though I don't know of any plans in the very short term. I'm personally quite interested in adding other communication primitives to Spark though.
> How are you hoping to balance your time between Databricks and MIT? ... Do you think you'll be able to give more than 1-2 days per week of time to Databricks?
The division of time has varied -- during the semester I can spend 20% of time consulting in industry, but in the summer I can be primarily at Databricks.
Is Mesos still relevant to the Spark project? Obviously the two share significant heritage but I've heard rumours that Spark is moving away from good Mesos integration. What's the alternative?
Yup, a fair number of people use Mesos and we continue to support it actively. Not sure why you'd hear otherwise.
Hi Matei!
What theoretical hurdles do you think need to be solved to advance scalable computation/machine learning? Michael I. Jordan discusses similar topics here (specifically, he discusses how statistical procedures could be designed when doing computations on massive datasets): http://www.cs.berkeley.edu/~jordan/papers/BEJSP17.pdf
Do you feel that structures such as the RDD are steps in the right direction?
How much theoretical research drives the construction of Spark?
Thank you.
I'm not a theoretician, but I think it's very interesting that theoreticians are looking at the tradeoff of how often you go over the data vs how quickly you learn something. I am curious to see how changes in algorithms will affect systems, e.g. as people move towards more asynchronous algorithms and things like that. One other area I'm interested in is making communication cost more of a concern in the design of these algorithms.
[deleted]
> wish
World peace! No wait, 64-bit array indices in the JVM! Sorry, I don't know which one's easier..
If you had to write Spark in a compiled language (compiled to C, C++, Rust, FORTRAN, etc.), what language would you have chosen and why?
What about in an interpreted language (Python, Ruby, JS, etc) and why?
Scala is a compiled language! Though it most commonly runs on the JVM rather than going directly to native code. Really I think the question is whether we'd want a non-GC language. I'm definitely interested in trying things like Rust for systems programming but I don't have enough experience yet to say whether it would be easier. So I think the safest bet would be good old-fashioned C++. I also like Nim but again I haven't yet used it.
In terms of a purely interpreted language, I don't think any of them are a great fit for systems software, and IMO they don't provide that much greater usability than things like Scala or C# for this kind of software (in all cases you'll be spending a lot of time dealing with I/O libraries and the like, as well as worrying about data representation and performance of things on the critical path). I've used JavaScript and Python the most but I can't say I'm a huge fan of any specific language. One of the main criteria to decide on is the available libraries and how likely other people are to know the language.