Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, the Gram--Schmidt algorithm, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra.
The proofs work for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth cost), we get lower bounds on the number of messages required to move it (latency cost).
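For concreteness, here is a representative form of such bounds; the notation ($G$ for the flop count, $M$ for the fast-memory size) is an assumption of this illustration, not defined in this excerpt. For a sequential algorithm performing $G$ flops of the kind covered by the proofs,
\[
\#\text{words moved} \;=\; \Omega\!\left(\frac{G}{\sqrt{M}}\right),
\qquad
\#\text{messages} \;=\; \Omega\!\left(\frac{G}{M^{3/2}}\right),
\]
where the latency bound follows from the bandwidth bound because a single message can carry at most $M$ words.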
We extend our lower-bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or whether we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems.
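As a hedged illustration of the composition question (again with assumed notation, and assuming $n^2 > M$ so the matrices do not fit in fast memory): computing $A^k$ for a dense $n \times n$ matrix by $k-1$ successive calls to a communication-optimal classical matrix multiplication moves
\[
\Theta\!\left((k-1)\,\frac{n^3}{\sqrt{M}}\right)
\]
words in total; the question is whether the composed computation obeys a matching lower bound, or whether interleaving the multiplications can provably move asymptotically fewer words.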
We point out recently designed algorithms that attain many of these lower bounds.
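As a minimal sketch of how an algorithm can attain such a bound, consider classical blocked matrix multiplication; this is a standard textbook example rather than one of the algorithms referenced above, and the names $n$, $b$, and $M$ are assumptions of this illustration. With block size $b \approx \sqrt{M/3}$, so that three $b \times b$ blocks fit in fast memory, it moves $O(n^3/\sqrt{M})$ words on a sequential machine, matching the bandwidth lower bound up to a constant factor.
```python
# Sketch (assumed names): classical blocked matrix multiplication.
# Each block update touches O(b^2) words and performs O(b^3) flops,
# so total data movement is O(n^3 / b) = O(n^3 / sqrt(M)) when
# b ~ sqrt(M/3).
import numpy as np

def blocked_matmul(A, B, b):
    """Multiply n-by-n matrices A and B using b-by-b blocks."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # numpy slices clip at the matrix boundary, so
                # n need not be divisible by b.
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

# Usage sketch:
# n, b = 1024, 64
# A, B = np.random.rand(n, n), np.random.rand(n, n)
# np.allclose(blocked_matmul(A, B, b), A @ B)  # True up to rounding
```
Choosing $b$ so that three blocks fit in fast memory is what turns the $O(b^2)$-words-per-$O(b^3)$-flops ratio of each block update into the $O(n^3/\sqrt{M})$ total.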