Once the data volume and compute load on a Hadoop cluster grow, you start to see slow jobs, timeouts and failed executions. At that point the cluster configuration needs tuning.
Adjusting the memory used by spark jobs
The configuration files live on the server where spark-bootstrap is run manually.
Default spark-job configuration file:
[root@emr-worker-6 ~]# cat /opt/${namespace}/spark-jobs/lib/spark-driver.properties
...(omitted)...
spark.sql.autoBroadcastJoinThreshold=-1
spark.executor.memory=8g
spark.driver.memory=2g
spark.executor.cores=4
spark.executor.instances=2
spark.default.parallelism=30
execution_timeout=36000
Job-specific configuration files, e.g. spark-customer-extractor:
[root@emr-worker-6 ~]# cat /opt/${namespace}/spark-customer-extractor/lib/spark-driver.properties
...(omitted)...
spark.sql.autoBroadcastJoinThreshold=-1
spark.executor.memory=8g
spark.driver.memory=4g
spark.executor.cores=4
spark.executor.instances=2
spark.default.parallelism=30
execution_timeout=36000
spark.sql.broadcastTimeout = 7200
...(omitted)...
If the settings above do not take effect, modify the spark-bootstrap configuration instead:
[root@emr-worker-6 ~]# cat /opt/${namespace}/application.yml
sparkJob:
  args:
    deployMode: "cluster"
    executorMemory: "8g"
    driverMemory: "4g"
    sparkJarBasePath: "/opt" # the spark jars live in the <env> subdirectory of this path, e.g. prod
    cdhVersion: "6.3.1"
...(omitted)...
Check the command spark-bootstrap uses to submit the spark job; note --executor-memory 2g --driver-memory 2g:
[root@emr-worker-4 ~]# tail -f nohup.spark.out | grep spark-submit
2023-02-07 11:25:12.768 INFO --- [eduler_Worker-1] spark.CliService : ==== executing command: spark-submit --class com.convertlab.spark.runner.SparkRunner --master yarn --name spark-tagging-single-map-test --deploy-mode cluster --executor-memory 2g --driver-memory 2g --conf spark.executor.extraClassPath=jsonevent-layout-1.7.jar --driver-class-path=jsonevent-layout-1.7.jar --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=log4j.xml -Dreplica_id=0 -Djob_name=spark-tagging-single -Denv=map-test -Dwait_timeout=600' --files /opt/spark_driver/conf/log4j.xml,/opt/map-test/spark-jobs/lib/*.properties --jars /opt/spark_driver/submit-jars/jsonevent-layout-1.7.jar --conf spark.dynamicAllocation.maxExecutors=5 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.tags=spark-tagging-single-map-test /opt/map-test/spark-jobs/lib/spark-jobs-cdh6.3.1.jar spark-driver.properties
[root@emr-worker-4 map-test]#
The configuration file is on HDFS; only spark-driver.properties needs to be changed:
namespace=prod
hdfs dfs -get -f /user/hadoop/sparkjob/${namespace}/properties/spark-driver.properties .
hdfs dfs -put -f spark-driver.properties /user/hadoop/sparkjob/${namespace}/properties/spark-driver.properties
hdfs dfs -cat /user/hadoop/sparkjob/${namespace}/properties/spark-driver.properties
Resubmit the spark job. Use yarn application -list to find the job's Application-Id, then use yarn top to check whether the used memory in the MEM column for that job has changed.
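A quick sketch of that check; the grep pattern reuses the example job name from the spark-submit log above and is only illustrative:
# find the ApplicationId of the resubmitted job
yarn application -list | grep spark-tagging-single
# interactive, top-like view of running applications; watch the MEM column
yarn top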
An EMR cluster has three node types: Master, Core and Task. By default both the storage services (Kudu, HDFS, HBase, Hive) and the compute services (Spark, Impala) run on the Core nodes. The compute services can be fully or partially migrated from the Core nodes to the Task nodes; watch out for memory contention between Spark and Impala.
Decommission the YARN - NodeManager on the Core node group.
Increase the YARN memory setting yarn.nodemanager.resource.memory-mb from 33280 MB to 49664 MB (75% of a Task node's 64 GB of RAM), then restart YARN - NodeManager.
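A minimal sketch for confirming the new capacity after the NodeManager restart; the node id placeholder comes from the output of the first command:
# list all NodeManagers currently known to the ResourceManager
yarn node -list -all
# show one node's details; Memory-Capacity should now report 49664 MB
yarn node -status <node-id>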
Under Cluster Monitoring - Metric Monitoring - YARN-HOME - Yarn Memory, the cluster-wide yarn_cluster_totalMB (total memory in the cluster) drops from 32.5 GB * 11 nodes ≈ 366,080 MB to 48.5 GB * 5 nodes ≈ 248,320 MB.
Note: yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb are the minimum and maximum memory a single container may request; they normally do not need to be changed.
Decommission the Impala - Impalad on the Core node group.
The Impalad on the Core node group had been OOM-killed because the node ran out of system memory:
dmesg -T
[Fri Dec 30 17:49:42 2022] Out of memory: Kill process 17032 (impalad) score 515 or sacrifice child
[Fri Dec 30 17:49:42 2022] Killed process 17032 (impalad) total-vm:56799304kB, anon-rss:33253108kB, file-rss:0kB, shmem-rss:0kB
[Fri Dec 30 17:49:43 2022] oom_reaper: reaped process 17032 (impalad), now anon-rss:0kB, file-rss:0kB, shmem-rss:24kB
Set default_pool_mem_limit, the maximum amount of memory Impala's default pool may use, to 30g (unset by default). 👉documentation👈
Maximum amount of memory that all outstanding requests in this pool may use before new requests to this pool are queued. Specified as number of bytes ('[bB]?'), megabytes ('[mM]'), gigabytes ('[gG]'), or percentage of the physical memory ('%'). Defaults to bytes if no unit is given. Ignored if fair_scheduler_config_path and llama_site_path are set.
The pool memory limits can be disabled altogether with the disable_pool_mem_limits flag.
If the value is set too small, some queries cannot run (here the setting was raised from 20g to 30g):
ERROR: Rejected query from pool default-pool: request memory needed 24.30 GB is greater than pool max mem resources 20.00 GB (configured statically).
Use the MEM_LIMIT query option to indicate how much memory is required per node. The total memory needed is the per-node MEM_LIMIT times the number of nodes executing the query. See the Admission Control documentation for more information.
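To confirm which value an impalad actually picked up, its debug web UI can be queried. This is only a sketch: the host placeholder and the default web UI port 25000 are assumptions about the deployment.
# /varz lists the command-line flags the daemon was started with
curl -s http://<impalad-host>:25000/varz | grep -i default_pool_mem_limit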
Set mem_limit, which caps the memory consumption of the impalad process, to 70% (default 80%). 👉documentation👈. See also: impala_mem_limit问题排查.pdf
The MEM_LIMIT query option defines the maximum amount of memory a query can allocate on each node. The total memory that can be used by a query is the MEM_LIMIT times the number of nodes.
There are two levels of memory limit for Impala. The --mem_limit startup option sets an overall limit for the impalad process (which handles multiple queries concurrently). That process memory limit can be expressed either as a percentage of RAM available to the process such as --mem_limit=70% or as a fixed amount of memory, such as 100gb. The memory available to the process is based on the host's physical memory and, since Impala 3.2, memory limits from Linux Control Groups. E.g. if an impalad process is running in a Docker container on a host with 100GB of memory, the memory available is 100GB or the Docker container's memory limit, whichever is less.
The MEM_LIMIT query option, which you set through impala-shell or the SET statement in a JDBC or ODBC application, applies to each individual query. The MEM_LIMIT query option is usually expressed as a fixed size such as 10gb, and must always be less than the impalad memory limit.
If query processing approaches the specified memory limit on any node, either the per-query limit or the impalad limit, then the SQL operations will start to reduce their memory consumption, for example by writing the temporary data to disk (known as spilling to disk). The result is a query that completes successfully, rather than failing with an out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back in. The slowdown could potentially be significant. Thus, while this feature improves reliability, you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence.
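For per-query tuning, the MEM_LIMIT query option can also be set from impala-shell; in the sketch below the host, table name and the 10g value are purely illustrative:
# cap a single query at 10 GB per node; this must stay below the impalad process limit
impala-shell -i <impalad-host> -q "SET MEM_LIMIT=10g; SELECT count(*) FROM <some_table>;"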
Set fe_service_threads, the number of threads Impala can use to serve client requests, to 200 (default 64). This is a per-node value. 👉documentation👈
Specifies the maximum number of concurrent client connections allowed. The default value is 64 with which 64 queries can run simultaneously.
If you have more clients trying to connect to Impala than the value of this setting, the later arriving clients have to wait for the duration specified by --accepted_client_cnxn_timeout. You can increase this value to allow more client connections. However, a large value means more threads to be maintained even if most of the connections are idle, and it could negatively impact query latency. Client applications should use the connection pool to avoid need for large number of sessions.
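As a sketch, the value in effect and the currently connected client sessions can be checked on the impalad debug web UI (the host placeholder and port 25000 are assumptions):
curl -s http://<impalad-host>:25000/varz | grep -i fe_service_threads
# the /sessions page lists the client sessions currently held open
curl -s http://<impalad-host>:25000/sessions | head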
An haproxy service can be deployed to load-balance requests to Impala; change the Impala connection URI to:
impala_uri=jdbc:impala://haproxy:21050/cdp_prod;SocketTimeout=300;LowerCaseResultSetColumnName=0;Timezone=UTC
# Minimum number of idle connections; the pool is only topped up when idle connections fall below this value. Change to 30-50.
spring.datasource.hikari.minimum-idle=5
# Maximum pool size, default 10. Change to 50.
spring.datasource.hikari.maximum-pool-size=10
# Connection timeout, default 30s. Change to 60s.
spring.datasource.hikari.connection-timeout=30000
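A minimal haproxy sketch for the JDBC port used in the URI above; the backend host names are placeholders, and the balance mode and timeouts should be reviewed against the actual workload:
listen impala-jdbc
    bind 0.0.0.0:21050
    mode tcp
    balance leastconn
    timeout client 3600s
    timeout server 3600s
    server impalad1 <impalad-host-1>:21050 check
    server impalad2 <impalad-host-2>:21050 check
leastconn spreads long-lived JDBC connections across the impalads; balance source is an alternative if session stickiness is required.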
Kudu's maximum memory usage is controlled by memory_limit_hard_bytes, which defaults to 0.
Maximum amount of memory this daemon should use, in bytes. A value of 0 autosizes based on the total system memory. A value of -1 disables all memory limiting
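A sketch for checking the value a running tablet server is using; the address placeholder and the default tserver RPC port 7050 are assumptions, and get_flags requires a reasonably recent kudu CLI:
kudu tserver get_flags <tserver-host>:7050 | grep memory_limit_hard_bytes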
The HBase heap is configured in conf/hbase-env.sh:
export HBASE_MASTER_OPTS="-Xms128m -Xmx128m -Xmn64m ..."
export HBASE_REGIONSERVER_OPTS="-Xms10g -Xmx10g -Xmn2g ..."
As shown above, the HBase HRegionServer heap is 10g.
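A quick sketch to confirm the heap flags the running RegionServer actually picked up:
ps -ef | grep HRegionServer | tr ' ' '\n' | grep -- '-Xm'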
Spark jobs may also fail with RPC timeout errors such as:
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from 10.180.77.44:40225 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
Support engineer
2022-12-26 14:11
Hello, when heavy computation fails with connection-refused errors, try the following:
1. Add hardware resources and adjust the executor memory;
2. Edit spark-defaults.conf and increase the executor communication timeout spark.executor.heartbeatInterval
cdp-pr@rd-yorcll.aliyunid.com
2022-12-26 14:14
Is there a link to any reference documentation?
Support engineer
2022-12-26 14:16
The official EMR documentation does not cover this, but you can search for more background using the error message "This timeout is controlled by spark.rpc.askTimeout".
spark-defaults.conf
Interval between each executor's heartbeats to the driver. Heartbeats let the driver know that the executor is still alive and update it with metrics for in-progress tasks. spark.executor.heartbeatInterval should be significantly less than spark.network.timeout
spark.executor.heartbeatInterval 120s
spark.network.timeout 600s
spark.rpc.askTimeout 600s
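If changing spark-defaults.conf globally is not desirable, the same settings can be passed per job on the spark-submit command line; a sketch whose values mirror the list above:
spark-submit \
  --conf spark.executor.heartbeatInterval=120s \
  --conf spark.network.timeout=600s \
  --conf spark.rpc.askTimeout=600s \
  <other options and the application jar>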