NLP-Based Software Security Research (Part 2)

Contents

- Cross-Language Binary-Source Code Matching with Intermediate Representations
  - abstract
  - introduction
  - Motivation
    - Cross-language code clone detection
    - Cross-Language Binary-Source Code matching
  - Preliminaries
    - Intermediate Representation
    - Code embedding
    - problem formulation
  - Cross-Language binary-source code matching
    - Overview
    - Transforming Source and Binary Code into IRs
    - Transformer-based LLVM-IR embedding
    - Model Learning
    - Code Matching
  - Experimental Setup
    - Research Questions
    - Evaluated Tasks and Dataset
      - Cross-language source-source code matching
      - Cross-language binary-source code matching
      - Single-language matching
      - Pre-training dataset
    - Baseline
    - Evaluation Metrics
  - Experimental Result And Analysis

Cross-Language Binary-Source Code Matching with Intermediate Representations

SANER 2022, a rank-B conference

Huazhong University of Science and Technology (HUST)

abstract

Binary-source code matching plays an important role in many security and software engineering related tasks such as malware detection, reverse engineering and vulnerability assessment.

Key term: vulnerability assessment.

Currently, several approaches have been proposed for binary-source code matching by jointly learning the embeddings of binary code and source code in a common vector space.

In other words, several binary-source matching methods jointly learn embeddings of binary code and source code in a shared vector space.

Despite much effort, existing approaches target matching binary code and source code written in a single programming language.

However, in practice, software applications are often written in different programming languages to cater for different requirements and computing platforms.

That is, to cater to different requirements and computing platforms.

Matching binary and source code across programming languages introduces additional challenges when maintaining multi-languages and multi-platform applications.

To this end, this paper formulates the problem of cross-language binary-source code matching, and develops a new dataset for this new problem.

That is, the paper formulates the cross-language binary-source code matching problem and builds a new dataset for it.

We present a novel approach XLIR, which is a Transformer-based neural network by learning the intermediate representations for both binary and source code.

To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks of cross-language binary-source code matching, and cross-language source-source code matching, on top of our curated dataset.

Experimental results and analysis show that our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both of the two tasks.

intermediate: in between, in the middle

outperforms: performs better than

introduction

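Researchers have recently begun to investigate binary-to-source matching, for example binary code provenance and binary-to-source reuse detection in open-source software. These methods extract semantic features from both the binary code and the source code, and use two encoder networks to represent them as two hidden vectors.

Detecting clones among programs written in different programming languages has seen only limited progress in the literature. In practice, however, software applications are often written in several programming languages to cater to different platforms.

Current methods aim to align the semantic embeddings of binary code and source code in an end-to-end manner.

Key insight. In a compiler, the intermediate representation (IR) supports multiple front-end programming languages such as C and Java, and multiple back-end architectures such as ARM and MIPS. In other words, the IR is generally independent of both the programming language and the computing architecture. Because IRs share similar vocabulary and syntactic structure, they can significantly narrow the gap between binary code and source code.

As the paper's illustrating figure shows, the source code and the binary code of a simple loop both produce LLVM-IR with similar semantics.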

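Scenario 1, vulnerability pattern detection: given a known vulnerability in source code, it is necessary to check whether binaries built from other languages contain a similar backdoor.
Scenario 2, copyright protection analysis: given a binary library, check whether it matches a library in OSS, and then check whether the license is respected.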

Preliminaries

Intermediate Representation

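An intermediate representation (IR) is a well-defined, well-formed program representation with simple syntax rules. It is typically the form a compiler uses on the way from source code to target code.

A modern compiler first parses the source code, then translates it into IR, and finally generates target code from the IR.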
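As a rough illustration of stopping a compiler at the IR stage, the sketch below builds a clang command line that emits textual LLVM-IR. The file names `loop.c` and `loop.ll` and the helper name are hypothetical; the notes do not say which exact commands the authors used.

```python
import shlex

def clang_emit_ir_command(source_path, ir_path, opt_level="-O0"):
    """Build (but do not run) a clang command that stops at textual LLVM-IR.

    The helper name and default optimization level are illustrative assumptions.
    """
    # -S together with -emit-llvm asks clang for human-readable IR (.ll)
    # instead of native assembly or object code.
    return ["clang", opt_level, "-S", "-emit-llvm", source_path, "-o", ir_path]

# Hypothetical input and output file names.
cmd = clang_emit_ir_command("loop.c", "loop.ll")
print(shlex.join(cmd))
```

The command is only constructed here, not executed, so the sketch runs without clang installed; passing the list to `subprocess.run` would perform the actual translation.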

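This extra layer is independent in both directions: the IR is independent of both the source code and the target machine code, yet it preserves the semantics of the program.

IR therefore forms the basis of the cross-language matching approach.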

Code embedding

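Code embedding, also called code representation learning, aims to map a program into a vector while preserving its semantics.

Current code embeddings fall into four categories:

Token sequence
AST
IR
Code graph

Representing code as a token sequence is the most natural choice; it reflects the lexical information of the code.

To capture structural information, some works use structural neural networks (TreeLSTM and GGNN) to represent the AST and code graphs (CFG and DFG).

Some recent works use IR to represent low-level information about the code.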

problem formulation

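S denotes the source code, B the binary file, V the embedding of the IR, and J a distance function that measures similarity, such as cosine similarity.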
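To make the role of J concrete, here is a minimal sketch of cosine similarity between embedding vectors, plus a helper that picks the most similar source candidate for a binary. The vectors and the names `v_binary`, `v_sources`, and `best_match` are made up for illustration; in the paper the embeddings come from a learned model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)

def best_match(query_vec, candidates):
    """Return the name of the candidate whose embedding is most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query_vec, candidates[name]))

# Toy embeddings (assumed values, not taken from the paper).
v_binary = [1.0, 0.0, 1.0]
v_sources = {"src_a": [2.0, 0.0, 2.0], "src_b": [0.0, 1.0, 0.0]}
print(best_match(v_binary, v_sources))
```

Cosine similarity ignores vector length, so `src_a` matches perfectly even though its components are twice as large as those of the query.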

Cross-Language binary-source code matching

Overview

Code Matching

Cross-language binary-source code matching

Two compilers are used: GCC and LLVM Clang
Four optimization levels: -O0, -O1, -O2, and -O3
Cross-platform: x86-32, x86-64, arm-32, arm-64
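To get a feel for how many build variants these settings imply, the sketch below enumerates the combinations. It assumes every compiler is paired with every optimization level and every platform, which these notes do not state explicitly; the compiler and platform strings are plain labels, not real toolchain flags.

```python
from itertools import product

# Labels taken from the list above; only the -O levels are real compiler flags.
compilers = ["gcc", "clang"]
opt_levels = ["-O0", "-O1", "-O2", "-O3"]
platforms = ["x86-32", "x86-64", "arm-32", "arm-64"]

# Full cross product (an assumption of this sketch).
build_configs = list(product(compilers, opt_levels, platforms))

print(len(build_configs))
print(build_configs[0])
```

Under that assumption each source program yields 2 x 4 x 4 = 32 binary variants.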

Single-language matching

This task uses the POJ-104 dataset, which consists of 50,000 programs written in C and C++.

Pre-training dataset

Baseline

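Baselines for cross-language binary-source code matching:

BinPro: extracts function call graphs from the binary and the source code and matches them with the Hungarian algorithm
B2SFinder: extracts 7 features along 3 dimensions (strings, numbers, control flow) and introduces a weighted matching algorithm
XLIR (LSTM): encodes the IR with an LSTM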
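The Hungarian algorithm mentioned for BinPro solves an assignment problem: pair each binary function with one source function so that the total matching cost is minimal. The sketch below is not the Hungarian algorithm itself but a brute-force stand-in that finds the same optimum on a tiny, made-up cost matrix. It is practical only for very small inputs, and the costs are illustrative rather than BinPro's actual features.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Brute-force the minimum-cost one-to-one assignment for a square cost matrix.

    Returns (assignment, total_cost), where assignment[row] is the chosen column.
    The Hungarian algorithm reaches the same optimum in polynomial time.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[row][col] for row, col in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# Rows: binary functions, columns: source functions (hypothetical dissimilarity scores).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assignment, total = min_cost_assignment(cost)
print(assignment, total)
```

Here binary function 0 pairs with source function 1, function 1 with 0, and function 2 with 2, for a total cost of 5.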

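Baselines for cross-language source-source code matching:

LICCA: detects source code clones based on syntactic and semantic features of the code
XLIR, with an LSTM encoding the IR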

Evaluation Metrics

Recall, precision, and F1.
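For reference, these three metrics can be computed from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the example counts are invented and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 correct matches, 2 wrong matches, 2 missed matches.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.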

Experimental Result And Analysis
