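Peking University team releases a legal large model based on chinese-llama-plus, with data and model fully open-sourced: the full model merging and usage workflow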

Time: 2023-05-10 11:30:46

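The previous post covered a legal large model, LaWGPT, which so far handles basic legal questions reasonably well. Yesterday I found that Peking University has also open-sourced a legal large model, lawyer-llama. It is trained on a large-scale legal corpus to systematically learn the Chinese legal knowledge system, so that the model can master Chinese legal knowledge and apply it to Chinese legal practice.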

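Let's look at the example from the paper.

Compare with the BELLE (Be Everyone's Large Language model Engine) model on the left of the figure above. When asked about "the legal marriage age in China", Lawyer LLaMA gives an answer that is correct and reads more like a lawyer's. Moreover, even when the necessary legal provisions are supplied, as in question B in the figure above, BELLE still fails to give a correct answer, whereas Lawyer LLaMA answers with a well-reasoned, professional response.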

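BELLE's answers also show that applying a general-purpose large model directly to a specialized vertical domain often causes many problems. The authors argue that, for a large model to adapt well to the special requirements of the legal domain, it must satisfy the following three conditions:

Precise meaning without ambiguity: in law, changing a single word can often produce the opposite legal relationship. For example, in Chinese 定金 (a deposit that secures the contract) and 订金 (an advance payment) differ by only one character, yet their meaning and legal effect under contract law are completely different;

Understanding and distinguishing legal terms: law has many terms of its own. Some appear only in the legal domain, such as the concept of 法人 (legal person), while many more carry somewhat different meanings in law than in everyday life, and the model needs to tell them apart;

Understanding real situations: besides a basic understanding and systematic command of legal terms and legal analysis, the model should also be able to understand real-life problems precisely. In other words, its core ability must be applying legal theory to solve specific problems.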

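Based on this reasoning, the authors start from the open-source LLaMA model and aim to solve the applicability problem of legal large models through the following steps:

Injecting legal knowledge: collect a large amount of raw legal text, such as statutes, judicial interpretations, and national legal documents, and continue training the original model on this new data;

Acquiring domain-specific skills: a good legal large model should be able to solve common problems in the legal domain, such as concept explanation, case analysis, and legal consultation. The authors therefore collected a set of real task cases and used ChatGPT to generate the corresponding answers for supervised fine-tuning, so that the model can handle tasks specific to the legal domain;

Reducing hallucination through information retrieval: to mitigate the hallucination problem of large models, the authors also introduce an information retrieval module. Before generating each reply, it first retrieves relevant legal provisions using the user's query and the context, and then generates the reply based on those provisions.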
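
The retrieve-then-generate idea in the third step can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the statute list is a tiny hand-written sample (paraphrased for illustration), the word-overlap scoring is a stand-in for whatever retriever the real module uses, and the names STATUTES, tokenize, retrieve, and build_prompt are all hypothetical.

```python
import re

# Hypothetical mini statute store, paraphrased for illustration only.
STATUTES = [
    "Civil Code Article 1047: the marriageable age is no earlier than 22 for men and 20 for women.",
    "Civil Code Article 586: the parties may agree that one party pays a deposit to the other as security for the claim.",
    "Civil Code Article 1079: in divorce cases the people's court shall conduct mediation.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, statutes, top_k=2):
    """Return up to top_k statutes ranked by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(statutes, key=lambda s: len(query_tokens & tokenize(s)), reverse=True)
    # Drop statutes that share no token with the query at all.
    return [s for s in ranked[:top_k] if query_tokens & tokenize(s)]


def build_prompt(query, statutes, top_k=2):
    """Put the retrieved statutes in front of the question (assumed prompt layout)."""
    context = "\n".join(f"- {s}" for s in retrieve(query, statutes, top_k))
    return f"Relevant statutes:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("What is the legal marriageable age in China?", STATUTES, top_k=1))
```

Word overlap is deliberately crude (common words such as "the" also count), but it shows the shape of the pipeline: retrieve first, then let the model answer with the retrieved provisions in its context.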

With these three steps, the authors completed the construction of Lawyer LLaMA. Its overall workflow is shown in the figure below:

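Let's look directly at the results reported in the paper.

In a like-for-like comparison, lawyer-llama is indeed considerably better from every angle examined.

The paper also contains detailed comparison data; take a look if you are interested.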

/pdf/2305.15062.pdf

Merging and usage workflow

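Now for the hands-on part, haha.

First, obtain the weights, which come in two parts (although the official git repo lists three steps).

1. Download consolidated.00.pth from the 7B model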

https://huggingface.co/nyanko7/LLaMA-7B/tree/main

2. Download the lawyer-llama weights

https://huggingface.co/pkupie/lawyer-llama-13b-beta1.0/tree/main

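To merge, use the officially provided decrypt.py script. The consolidated.00.pth file from step 1 is passed as the key file: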

for f in "/path/to/model/pytorch_model"*".enc"; do
    if [ -f "$f" ]; then
        python3 decrypt.py "$f" "/path/to_original_llama/7B/consolidated.00.pth" "/path/to/model"
    fi
done
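
For readers not working in a POSIX shell, the same loop can be sketched in Python. This is an assumed equivalent rather than anything shipped with the repo: build_decrypt_commands and run_all are hypothetical helper names, the two paths are the placeholders from the command above, and dry_run defaults to True so nothing runs until you switch it off.

```python
import glob
import os
import subprocess
import sys


def build_decrypt_commands(model_dir, key_path, script="decrypt.py"):
    """Build one decrypt.py command per pytorch_model*.enc file in model_dir."""
    pattern = os.path.join(model_dir, "pytorch_model*.enc")
    return [
        [sys.executable, script, enc_path, key_path, model_dir]
        for enc_path in sorted(glob.glob(pattern))
        if os.path.isfile(enc_path)
    ]


def run_all(model_dir, key_path, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in build_decrypt_commands(model_dir, key_path):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing file (e.g. a checksum mismatch).
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths copied from the shell command; replace them with real ones.
    run_all("/path/to/model", "/path/to_original_llama/7B/consolidated.00.pth")
```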

The script is as follows:

import os
import sys
import hashlib
import multiprocessing


def xor_bytes(data, key):
    return bytes(a ^ b for a, b in zip(data, (key * (len(data) // len(key) + 1))[:len(data)]))


def xor_worker(task_queue, result_queue):
    while True:
        chunk_idx, data, key = task_queue.get()
        result_queue.put((chunk_idx, xor_bytes(data, key)))
        task_queue.task_done()


def write_result_chunk(fp, w_chunk_idx, pending, hasher):
    if not pending:
        return w_chunk_idx, pending
    pending.sort()
    for pending_idx, (chunk_idx, chunk) in enumerate(pending):
        if chunk_idx != w_chunk_idx:
            return w_chunk_idx, pending[pending_idx:]
        fp.write(chunk)
        hasher.update(chunk)
        w_chunk_idx += 1
    return w_chunk_idx, []


def main(input_file, key_file, output_dir):
    worker_count = max(1, os.cpu_count() - 1)
    print(f"Decrypting file {input_file} with {worker_count} workers")

    task_queue = multiprocessing.JoinableQueue(worker_count * 3)
    result_queue = multiprocessing.Queue()
    processes = [
        multiprocessing.Process(target=xor_worker, args=(task_queue, result_queue))
        for _ in range(worker_count)
    ]
    for p in processes:
        p.daemon = True
        p.start()

    chunk_size = 10 * 1024 * 1024
    key_chunk_size = 10 * 1024 * 1024
    hasher = hashlib.sha256()

    # Get the checksum from the input file name
    input_file_basename = os.path.basename(input_file)
    checksum_hex = input_file_basename.split(".")[-2]

    with open(input_file, "rb") as in_file, open(key_file, "rb") as key_file:
        # Get the size of the input file
        file_size = os.path.getsize(input_file)
        # Minus the checksum size
        file_size -= hasher.digest_size
        # Read the checksum from the beginning of the input file
        expected_hash = in_file.read(hasher.digest_size)
        # Create the output file path without the checksum in the filename
        # remove .<checksum>.enc
        input_file_basename = input_file_basename[:-len(checksum_hex) - 5]
        output_file = os.path.join(output_dir, input_file_basename)

        with open(output_file, "wb") as out_file:
            r_chunk_idx = 0  # how many chunks we have read
            w_chunk_idx = 0  # how many chunks have been written
            write_pending = []  # have xor results, awaiting to be written to file
            bytes_read = 0
            while True:
                chunk = in_file.read(chunk_size)
                if not chunk:
                    break
                key_chunk = key_file.read(key_chunk_size)
                if not key_chunk:
                    key_file.seek(0)
                    key_chunk = key_file.read(key_chunk_size)
                task_queue.put((r_chunk_idx, chunk, key_chunk))
                # read available results
                while not result_queue.empty():
                    write_pending.append(result_queue.get())
                w_chunk_idx_new, write_pending = write_result_chunk(
                    out_file, w_chunk_idx, write_pending, hasher)
                bytes_read += (w_chunk_idx_new - w_chunk_idx) * chunk_size
                progress = bytes_read / file_size * 100
                sys.stdout.write(f"\rProgress: {progress:.2f}%")
                sys.stdout.flush()
                w_chunk_idx = w_chunk_idx_new
                r_chunk_idx += 1

            # wait for xor workers
            sys.stdout.write('\rWaiting for workers...')
            sys.stdout.flush()
            task_queue.join()
            while not result_queue.empty():
                write_pending.append(result_queue.get())
            sys.stdout.write('\rWriting final chunks...')
            sys.stdout.flush()
            write_result_chunk(out_file, w_chunk_idx, write_pending, hasher)

    computed_hash = hasher.digest()
    if computed_hash != expected_hash:
        print("\nError: Checksums do not match. The file may be corrupted.")
        sys.exit(1)
    print("\nDecryption completed.")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: decrypt.py input_file key_file output_dir")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2], sys.argv[3])
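
To see why the original LLaMA file is needed at all, here is a toy round trip of the same scheme on a few bytes. It is a simplified sketch, not the official code. It assumes, as the script suggests, that an .enc file is a 32-byte SHA-256 digest of the plaintext followed by the plaintext XORed with the key. The names xor_with_key, toy_encrypt, and toy_decrypt are hypothetical, and the key is simply cycled rather than read in 10 MB chunks.

```python
import hashlib
from itertools import cycle


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR data with the key, repeating the key as often as needed."""
    return bytes(a ^ b for a, b in zip(data, cycle(key)))


def toy_encrypt(plain: bytes, key: bytes) -> bytes:
    """Prefix the SHA-256 of the plaintext, then append the XORed body."""
    return hashlib.sha256(plain).digest() + xor_with_key(plain, key)


def toy_decrypt(blob: bytes, key: bytes) -> bytes:
    """Undo toy_encrypt and verify the stored checksum."""
    digest_size = hashlib.sha256().digest_size  # 32 bytes
    expected, body = blob[:digest_size], blob[digest_size:]
    plain = xor_with_key(body, key)  # XOR is its own inverse
    if hashlib.sha256(plain).digest() != expected:
        raise ValueError("checksum mismatch: wrong key or corrupted file")
    return plain


if __name__ == "__main__":
    key = b"stand-in for consolidated.00.pth"
    blob = toy_encrypt(b"stand-in for model weights", key)
    print(toy_decrypt(blob, key))
```

Because XOR is its own inverse, the released files are unusable without the exact key file, and the checksum catches a wrong or corrupted key right away.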

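The git repo also provides a statute retrieval module.

3. Download the statute retrieval module from Baidu Netdisk (extraction code: r0vx), then run python server.py inside it to start the statute retrieval service, which listens on port 9098 by default.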
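
Before pointing the model at the retrieval service, it can help to confirm that something is actually listening on the port. The sketch below only opens a TCP connection and makes no assumption about the service's request format, which is not described here. is_service_up is a hypothetical helper name, and the host default assumes the server runs on the same machine.

```python
import socket


def is_service_up(host="127.0.0.1", port=9098, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("retrieval service reachable:", is_service_up())
```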

Once the model has been decrypted, it is ready to use.