The bug where NVDLA's dump_mem always dumps all zeros is caused by calling axi->read() directly and using its return value as the byte written to the dump file. The fix is to submit the gem5 memory read request as if it were a read request issued by NVDLA during execution, and then tick additional cycles until the read response has arrived.
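To illustrate the difference between the two approaches, here is a minimal toy model. It is not the code in this PR: `DelayedMemory`, `dumpNaive`, and `dumpByTicking` are hypothetical names, and the fixed response latency is an assumption made only for the sketch. It shows why using the immediate return value yields zeros, while submitting requests and ticking until the responses arrive yields the real bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a memory port whose responses arrive
// a fixed number of cycles after the request is submitted.
class DelayedMemory {
public:
    explicit DelayedMemory(unsigned latency) : latency_(latency) {}

    void poke(uint64_t addr, uint8_t value) { backing_[addr] = value; }

    // Models the failure mode: the call returns before any response
    // exists, so the caller only ever sees a placeholder 0.
    uint8_t readNow(uint64_t /*addr*/) const { return 0; }

    // Queue a read request; the data becomes available after latency_ ticks.
    void submit(uint64_t addr) { inflight_.push_back({addr, latency_}); }

    // Advance one cycle and retire requests whose latency has elapsed.
    void tick() {
        for (auto &req : inflight_)
            if (req.remaining > 0) --req.remaining;
        while (!inflight_.empty() && inflight_.front().remaining == 0) {
            auto it = backing_.find(inflight_.front().addr);
            responses_.push_back(it == backing_.end() ? 0 : it->second);
            inflight_.pop_front();
        }
    }

    bool hasResponse() const { return !responses_.empty(); }

    uint8_t popResponse() {
        assert(!responses_.empty());
        uint8_t v = responses_.front();
        responses_.pop_front();
        return v;
    }

private:
    struct Request { uint64_t addr; unsigned remaining; };
    unsigned latency_;
    std::unordered_map<uint64_t, uint8_t> backing_;
    std::deque<Request> inflight_;
    std::deque<uint8_t> responses_;
};

// Buggy pattern: treat the immediate return value as the data.
std::vector<uint8_t> dumpNaive(const DelayedMemory &mem, uint64_t base, std::size_t len) {
    std::vector<uint8_t> out;
    for (std::size_t i = 0; i < len; ++i) out.push_back(mem.readNow(base + i));
    return out;
}

// Fixed pattern: submit the requests, then tick until the responses arrive
// (bounded by maxCycles so the loop cannot hang).
std::vector<uint8_t> dumpByTicking(DelayedMemory &mem, uint64_t base,
                                   std::size_t len, unsigned maxCycles) {
    for (std::size_t i = 0; i < len; ++i) mem.submit(base + i);
    std::vector<uint8_t> out;
    for (unsigned c = 0; c < maxCycles && out.size() < len; ++c) {
        mem.tick();
        while (mem.hasResponse()) out.push_back(mem.popResponse());
    }
    return out;
}
```

In this model, `dumpNaive` returns all zeros regardless of memory contents, while `dumpByTicking` returns the stored bytes once enough cycles have elapsed.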
A second bug is in rtlNVDLA::runIterationNVDLA(), where wr->clearOutput() was originally called after trace->axievent() and before processOutput(output). With that ordering, the read and write requests submitted by dump_mem and load_mem are cleared before they can be processed, so they are silently dropped. I moved wr->clearOutput() to the beginning of the function so that these requests survive until processOutput(output). If this could cause other errors, please let me know.
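The ordering issue can be reproduced in a small toy model. Again, this is only a sketch under assumptions: `ToyWrapper`, `traceStep`, and the two `iterate...` functions are hypothetical stand-ins, not the real rtlNVDLA or wrapper classes, and the real function does more work per iteration than is shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the wrapper's per-iteration output buffer.
struct ToyWrapper {
    std::vector<uint64_t> pendingReads;
    void clearOutput() { pendingReads.clear(); }
};

// Stand-in for the trace step: a dump_mem-like command queues a read.
void traceStep(ToyWrapper &wr, uint64_t addr) {
    wr.pendingReads.push_back(addr);
}

// Stand-in for processOutput(): returns the addresses that would be
// forwarded to the memory system.
std::vector<uint64_t> processOutput(const ToyWrapper &wr) {
    return wr.pendingReads;
}

// Original ordering: the clear sits between the trace step and the
// processing step, so the freshly queued request is wiped.
std::vector<uint64_t> iterateClearAfterTrace(ToyWrapper &wr, uint64_t addr) {
    traceStep(wr, addr);
    wr.clearOutput();
    return processOutput(wr);
}

// New ordering: clear leftovers from the previous iteration first,
// then let the trace step queue its request and process it.
std::vector<uint64_t> iterateClearFirst(ToyWrapper &wr, uint64_t addr) {
    wr.clearOutput();
    assert(wr.pendingReads.empty());
    traceStep(wr, addr);
    return processOutput(wr);
}
```

With the clear placed first, stale entries from the previous iteration are still discarded, but the request queued during the current iteration reaches the processing step.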
With these changes, sanity3, conv_8x8_fc_int16, and sdp_relu_int16 pass with nv_full in my local tests. googlenet_conv2_3x3_int16 reports a CSB read mismatch, but its dump file matches the golden answer. I believe this is acceptable, because the verilated nv_full shows the same result when it is verified with NVDLA's own nvdla.cpp.