Archive for the ‘IT’ Category

Install tensorflow on Ubuntu 18.04 with CUDA 9.2


This is a guide on installing the latest tensorflow with the latest CUDA library on the latest Ubuntu LTS. The installation is mostly straightforward, but there are still traps that I stepped into.

The tensorflow homepage only provides prebuilt binaries supporting CUDA 9.0, but Nvidia phased out 9.0 quite some time ago. One has to click into the legacy link to download it. I hate legacy stuff, so I will build tensorflow from source.

Step 1: Install latest nvidia kernel driver 396.x

There are two ways to install it: from an Ubuntu PPA, or from Nvidia. I chose the Ubuntu PPA because it’s better integrated with the package management system. As of now, the default repository only has driver versions up to 390, which are not compatible with CUDA 9.2. I ran into the following error:

2018-07-27 21:56:39.036138: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2018-07-27 21:56:39.036223: E tensorflow/core/common_runtime/] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

I did the following to install driver 396:

$ sudo add-apt-repository ppa:graphics-drivers
$ sudo apt-get update
$ sudo apt-get purge 'nvidia*'
$ sudo apt-get install nvidia-driver-396

Note that the purge command is necessary, otherwise apt refuses to install 396.

After reboot, one should be able to get the following:

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  396.45  Thu Jul 12 20:49:29 PDT 2018
GCC version:  gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3) 

Step 2: Install latest CUDA and cuDNN SDK

Nvidia only provides CUDA for Ubuntu 17.10 and 16.04, not 18.04, but I found that the CUDA build for 17.10 works on 18.04 as well. Installation is quite straightforward: download the installer and run it as root. It will be installed into /usr/local/cuda-9.2/. To install cuDNN, I installed the following 3 debs:


Since the library installation directory is not a standard one, the dynamic linker doesn’t know where to find the libraries. We can set the following environment variable:

export LD_LIBRARY_PATH=/usr/local/cuda-9.2/extras/CUPTI/lib64:/usr/local/cuda-9.2/lib64

But I prefer not to mess with LD_LIBRARY_PATH. I’d rather put it into /etc/, i.e. create a file named cuda.conf that looks like this:

$ cat /etc/
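The two directories to list are the same ones as in the LD_LIBRARY_PATH line above; assuming the standard /etc/ld.so.conf.d/ location, the file would be:

```
$ cat /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda-9.2/lib64
/usr/local/cuda-9.2/extras/CUPTI/lib64
```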

After that, run sudo ldconfig.

To test whether CUDA and the kernel driver are working:

$ /usr/local/cuda-9.2/extras/demo_suite/deviceQuery 
/usr/local/cuda-9.2/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1070 Ti"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8117 MBytes (8510832640 bytes)
  (19) Multiprocessors, (128) CUDA Cores/MP:     2432 CUDA Cores
  GPU Max Clock rate:                            1683 MHz (1.68 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1, Device0 = GeForce GTX 1070 Ti
Result = PASS

Step 3: Install tensorflow dependencies

Following the official tensorflow doc, install the Python packages. I only have Python 3.

sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel

To install bazel, download the installer and run it without root. The bazel homepage says only 16.04 and 14.04 are supported, but I found it runs fine on my 18.04.

I read from many sources that building tensorflow with the default gcc 7 is problematic and that one should use gcc 6, so I installed it. There is no need to make gcc 6 the default, because the tensorflow configure script will ask which gcc to use. One command will do: sudo apt install gcc-6 g++-6

Step 4: Compile tensorflow

Git clone, git checkout r1.9, then run configure. I pretty much chose “no” for all the options except CUDA. After configure, a file “.tf_configure.bazelrc” is generated. Mine looks like the following:

$ cat .tf_configure.bazelrc
build --action_env PYTHON_BIN_PATH="/usr/bin/python3"
build --action_env PYTHON_LIB_PATH="/usr/lib/python3/dist-packages"
build --python_path="/usr/bin/python3"
build --define with_jemalloc=true
build:gcp --define with_gcp_support=true
build:hdfs --define with_hdfs_support=true
build:s3 --define with_s3_support=true
build:kafka --define with_kafka_support=true
build:xla --define with_xla_support=true
build:gdr --define with_gdr_support=true
build:verbs --define with_verbs_support=true
build --action_env TF_NEED_OPENCL_SYCL="0"
build --action_env TF_NEED_CUDA="1"
build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda-9.2"
build --action_env TF_CUDA_VERSION="9.2"
build --action_env CUDNN_INSTALL_PATH="/usr/lib/x86_64-linux-gnu"
build --action_env TF_CUDNN_VERSION="7"
build --action_env TF_NCCL_VERSION="1"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="6.1"
build --action_env LD_LIBRARY_PATH="/usr/local/cuda-9.2/extras/CUPTI/lib64:/usr/local/cuda-9.2/lib64"
build --action_env TF_CUDA_CLANG="0"
build --action_env GCC_HOST_COMPILER_PATH="/usr/bin/gcc-6"
build --config=cuda
test --config=cuda
build --define grpc_no_ares=true
build:opt --copt=-march=native
build:opt --host_copt=-march=native
build:opt --define with_default_optimizations=true
build --strip=always

After that, build and install:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-1.9.0-cp36-cp36m-linux_x86_64.whl

Note that the tensorflow webpage asks you to run sudo pip install. This is wrong. Never sudo pip.

Finally, test it with the tensorflow hello world program.

$ python3
Python 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-07-27 23:53:22.182147: I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-27 23:53:22.182524: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: GeForce GTX 1070 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.43GiB
2018-07-27 23:53:22.182537: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2018-07-27 23:53:22.348349: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-27 23:53:22.348378: I tensorflow/core/common_runtime/gpu/]      0 
2018-07-27 23:53:22.348383: I tensorflow/core/common_runtime/gpu/] 0:   N 
2018-07-27 23:53:22.348530: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7174 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
>>> print(
b'Hello, TensorFlow!'

Happy machine learning. You are welcome.


Build Android Clang Toolchain


This document describes how to build the clang toolchain included in the Android NDK from source. The official documentation didn’t work for me.

Step 1: Download NDK binary

Go to the Google website and download the NDK binary for Linux. We only need the repo.prop file, which records the exact commit of each project in the source repo. It’s at NDK_DIR/toolchains/llvm/prebuilt/linux-x86_64/repo.prop.
This documentation is correct for both NDK r15c and r16.

Step 2: Download toolchain source from AOSP repo

repo init -u -b llvm
repo sync

Step 3: Checkout the exact version as in NDK

You can manually go to each project directory and do `git checkout <commit>`, where “<commit>” is listed in repo.prop.
You can also use the following script: go to the AOSP root dir and run it with the path to repo.prop as the argument.


#!/bin/bash
# Run from the AOSP root dir, with the path to repo.prop as the argument.

if [ $# -ne 1 ] ; then
	echo "usage: bash $0 repo.prop"
	exit 1
fi
prop="$1"
aosp=`pwd`
if [ '!' -f "$prop" ] ; then
	echo "$prop doesnt exist"
	exit 1
fi
if [ '!' -f .repo/project.list ] ; then
	echo "not in aosp dir"
	exit 1
fi

repo list > tmp1

for i in `cut -d ' ' -f 1 "$prop"` ; do
	commit=`grep "^$i " "$prop" | cut -d ' ' -f 2`
	path=`grep " : $i$" tmp1 | sed -e 's/ .*//'`
	if [ -z "$path" ] ; then
		echo "Warning: $i not found"
		continue
	fi
	if [ '!' -d "$path" ] ; then
		echo "Warning: $path not exist"
		continue
	fi
	cd "$path"
	git checkout -q "$commit"
	cd "$aosp"
done

Step 4: symlink bootstrap.bash

ln -s build/soong/bootstrap.bash ./

Step 5: Patch prebuilts/sdk

There are some version mismatches between prebuilts/sdk and soong.

diff --git a/tools/Android.bp b/tools/Android.bp
index 6f2d1ac..6ed1c8f 100644
--- a/tools/Android.bp
+++ b/tools/Android.bp
@@ -2,7 +2,7 @@ cc_prebuilt_library_shared {
     name: "libLLVM",
     host_supported: true,
     target: {
-        linux_glibc_x86_64: {
+        linux_x86_64: {
             srcs: ["linux/lib64/"],
         darwin_x86_64: {
@@ -21,7 +21,7 @@ cc_prebuilt_library_shared {
     name: "libclang",
     host_supported: true,
     target: {
-        linux_glibc_x86_64: {
+        linux_x86_64: {
             srcs: ["linux/lib64/"],
         darwin_x86_64: {
@@ -35,8 +35,3 @@ cc_prebuilt_library_shared {
-java_import {
-    name: "sdk-core-lambda-stubs",
-    jars: ["core-lambda-stubs.jar"],

Step 6: Patch build/soong

The soong version listed in repo.prop seems to be too old for the header-abi-linker tool. We need to apply the following patch:

diff --git a/cc/builder.go b/cc/builder.go
index 51c4ce9..2f95006 100644
--- a/cc/builder.go
+++ b/cc/builder.go
@@ -170,12 +170,12 @@ var (
 	sAbiLink = pctx.AndroidStaticRule("sAbiLink",
-			Command:        "$sAbiLinker -o ${out} $symbolFile -arch $arch -api $api $exportedHeaderFlags @${out}.rsp ",
+			Command:        "$sAbiLinker -o ${out} $symbolFilter -arch $arch -api $api $exportedHeaderFlags @${out}.rsp ",
 			CommandDeps:    []string{"$sAbiLinker"},
 			Rspfile:        "${out}.rsp",
 			RspfileContent: "${in}",
-		"symbolFile", "arch", "api", "exportedHeaderFlags")
+		"symbolFilter", "arch", "api", "exportedHeaderFlags")
 	_ = pctx.SourcePathVariable("sAbiDiffer", "prebuilts/build-tools/${config.HostPrebuiltTag}/bin/header-abi-diff")
@@ -613,14 +613,17 @@ func TransformObjToDynamicBinary(ctx android.ModuleContext,
 // Generate a rule to combine .dump sAbi dump files from multiple source files
 // into a single .ldump sAbi dump file
-func TransformDumpToLinkedDump(ctx android.ModuleContext, sAbiDumps android.Paths,
+func TransformDumpToLinkedDump(ctx android.ModuleContext, sAbiDumps android.Paths, soFile android.Path,
 	symbolFile android.OptionalPath, apiLevel, baseName, exportedHeaderFlags string) android.OptionalPath {
 	outputFile := android.PathForModuleOut(ctx, baseName+".lsdump")
-	var symbolFileStr string
+	var symbolFilterStr string
 	var linkedDumpDep android.Path
 	if symbolFile.Valid() {
-		symbolFileStr = "-v " + symbolFile.Path().String()
+		symbolFilterStr = "-v " + symbolFile.Path().String()
 		linkedDumpDep = symbolFile.Path()
+	} else {
+		linkedDumpDep = soFile
+		symbolFilterStr = "-so " + soFile.String()
 	ctx.ModuleBuild(pctx, android.ModuleBuildParams{
 		Rule:        sAbiLink,
@@ -629,9 +632,9 @@ func TransformDumpToLinkedDump(ctx android.ModuleContext, sAbiDumps android.Path
 		Inputs:      sAbiDumps,
 		Implicit:    linkedDumpDep,
 		Args: map[string]string{
-			"symbolFile": symbolFileStr,
-			"arch":       ctx.Arch().ArchType.Name,
-			"api":        apiLevel,
+			"symbolFilter": symbolFilterStr,
+			"arch":         ctx.Arch().ArchType.Name,
+			"api":          apiLevel,
 			"exportedHeaderFlags": exportedHeaderFlags,
diff --git a/cc/library.go b/cc/library.go
index 997344c..0164221 100644
--- a/cc/library.go
+++ b/cc/library.go
@@ -589,12 +589,12 @@ func (library *libraryDecorator) linkShared(ctx ModuleContext,
 	objs.sAbiDumpFiles = append(objs.sAbiDumpFiles, deps.WholeStaticLibObjs.sAbiDumpFiles...)
 	library.coverageOutputFile = TransformCoverageFilesToLib(ctx, objs, builderFlags, library.getLibName(ctx))
-	library.linkSAbiDumpFiles(ctx, objs, fileName)
+	library.linkSAbiDumpFiles(ctx, objs, fileName, ret)
 	return ret
-func (library *libraryDecorator) linkSAbiDumpFiles(ctx ModuleContext, objs Objects, fileName string) {
+func (library *libraryDecorator) linkSAbiDumpFiles(ctx ModuleContext, objs Objects, fileName string, soFile android.Path) {
 	//Also take into account object re-use.
 	if len(objs.sAbiDumpFiles) > 0 && ctx.createVndkSourceAbiDump() && !ctx.Vendor() {
 		refSourceDumpFile := android.PathForVndkRefAbiDump(ctx, "current", fileName, vndkVsNdk(ctx), true)
@@ -612,7 +612,7 @@ func (library *libraryDecorator) linkSAbiDumpFiles(ctx ModuleContext, objs Objec
 			SourceAbiFlags = append(SourceAbiFlags, reexportedInclude)
 		exportedHeaderFlags := strings.Join(SourceAbiFlags, " ")
-		library.sAbiOutputFile = TransformDumpToLinkedDump(ctx, objs.sAbiDumpFiles, symbolFile, "current", fileName, exportedHeaderFlags)
+		library.sAbiOutputFile = TransformDumpToLinkedDump(ctx, objs.sAbiDumpFiles, soFile, symbolFile, "current", fileName, exportedHeaderFlags)
 		if refSourceDumpFile.Valid() {
 			unzippedRefDump := UnzipRefDump(ctx, refSourceDumpFile.Path(), fileName)
 			library.sAbiDiff = SourceAbiDiff(ctx, library.sAbiOutputFile.Path(), unzippedRefDump, fileName)

Step 7: Build

Run `python external/clang/`

Performance Cost of Reducing the Number of Registers in X86-64


When compiling a C program, the C compiler allocates CPU registers to program variables. For example, in the following compilation, the compiler assigns register rdi to variable a and register r15 to variable b.

int a = 1;
int b = 2;
b = a + b;
mov   $0x1,%rdi
mov   $0x2,%r15
add   %rdi,%r15

In the X86-64 architecture, there are 16 general purpose registers: rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi, r8, r9, r10, r11, r12, r13, r14 and r15. When there are more variables than registers, the compiler has to use memory instead of registers. When a function calls another function, the caller has to save some of its registers to memory before the call and restore them from memory after the call. These two cases are called spilling, which is bad for performance because memory is much slower than registers. Having more registers reduces spilling.

There are papers that modify the compiler to dedicate one or more registers to some specific purpose. For example, in order to track information flow, LIFT uses a dedicated register to store information-flow metadata. For another example, CPI hides a secret address in a dedicated register so that during an arbitrary-memory-read attack, the attacker cannot read the address. Some SFI implementations use a dedicated register to store the jump target address, so that the original code cannot mess with it.

Besides spilling, another problem with having fewer registers is data dependencies. Modern CPUs can execute instructions in parallel or out of order as long as there are no dependencies between the instructions. In our previous example, the CPU can execute mov $0x1,%rdi and mov $0x2,%r15 in parallel. However, in the following example, the four instructions have to be executed in sequence because they all use rax, so every instruction depends on the previous one.

mov   $0x1,%rax      # rax = 1
lea   0x2(%rax),%r10 # r10 = rax + 2
mov   $0x3,%rax      # rax = 3
lea   0x4(%rax),%r11 # r11 = rax + 4

With more registers available, the compiler can generate faster code by renaming the rax register in the last two instructions to an unused register. After renaming, the first two and the last two instructions can be executed in parallel. In modern CPUs, there are more physical registers than named registers. During execution, a named register, say %rdi, is mapped to an underlying physical register. The Haswell microarchitecture has 168 physical registers. So register renaming is performed by both the compiler and the CPU, i.e. there is some redundant work. In terms of data dependencies, maybe reducing the number of named registers isn’t so bad, because the CPU can still do renaming.

I think we have enough theoretical speculation. Let’s put it to some real tests. I have modified the LLVM/Clang compiler so that it uses fewer registers. In our evaluation, we have 4 versions of the compiler:

  • C0: The original compiler: LLVM 4.0
  • C1: It doesn’t use r14.
  • C2: It doesn’t use r13, r14 or r15.
  • C3: It doesn’t use r11. The purpose of this is to see the difference between a callee-saved register (r14) and a caller-saved register (r11).

Let’s first look at the difference between the binaries produced by C0 and C1, by compiling the following C code with both compilers.

for (i = 0; i < 100000000; i++) {
    n0 = n1 + n4 * 8 + 46;
    n1 = n2 + n5 * 8 + 95;
    n2 = n3 + n6 * 1 + 55;
    n3 = n4 + n7 * 2 + 90;
    n4 = n5 + n8 * 1 + 58;
    n5 = n6 + n0 * 2 + 1 ;
    n6 = n7 + n1 * 4 + 59;
    n7 = n8 + n2 * 1 + 92;
    n8 = n0 + n3 * 4 + 64;
}

The following binary is produced by C0. Note that there are no memory operations at all.

  4004d0:       49 89 ff                mov    %rdi,%r15
  4004d3:       4c 89 f3                mov    %r14,%rbx
  4004d6:       49 8d 0c df             lea    (%r15,%rbx,8),%rcx
  4004da:       4d 8d 2c f0             lea    (%r8,%rsi,8),%r13
  4004de:       49 8d 7c f0 5f          lea    0x5f(%r8,%rsi,8),%rdi
  4004e3:       4d 8d 44 13 37          lea    0x37(%r11,%rdx,1),%r8
  4004e8:       4c 89 dd                mov    %r11,%rbp
  4004eb:       48 01 d5                add    %rdx,%rbp
  4004ee:       4e 8d 14 63             lea    (%rbx,%r12,2),%r10
  4004f2:       4e 8d 5c 63 5a          lea    0x5a(%rbx,%r12,2),%r11
  4004f7:       4e 8d 74 0e 3a          lea    0x3a(%rsi,%r9,1),%r14
  4004fc:       48 8d 74 4a 5d          lea    0x5d(%rdx,%rcx,2),%rsi
  400501:       4b 8d 94 ac b7 01 00    lea    0x1b7(%r12,%r13,4),%rdx
  400508:       00
  400509:       4d 8d a4 29 93 00 00    lea    0x93(%r9,%rbp,1),%r12
  400510:       00
  400511:       4e 8d 8c 91 d6 01 00    lea    0x1d6(%rcx,%r10,4),%r9
  400518:       00
  400519:       ff c8                   dec    %eax
  40051b:       75 b3                   jne    4004d0 <compute+0x40>

The following is produced by C1. Note that r14 is absent, because the compiler doesn’t know about the existence of r14. There are two memory operations: one store at 4004e3, and one load at 400516.

  4004d0:       49 89 ec                mov    %rbp,%r12
  4004d3:       4c 89 fb                mov    %r15,%rbx
  4004d6:       49 8d 0c dc             lea    (%r12,%rbx,8),%rcx
  4004da:       49 8d 2c f0             lea    (%r8,%rsi,8),%rbp
  4004de:       49 8d 7c f0 5f          lea    0x5f(%r8,%rsi,8),%rdi
  4004e3:       48 89 7c 24 f8          mov    %rdi,-0x8(%rsp)
  4004e8:       4d 8d 44 12 37          lea    0x37(%r10,%rdx,1),%r8
  4004ed:       4d 89 d3                mov    %r10,%r11
  4004f0:       49 01 d3                add    %rdx,%r11
  4004f3:       4a 8d 3c 6b             lea    (%rbx,%r13,2),%rdi
  4004f7:       4e 8d 54 6b 5a          lea    0x5a(%rbx,%r13,2),%r10
  4004fc:       4e 8d 7c 0e 3a          lea    0x3a(%rsi,%r9,1),%r15
  400501:       48 8d 74 4a 5d          lea    0x5d(%rdx,%rcx,2),%rsi
  400506:       49 8d 94 ad b7 01 00    lea    0x1b7(%r13,%rbp,4),%rdx
  40050d:       00
  40050e:       4f 8d ac 19 93 00 00    lea    0x93(%r9,%r11,1),%r13
  400515:       00
  400516:       48 8b 6c 24 f8          mov    -0x8(%rsp),%rbp
  40051b:       4c 8d 8c b9 d6 01 00    lea    0x1d6(%rcx,%rdi,4),%r9
  400522:       00
  400523:       ff c8                   dec    %eax
  400525:       75 a9                   jne    4004d0 <compute+0x40>

To measure the performance, we use the SPEC CPU 2006 benchmark suite. Below is the result of the test cases.


Let’s look at the results.

  • For most cases, C0 is the fastest, except sjeng and libquantum. Assuming a bug-free compiler, there is no way to explain C0 being slower. My guess is experimental error, since the variance for those two cases is quite large.
  • C2 has the fewest registers available, and is the slowest in 6 out of 12 cases.
  • C1 is slower than C3 in most cases. This means a callee-saved register is more performance-critical than a caller-saved one.
  • The overall overhead for C1 is XXX, C2 is XXX, C3 is XXX. (TODO)

Modifying Android APKs that do self certificate checking


When I modify an Android app that is not mine, I use apktool to unpack the APK, modify the smali, use apktool to repack it, and finally sign it with my own key. This works for most apps. However, some apps explicitly check their certificate in code. The following code snippet illustrates such a check:

PackageManager pm = myContext.getPackageManager();
PackageInfo info = pm.getPackageInfo("", PackageManager.GET_SIGNATURES);
String expectedSig = "308203a5...";
if (!expectedSig.equals(info.signatures[0].toCharsString())) {
    // certificate mismatch: complain and quit
}

One way to fix this is to remove the “if” block, but it’s difficult to locate all such blocks, especially in smali. A more convenient way is to hook the PackageManager.getPackageInfo() API and change info.signatures to the original APK’s signature. The following code snippet illustrates such hooking:

PackageManager pm = myContext.getPackageManager();
PackageInfo info = PatchSignature.getPackageInfo(pm, "", PackageManager.GET_SIGNATURES);
public class PatchSignature {
  public static PackageInfo getPackageInfo (PackageManager pm, String packageName, int flags)
      throws PackageManager.NameNotFoundException {
    PackageInfo info = pm.getPackageInfo(packageName, flags);
    if ("".equals(packageName) &&
        info.signatures != null && info.signatures.length > 0 &&
        info.signatures[0] != null)
      info.signatures[0] = new Signature("308203b6...");
    return info;
  }
}

Three things need to be done here:

  1. Change all PackageManager.getPackageInfo() to PatchSignature.getPackageInfo().
  2. Add class PatchSignature.
  3. Find out the correct signature.

It is very easy to locate and replace all PackageManager.getPackageInfo() calls in the smali code. All we need to do is replace:

invoke-virtual {PARAM1, PARAM2, PARAM3}, Landroid/content/pm/PackageManager;->getPackageInfo(Ljava/lang/String;I)Landroid/content/pm/PackageInfo;

with:
invoke-static {PARAM1, PARAM2, PARAM3}, LPatchSignature;->getPackageInfo(Landroid/content/pm/PackageManager;Ljava/lang/String;I)Landroid/content/pm/PackageInfo;

The smali code for PatchSignature can easily be obtained by writing a simple app with the Java code and disassembling it. Now only one problem remains: how do we obtain the original certificate? I.e., what should we put in new Signature("...")?

Here’s the most convenient way I found:

openssl pkcs7 -inform DER -print_certs -in unpacked-app-using-apktool/original/META-INF/CERT.RSA | openssl x509 -inform PEM -outform DER | xxd -p
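As a sanity check on the hex: as far as I can tell, Signature.toCharsString() is just the lowercase hex encoding of the DER-encoded certificate bytes, which is why the strings above start with 3082 (a DER SEQUENCE header). A tiny sketch:

```python
import binascii

def to_chars_string(der_bytes):
    # Android's Signature.toCharsString() appears to be the lowercase hex
    # of the raw (DER-encoded) certificate bytes.
    return binascii.hexlify(der_bytes).decode()

# DER certificates begin with a SEQUENCE tag, hence the "3082..." prefix.
print(to_chars_string(b"\x30\x82\x03\xa5"))   # -> 308203a5
```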

Why is NUS-to-MyRepublic latency so high?


I have switched from StarHub cable to MyRepublic fibre. I noticed lag when SSHing from home to NUS, so I tried to figure it out.

Here is a comparison of the round trip time (RTT) from MyRepublic to various places I could think of. The measurement spans about one and a half days; 5 ping packets are sent to each target every 20 minutes. Sorry for the limited coloring of the plot. The legend is sorted from low latency at the top to high latency at the bottom, so singnet is the fastest and berkeley is the slowest. NUS, with an average RTT of 189 ms, ranks among the slowest three.


Here is a summary in the following table. I got the geographic locations from an IP lookup.

host         geo. location   RTT (ms)
singnet      Singapore       4.4
             Singapore       5.1
             Singapore       6.4
google dns   California      11.7
             Guangdong       42.4
             Beijing         81.4
             Shanghai        144.6
             Beijing         159.6
NUS          Singapore       188.8
             Massachusetts   191.3
berkeley     California      223.5

Then I ran traceroute from MyRepublic to NUS.

[myrepublic]$ traceroute
traceroute to (, 30 hops max, 60 byte packets
 1 (  0.212 ms  0.412 ms  0.495 ms
 2 (  2.737 ms  2.736 ms  2.910 ms
 3 (  6.822 ms  6.837 ms  6.771 ms
 4 (  4.172 ms  4.296 ms  4.154 ms
 5 (  4.809 ms  4.861 ms  4.764 ms
 6 (  189.093 ms  188.329 ms  188.602 ms
 7 (  177.459 ms  178.062 ms  178.173 ms
 8 (  178.442 ms  178.512 ms  178.423 ms
 9  * * *

The route seems OK in the sense that every intermediate host is in Singapore. It’s just the RTT that isn’t right. Could the return route be the problem? So I ran traceroute from NUS to MyRepublic. Note that is in MyRepublic.

[nus]$ traceroute
traceroute to (, 30 hops max, 60 byte packets
 1 (  0.339 ms  0.370 ms  0.458 ms
 2 (  0.315 ms  0.376 ms  0.437 ms
 3 (  1.216 ms  1.115 ms  1.114 ms
 4 (  1.231 ms  1.294 ms  1.142 ms
 5 (  1.234 ms  1.330 ms  1.580 ms
 6 (  1.763 ms  1.420 ms  1.421 ms
 7 (  1.934 ms  1.985 ms  2.174 ms
 8 (  2.553 ms  3.706 ms  4.776 ms
 9 (  4.627 ms  4.120 ms  4.155 ms
10 (  212.824 ms  212.802 ms  233.477 ms
11 (  194.115 ms  214.906 ms  194.099 ms
12 (  192.053 ms  191.509 ms  199.569 ms
13 (  191.899 ms  193.382 ms  193.344 ms
14 (  188.949 ms  190.592 ms  189.575 ms
15 (  204.286 ms  193.958 ms  197.660 ms
16  * (  205.614 ms  204.824 ms
17 (  191.739 ms  191.265 ms  193.485 ms
18 (  227.445 ms  188.560 ms  227.428 ms
19 (  240.193 ms  240.195 ms  240.204 ms
20  * * *
21 (  192.013 ms  193.590 ms  193.314 ms
22 (  203.320 ms  195.278 ms  191.638 ms

Notice that at hops 10-11, it goes from NUS to, which is in California. It then goes to, which is in Tokyo. Then it comes back to Singapore. Why it routes like that is beyond my knowledge.

Besides this, another interesting question is why google dns is so fast. The distance between Singapore and San Francisco is 13572.66 km. The round-trip time at the speed of light is 90.5 ms, so no one can do better than that. But google does 11.7 ms!
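For the record, the 90.5 ms figure comes out of this little calculation (and since light in fibre travels roughly 30% slower than in vacuum, the realistic floor is even higher):

```python
# Lower bound on the Singapore <-> San Francisco RTT at the speed of light.
distance_km = 13572.66        # great-circle distance, as above
c_km_per_s = 299792.458       # speed of light in vacuum

rtt_ms = 2 * distance_km / c_km_per_s * 1000
print(round(rtt_ms, 1))       # -> 90.5
```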

It turns out that is an anycast address, which means it routes to one of the Google DNS servers distributed across the world. Try a traceroute yourself!

Experiment on TCP Hole Punching


I recently needed to find a way to connect to a subversion server behind a NAT. I used to tunnel through an SSH server with a public IP. It worked perfectly, but recently I lost access to that server. So I wanted to try TCP hole punching.

It’s not hard to find related resources online. I followed the approach described in the paper “Peer-to-Peer Communication Across Network Address Translators”. The basic idea is to let both peers connect and listen on the same port. If the internet gateway sees an outgoing SYN packet to X, the gateway will allow subsequent packets from X. As a result, at least one of the SYN packets should punch through the NAT.

Before this, we need to know the external IP and port of both peers. Fortunately, most NAT implementations always map the same internal IP/port to the same external IP/port; this is known as “independent mapping”. Even better, most NATs will use the same external port as the internal port if it’s not occupied; this is known as “port preserving”. To learn the external IP/port, we can connect to a third server and let it tell us, just like STUN.
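To illustrate the third-server idea, here is a toy Python sketch (my own code, not from the paper): the helper simply reports back the source address:port it observed, which behind a NAT would be the external mapping. On loopback there is no NAT, so the reported port just equals the bound one.

```python
import socket
import threading

def helper(srv):
    # Rendezvous helper: tell the client the address:port we saw it from.
    conn, peer = srv.accept()
    conn.sendall(("%s:%d" % peer).encode())
    conn.close()

def discover(helper_addr, local_port):
    # Bind the port we plan to hole-punch from (SO_REUSEADDR so a later
    # socket can share it), then ask the helper how we look from outside.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", local_port))
    s.connect(helper_addr)
    seen = s.recv(64).decode()
    s.close()
    return seen

# Demo on loopback (no NAT here, so the mapping is the identity).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("", 0))
srv.listen(1)
threading.Thread(target=helper, args=(srv,), daemon=True).start()

mapping = discover(srv.getsockname(), 0)  # port 0: let the OS pick one
print(mapping)  # e.g.
```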

So I implemented the idea in Ford’s paper.

#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <netdb.h>

#define DIE(format,...) do {perror(NULL); printf(format, ##__VA_ARGS__); exit(1);} while(0)

int say_something (int sock)
{
	char buff[256];
	int len, flags;

	flags = fcntl(sock, F_GETFL);
	flags = flags & (~ O_NONBLOCK);
	if (fcntl(sock, F_SETFL, flags))
		DIE("fcntl() failed\n");

	snprintf(buff, sizeof(buff), "Hello. I'm %d", getpid());
	printf("sending %s\n", buff);
	if (send(sock, buff, strlen(buff) + 1, 0) != strlen(buff) + 1)
		DIE("send() failed\n");

	len = recv(sock, buff, sizeof(buff), 0);
	if (len <= 0)
		DIE("recv() failed\n");
	printf("received %s\n", buff);

	return 0;
}

// TODO address type, length...
int getaddr (struct sockaddr *addr, const char *host, const char *port)
{
	struct addrinfo hints, *res;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_protocol = 0;
	hints.ai_flags = AI_PASSIVE;

	if (getaddrinfo(host, port, &hints, &res))
		return -1;

	if (res == NULL)
		return -1;

	memcpy(addr, res->ai_addr, res->ai_addrlen);
	return 0;
}

int main (int argc, char *argv[])
{
	int ssock, csock;
	struct sockaddr_in local_addr, remote_addr;
	fd_set rfds, wfds;
	struct timeval tv;
	int i;
	socklen_t len;

	if (argc != 4) {
		printf("Usage: %s localport remotehost remoteport\n", argv[0]);
		return 1;
	}

	if (getaddr((struct sockaddr *)&local_addr, NULL, argv[1]))
		DIE("getaddr() failed\n");
	if (getaddr((struct sockaddr *)&remote_addr, argv[2], argv[3]))
		DIE("getaddr() failed\n");

	if ((ssock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0)
		DIE("socket() failed\n");
	if ((csock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0)
		DIE("socket() failed\n");

	i = 1;
	if (setsockopt(ssock, SOL_SOCKET, SO_REUSEADDR, &i, sizeof(int)))
		DIE("setsockopt() failed\n");
	if (setsockopt(csock, SOL_SOCKET, SO_REUSEADDR, &i, sizeof(i)))
		DIE("setsockopt() failed\n");

	if (bind(ssock, (const struct sockaddr *)&local_addr, sizeof(local_addr)))
		DIE("bind() failed\n");
	if (bind(csock, (const struct sockaddr *)&local_addr, sizeof(local_addr)))
		DIE("bind() failed\n");

	if (fork()) {
		/* parent: wait for the peer's SYN to reach our listening socket */
		if (listen(ssock, 1))
			DIE("listen() failed\n");
		while (1) {
			len = sizeof(remote_addr);
			i = accept(ssock, (struct sockaddr *)&remote_addr, &len);
			if (i < 0) {
				perror("accept() failed.");
			} else {
				printf("accept() succeed.");
				return say_something(i);
			}
		}
	} else {
		/* child: actively connect() to the peer, with retries */
		for (i = 0; i < 3; i ++) {
			if (connect(csock, (const struct sockaddr *)&remote_addr, sizeof(remote_addr))) {
				int sleeptime = random() * 1000000.0 / RAND_MAX + 1000000.0;
				sleeptime = sleeptime << i;
				perror("connect() failed");
				if (i < 2) {
					printf("sleeping for %.2f sec to retry\n", sleeptime / 1000000.0);
					usleep(sleeptime);
				}
			} else {
				printf("connect() succeed");
				return say_something(csock);
			}
		}
		return 1;
	}
}

It worked. host1 and host2 have external IP and respectively. Both NATs preserve ports, so if host1 binds on port 30000, the external port is also 30000.

host1$ ./biconn 30000 20000
connect() failed: Connection timed out
sleeping for 1.13 sec to retry
connect() succeed: Connection timed out
sending Hello. I'm 8151
received Hello. I'm 6629
host2$ ./biconn 20000 30000
connect() failed: Connection refused
sleeping for 1.68 sec to retry
connect() succeed: Connection refused
sending Hello. I'm 6629
received Hello. I'm 8151

I noticed an unexpected behaviour: accept() never succeeded in either peer, while connect() succeeded in both.

Is it possible for two peers to symmetrically “connect()” to each other? This question is not related to NAT, and the answer is yes. Find any computer networks textbook and look for the TCP state diagram: it is possible to go from the SYN_SENT state to the SYN_RECV state by receiving a SYN packet. This is known as TCP simultaneous open. Someone has asked the question before.

So I wondered if I could remove the listen() part of the code and use only one socket in each peer. A problem with the previous approach (as mentioned here) is that it’s not possible to bind additional sockets on the port after listen().

So I did the second experiment. It’s much cleaner.

#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/select.h>
#include <netinet/in.h>

void die (const char *msg)
{
	perror(msg);
	exit(1);
}

int main (int argc, char *argv[])
{
	int sock;
	struct sockaddr_in addr;
	char buff[256];

	if (argc != 4) {
		printf("Usage: %s localport remotehost remoteport\n", argv[0]);
		return 1;
	}

	sock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (sock < 0)
		die("socket() failed");

	/* bind to the fixed local port */
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(atoi(argv[1]));
	if (bind(sock, (const struct sockaddr *)&addr, sizeof(addr)))
		die("bind() failed");

	/* connect the same socket to the remote peer, retrying until
	   the two SYNs cross and the handshake completes */
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = inet_addr(argv[2]);
	addr.sin_port = htons(atoi(argv[3]));

	while (connect(sock, (const struct sockaddr *)&addr, sizeof(addr))) {
		if (errno != ETIMEDOUT) {
			perror("connect() failed. retry in 2 sec.");
			sleep(2);
		} else {
			/* the timed-out attempt already waited long enough */
			perror("connect() failed.");
		}
	}

	snprintf(buff, sizeof(buff), "Hi, I'm %d.", getpid());
	printf("sending \"%s\"\n", buff);
	if (send(sock, buff, strlen(buff) + 1, 0) != strlen(buff) + 1)
		die("send() failed.");

	if (recv(sock, buff, sizeof(buff), 0) <= 0)
		die("recv() failed.");
	printf("received \"%s\"\n", buff);

	return 0;
}

It works. I wonder what the reason for doing listen() is. Is it related to the way connection tracking is implemented in different types of NAT? Or to the way TCP is implemented in different OSes?

host1$ ./biconn1 20000 30000
connect() failed. retry in 2 sec.: Connection refused
sending "Hi, I'm 6566."
received "Hi, I'm 7600."
host2$ ./biconn1 30000 20000
connect() failed. retry in 2 sec.: Connection refused
connect() failed.: Connection timed out
connect() failed.: Connection timed out
sending "Hi, I'm 7600."
received "Hi, I'm 6566."

My objective is to connect to my subversion server behind a NAT. Now, I still need a publicly accessible server to coordinate the hole punching. It basically works like this: on the subversion server I run a program with a persistent connection to the public server. When I want to connect from outside, I contact the public server, which then notifies my program on the subversion server. Then I can launch the TCP hole punching and get a TCP connection, which can then be used to tunnel the subversion connection.

Without a publicly accessible server, other mechanisms can be used. I can think of the following:

  • Online forum: Post the client’s external IP/port in a forum and have a program running in the subversion server to periodically check the forum.
  • DHT, e.g. the mainline bittorrent DHT: The server randomly generates an infohash and “announces” itself as downloading it. The server then periodically queries for peers on the infohash. To do hole punching, the client also announces itself as downloading it. When the server sees a new peer joining, both parties can start hole punching. The limitation is that the two peers cannot exchange port information, so they need to agree on a particular port in advance.
  • IRC bot
  • Public SIP registrar: It’s a bit of an overkill, but quite related, and well supported (plenty of public servers and libraries).

I’m not sure if there is any existing tool for this purpose. Until IPv6 is well established, there are going to be more and more servers behind NAT, so such a tool is going to be handy. Please leave a comment if you know of any.

Strange problem of Singapore ICA’s SAVE system


I was trying to apply for a visa yesterday. After I clicked the “Proceed to submit” button, I always got a blank page. I cleared cookies and cache; same problem. I thought it was because too many people were applying for visas. I tried after midnight; still the same problem. I tried the school’s computer today; also the same. I tried Firefox, which failed at an even earlier step.

I tried my laptop and it succeeded. I figured out that the failure was probably caused by adding the site to the trusted sites list. The web system requires turning off the popup blocker, and I did that, but in addition I added both sites as trusted sites. This unnecessary step turned out to cause the failure. My guess is that when one site is trusted while the SingPass site is not, they fall into different security zones, and there is a problem in their handshake.

I was a bit curious about why Firefox failed, so I checked the error console. The reason turned out to be the use of location.href(newurl), which Firefox treats as a property, not a method. If I manually type the URL into the address bar, I can at least get to the SingPass login page, which is one step further than with IE’s problem.

How to steal iPhone ringtone from iTunes shop?


In the iTunes Store, all music and ringtones have a 30-second preview. Ringtones are always shorter than 30 seconds, which means ringtone previews are always full length. If we can hear it, we can download it (for free, of course). This article shares how to download ringtones and add them to your iPhone for free. It works for all ringtones in the iTunes Store. The main intention is to use this as an example to illustrate common practices in network hacking.

First, get a network log of the iPhone’s ringtone preview traffic. To do this, set up a sniffable wifi environment. This might be difficult for some people, but I have an existing setup: all my internet traffic goes through my linux router, so I simply run tcpdump there and use wireshark to analyze the saved dump file.

When I play a ringtone preview, I see this request: (Luckily it’s not https. If it were, I would have to try a self-signed certificate and see whether it passes the check.) I quickly did a direct wget and got a 403 Forbidden error. My first suspect was the user agent. I changed the UA to the iPhone’s, and it succeeded:

$ wget -U 'Apple iPhone OS v3.1.3 CoreMedia v1.0.0.7E18' ''
--2010-04-09 01:53:43--
Connecting to||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 493781 (482K) [text/plain]
Saving to: “mzi.wphahwgb.aac.p.m4p”

100%[=============================================================>] 493,781 256K/s in 1.9s

2010-04-09 01:53:45 (256 KB/s) - “mzi.wphahwgb.aac.p.m4p” saved [493781/493781]

Feed the m4p to a media player. It plays.

Up to this point, we can already download the ringtone: just follow the previous steps and get the m4p URL. It’s not very convenient, though, as we have to use an iPhone and sniff to get the URL. I want to get rid of the iPhone step: I want a script that downloads a ringtone given its name, or a script that downloads the top 100 ringtones of a given genre.

For the top-100 script, I found the URL of the top-ringtone listing by genre to be something like However, the viewTop page requires sign-in. There are two ways to deal with sign-in. The hardworking way is to figure out the sign-in protocol and implement it; usually this means posting the user ID and password and getting a session ID. The dirty way is to sniff and reuse the session ID, but then we can only use the session before it expires. I’m not going into details about this. Below is the wget command to download the page. It’s a bit long because of those special X-Apple-* headers. You can’t use it as-is because 1. the session has expired; and 2. I have modified some of the IDs for my privacy. The page is an XML document containing titles, artists, purchasing information, the preview URL (the most important field for us), user ratings, etc.

wget -O - --header 'X-Apple-Store-Front: 143441-1,2' --header 'X-Apple-Partner: origin.0' --header 'X-Apple-Connection-Type: WiFi' --header 'X-Apple-Cuid: 068c5db16ca2b6956f7d582690613b68' --header 'X-Apple-Software-Cuid: 6a26ef98bfc6b1ef6f00694e61735a64' --header 'X-Dsid: 1369530585' --header 'X-Apple-Client-Application: WiFi-Music' --header 'Cookie: mz_at0=xQQUAABxlwAABABLsXbOCow79QEJcf6OqeR9C9ya+U87hxY=; mzf_in=180805; X-Dsid=1369530585; a=A2dAjgAAABtjAlRWMEsHtXFZZzAvdWlodAFxSTs5Ak1yOTjaSG9lYmtLdWcjKgsQAAdAJ5BiPTt=; Pod=18; s_cvp35b=%5B%5B%27google%253A%2520organic%27%2C%271369276708021%27%5D%2C%5B%27192.168.0.1%253A8000%27%2C%274278433477967%27%5D%5D; s_vi=[CS]v1|25C941528801054F-70001710E0178F3F[CE]; s_vnum_sg=ch%3Dip%26vn%3D1%3B; s_vnum_us=ch%3Dlegal%26vn%3D1%3Bch%3Dwebapps%26vn%3D3%3Bch%3Dip%26vn%3D2%3Bch%3Ddeveloper%26vn%3D1%3B' -U 'iTunes-iPhone/3.1.3 (2)' ''

There are many articles teaching how to add ringtones to an iPhone. I briefly describe the steps here.

  1. Rename the file to .m4r.
  2. Import (drag) it into iTunes. It should appear under the Ringtones section in iTunes. Note: don’t manually manage music and ringtones; add to iTunes and sync instead. I tried the first way and failed miserably. I hate iTunes.
  3. You may want to change the metadata. Alternatively, before importing into iTunes, you can use open-source tools like mp4tags from libmp4v2 to change the metadata. I prefer mp4tags because it works on the command line, so I can run it in batch.
  4. Sync
  5. You should see the new ringtone on your iPhone.

So, is it possible for Apple to prevent this? I can think of a few solutions, but none of them works well.

  1. Do not provide previews. Customers won’t be happy.
  2. Add noise to the preview, or shorten it to 10 seconds. Customers won’t be so happy.
  3. Use https or a custom protocol. “If we can hear it, we can download it.” This only makes hackers take longer. But, hey, hackers are the group of people with the least money and the most time.

Convert videos from iPhone using ffmpeg on Fedora 11


The ffmpeg in Fedora 11 isn’t built with the faac library, which encodes AAC, so I need to build my own ffmpeg from source.

  1. yum install lame-devel xvidcore-devel x264-devel faad2-devel faac-devel gsm-devel dirac-devel libogg-devel libtheora-devel speex-devel libvorbis-devel openjpeg-devel liboil-devel schroedinger-devel libraw1394-devel libdc1394-devel bzip2-devel alsa-lib-devel xorg-x11-proto-devel libXau-devel libxcb-devel libXdmcp-devel libX11-devel libvdpau-devel libXext-devel libXv-devel libXvMC-devel
    Some packages are in rpmfusion. You know what you need to do.
  2. Download the ffmpeg source and extract it. I downloaded the latest version, 0.5.1.
  3. ./configure --arch=pentium4 --enable-bzlib --enable-libdc1394 --enable-libdirac --enable-libfaad --enable-libgsm --enable-libmp3lame --enable-libopenjpeg --enable-libschroedinger --enable-libspeex --enable-libtheora --enable-libvorbis --enable-libx264 --enable-libxvid --enable-vdpau --enable-x11grab --enable-avfilter --enable-avfilter-lavf --enable-postproc --enable-swscale --enable-pthreads --enable-gpl --disable-stripping --cpu=pentium4 --enable-nonfree --enable-libfaac --prefix=/home/atp/install/ffmpeg-0.5.1
    I followed the configuration of the ffmpeg package from rpmfusion. The only changes I made are:

    • --enable-nonfree --enable-libfaac
    • --prefix=/home/atp/install/ffmpeg-0.5.1 (I never install my builds as root.)
    • changed i586 to pentium4 and removed some gcc options I don’t understand
    • removed --disable-mmx2 --disable-sse --disable-ssse3 --disable-yasm
    • changed to a static build
  4. make
    make install

to be continued…

Noise problem with iTunes optimization


I noticed severe image noise after I transferred my 320×480 photos to my iPhone. This probably has to do with the so-called “optimization” done by iTunes.

Below is my original image:

Original Image

Original Image

Below is the “processed” image. (How did I get it? Select the image on the iPhone and send it by email.)

Processed Image

Processed Image

Notice the added noise and slightly increased saturation. I tried googling for a way to disable the processing. No luck.

I ran Process Monitor on iTunes and found that the optimization is done by iTunesPhotoProcessor.exe. The processed image is saved into a .ithmb file. After a couple of hours, I still couldn’t figure out a way to prevent the optimization.

Here is another attempt. Below is a comparison of the two images’ JPEG header information:
ExifTool Version Number : 8.00
File Name : original.jpg
Directory : .
File Size : 41 kB
File Modification Date/Time : 2010:03:09 23:16:14+08:00
File Type : JPEG
MIME Type : image/jpeg
JFIF Version : 1.02
Resolution Unit : None
X Resolution : 100
Y Resolution : 100
Quality : 80%
DCT Encode Version : 100
APP14 Flags 0 : [14], Encoded with Blend=1 downsampling
APP14 Flags 1 : (none)
Color Transform : YCbCr
Image Width : 320
Image Height : 480
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:4:4 (1 1)
Image Size : 320x480

ExifTool Version Number : 8.00
File Name : processed.jpg
Directory : .
File Size : 55 kB
File Modification Date/Time : 2010:03:09 08:25:10+08:00
File Type : JPEG
MIME Type : image/jpeg
JFIF Version : 1.01
Resolution Unit : None
X Resolution : 1
Y Resolution : 1
Image Width : 320
Image Height : 480
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Image Size : 320x480

The most suspicious differences are the X/Y resolution and the YCbCr subsampling. Could any of these be the culprit? For example, I could generate an image with the same parameters as the processed image and hope that iTunes skips the processing. I haven’t tried this method yet…