问题描述:

I have two files one looks like this,

chrom start end ref alt alt_freq_CR alt_freq_ID gene gene_type

chr10 28814618 28814618 T C 0 0.4167 AAA sense

chr10 32557818 32557818 C T 0 0.1579 BBB PC

chr10 119245101 119245101 T C 0 0.1667 ZZZ rega

chr10 119245101 119245101 T C 0 0.1667 CCC sense

file 2

 chrom start end ref alt alt_freq_CR alt_freq_ID gene gene_type

chr10 28814618 28814618 T C 0 0.4167 AAA sense

chr10 32557 32557 C T 0 0.1579 BBB PC

chr10 119245101 119245101 T C 0 0.1667 DDD rega

chr10 119245101 119245101 T C 0 0.1667 EEE sense

I need a output file, if it satisfies following conditions as,

if the column (8th column) gene from file_1 and gene (8th column) from file_2 are matching then it it should print out as following,

chrom start end ref alt alt_freq_CR alt_freq_ID gene gene_type chrom start end ref alt alt_freq_CR alt_freq_ID

chr10 28814618 28814618 T C 0 0.4167 AAA sense chr10 28814618 28814618 T C 0 0.4167

chr10 32557818 32557818 C T 0 0.1579 BBB PC chr10 32557 32557 C T 0 0.1579

I tried using awk for this as following but it didn't print all the matching rows in output..

awk 'FNR==NR{a[$16]=$0;next}{if(b=a[$16]){print b}}' file_1 file_2

It would be really great if there is a solution in python script, Also The two files are of different size , interms of number of rows are different .

网友答案:

How about this:

index = df1.icol(7) == df2.icol(7)
df = pd.concat((df1[index], df2[index]), axis = 1)
网友答案:

You want to use pandas.merge. Read your files into Pandas with pandas.read_csv and join on the 'gene' column. Here's the solution to your example.

In [9]: df1
Out[9]:
   chrom      start        end ref alt  alt_freq_CR  alt_freq_ID gene  \
0  chr10   28814618   28814618   T   C            0       0.4167  AAA
1  chr10   32557818   32557818   C   T            0       0.1579  BBB
2  chr10  119245101  119245101   T   C            0       0.1667  ZZZ
3  chr10  119245101  119245101   T   C            0       0.1667  CCC

  gene_type
0     sense
1        PC
2      rega
3     sense

In [10]: df2
Out[10]:
   chrom      start        end ref alt  alt_freq_CR  alt_freq_ID gene  \
0  chr10   28814618   28814618   T   C            0       0.4167  AAA
1  chr10      32557      32557   C   T            0       0.1579  BBB
2  chr10  119245101  119245101   T   C            0       0.1667  DDD
3  chr10  119245101  119245101   T   C            0       0.1667  EEE

  gene_type
0     sense
1        PC
2      rega
3     sense

In [11]: pd.merge(df1, df2, on='gene')
Out[11]:
  chrom_x   start_x     end_x ref_x alt_x  alt_freq_CR_x  alt_freq_ID_x gene  \
0   chr10  28814618  28814618     T     C              0         0.4167  AAA
1   chr10  32557818  32557818     C     T              0         0.1579  BBB

  gene_type_x chrom_y   start_y     end_y ref_y alt_y  alt_freq_CR_y  \
0       sense   chr10  28814618  28814618     T     C              0
1          PC   chr10     32557     32557     C     T              0

   alt_freq_ID_y gene_type_y
0         0.4167       sense
1         0.1579          PC

As you can see, this operation adds a suffix to columns that share the same name between the two DataFrames. A suffix is required, but you can change the suffix with the suffixes keyword argument:

In [14]: pd.merge(df1, df2, on='gene', suffixes=['_df1', '_df2'])
Out[14]:
  chrom_df1  start_df1   end_df1 ref_df1 alt_df1  alt_freq_CR_df1  \
0     chr10   28814618  28814618       T       C                0
1     chr10   32557818  32557818       C       T                0

   alt_freq_ID_df1 gene gene_type_df1 chrom_df2  start_df2   end_df2 ref_df2  \
0           0.4167  AAA         sense     chr10   28814618  28814618       T
1           0.1579  BBB            PC     chr10      32557     32557       C

  alt_df2  alt_freq_CR_df2  alt_freq_ID_df2 gene_type_df2
0       C                0           0.4167         sense
1       T                0           0.1579            PC

Pandas has an exhaustive guide to merging in its documentation.

相关阅读:
Top